# Research: Backup/Restore Job Orchestration (049)

This document resolves Phase 0 open questions and records design choices.

## Decisions

### 1) Run Record storage strategy

**Decision:** Reuse existing run-record primitives instead of introducing a brand-new “unified run” subsystem in Phase 1.

- Restore + re-run restore + dry-run/preview: use the existing `restore_runs` table / `App\Models\RestoreRun`.
- Backup set capture-like operations (e.g., “add policies and capture”): reuse `bulk_operation_runs` / `App\Models\BulkOperationRun` (already used for long-running background work like bulk exports) and, if needed, extend it to satisfy the FR-002 fields.

**Rationale:**

- The codebase already has multiple proven “run tables” (`restore_runs`, `inventory_sync_runs`, `backup_schedule_runs`, `bulk_operation_runs`).
- Minimizes migration risk and avoids broad refactors.
- Lets Phase 1 focus on eliminating inline heavy work while keeping the UX consistent.

**Alternatives considered:**

- **Create a new generic `operation_runs` + `operation_run_items` data model** for all queued automation.
  - Rejected (Phase 1): higher migration + backfill cost; high coordination risk across many features.

### 2) Status lifecycle mapping

**Decision:** Standardize at the *UI + plan* level on `queued → running → (succeeded | failed | partial)` while allowing the underlying storage to keep its existing status vocabulary.

- `BulkOperationRun.status` mapping: `pending→queued`, `running→running`, `completed→succeeded`, `completed_with_errors→partial`, `failed/aborted→failed`.
- `RestoreRun.status` mapping will be aligned (e.g., `pending→queued`, `running→running`, etc.) as part of implementation.

**Rationale:**

- Keeps the spec’s lifecycle consistent without forcing an immediate cross-table refactor.

**Alternatives considered:**

- **Rename and normalize all run statuses across all run tables.**
  - Rejected (Phase 1): touches many workflows and tests.
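The status mapping above is mechanical and is worth pinning down precisely. The following is a minimal, language-agnostic sketch (the real implementation would live in the PHP/Laravel codebase; the function name `to_ui_status` and the lookup-table shape are illustrative assumptions, while the status values themselves come from the mapping listed above):

```python
# Stored BulkOperationRun.status values -> spec-level lifecycle vocabulary,
# per the mapping in "2) Status lifecycle mapping".
UI_STATUS_BY_STORED = {
    "pending": "queued",
    "running": "running",
    "completed": "succeeded",
    "completed_with_errors": "partial",
    "failed": "failed",
    "aborted": "failed",
}


def to_ui_status(stored_status: str) -> str:
    """Map a stored run status to the UI/plan-level lifecycle.

    Failing loudly on unknown statuses keeps the mapping exhaustive as
    new stored statuses are introduced.
    """
    try:
        return UI_STATUS_BY_STORED[stored_status]
    except KeyError:
        raise ValueError(f"unknown stored status: {stored_status!r}")
```

Because the mapping is many-to-one (`failed` and `aborted` both collapse into `failed`), the reverse direction is intentionally not defined; the stored vocabulary remains the source of truth.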
### 3) Idempotency & de-duplication

**Decision:** Enforce de-duplication for *active* runs via a deterministic key and a DB query gate, with an optional lock for race reduction.

- Dedupe key format: `tenant_id + operation_type + target_object_id` (plus a stable hash of the relevant payload if needed).
- Behavior: if an identical run is `queued`/`running`, reuse it and return/link to it; allow a new run only after the existing run reaches a terminal state.

**Rationale:**

- Matches the constitution (“Automation must be Idempotent & Observable”) and aligns with existing patterns (inventory selection hash + schedule locks).

**Alternatives considered:**

- **Cache-only locks** (`Cache::lock(...)`) without persisted keys.
  - Rejected: harder to reason about after restarts; less observable.

### 4) Restore preview must be asynchronous

**Decision:** Move restore preview generation (“Generate preview” in the wizard) into a queued job that persists preview outputs to the run record.

**Rationale:**

- Preview can require Graph calls and normalization work; it should never block an interactive request.

**Alternatives considered:**

- **Keep preview synchronous** and increase timeouts.
  - Rejected: timeouts, poor UX, and a violation of FR-001.

### 5) Notifications for progress visibility

**Decision:** Use DB notifications for state transitions (queued/running/terminal) and keep a Run detail view as the primary progress surface in Phase 1.

**Rationale:**

- Inventory sync + backup schedule runs already use this pattern.
- Survives page reloads and doesn’t require the user to keep the page open.

**Alternatives considered:**

- **Frontend polling only** (no DB notifications).
  - Rejected: weaker UX and weaker observability.

## Clarifications resolved

- **SC-003 includes “canceled”** while Phase 1 explicitly has “no cancel”.
  - Resolution for Phase 1 planning: treat “canceled” as out-of-scope (Phase 2+) and map “aborted” (if present) into the `failed` bucket for SC accounting.
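The dedupe key and query gate from decision 3 can be sketched as follows. This is a language-agnostic Python sketch, not the Laravel implementation (which would use an Eloquent query plus an optional `Cache::lock`); the function names `dedupe_key` and `find_or_create_run`, the `:` separator, and the in-memory run list are all illustrative assumptions. The key components (`tenant_id + operation_type + target_object_id` plus a stable payload hash) and the active-status gate (`queued`/`running`) come from the decision above:

```python
import hashlib
import json


def dedupe_key(tenant_id, operation_type, target_object_id, payload=None):
    """Build the deterministic de-duplication key.

    When a payload is supplied, it is serialized with sorted keys so that
    semantically identical payloads always produce the same hash.
    """
    parts = [tenant_id, operation_type, target_object_id]
    if payload is not None:
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        parts.append(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    return ":".join(parts)


# Only non-terminal runs block a duplicate; terminal runs do not.
ACTIVE_STATUSES = {"queued", "running"}


def find_or_create_run(existing_runs, key):
    """DB-query gate sketch: reuse an identical active run, else create one.

    `existing_runs` stands in for the run table; each run is a dict with
    `dedupe_key` and `status` fields.
    """
    for run in existing_runs:
        if run["dedupe_key"] == key and run["status"] in ACTIVE_STATUSES:
            return run  # link the caller to the in-flight run
    new_run = {"dedupe_key": key, "status": "queued"}
    existing_runs.append(new_run)
    return new_run
```

A plain query gate like this has a small check-then-insert race window, which is why the decision pairs it with an optional lock; a unique index on the dedupe key for active runs would close the window at the database level.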