# Research: Backup/Restore Job Orchestration (049)
This document resolves Phase 0 open questions and records design choices.
## Decisions
### 1) Run Record storage strategy
**Decision:** Reuse existing run-record primitives instead of introducing a brand-new “unified run” subsystem in Phase 1.
- Restore + re-run restore + dry-run/preview: use the existing `restore_runs` table / `App\Models\RestoreRun`.
- Backup set capture-like operations (e.g., “add policies and capture”): reuse `bulk_operation_runs` / `App\Models\BulkOperationRun` (already used for long-running background work like bulk exports) and (if needed) extend it to satisfy FR-002 fields.
**Rationale:**
- The codebase already has multiple proven “run tables” (`restore_runs`, `inventory_sync_runs`, `backup_schedule_runs`, `bulk_operation_runs`).
- Minimizes migration risk and avoids broad refactors.
- Lets Phase 1 focus on eliminating inline heavy work while keeping UX consistent.
**Alternatives considered:**
- **Create a new generic `operation_runs` + `operation_run_items` data model** for all queued automation.
  - Rejected (Phase 1): higher migration + backfill cost; high coordination risk across many features.
### 2) Status lifecycle mapping
**Decision:** Standardize at the *UI + plan* level on `queued → running → (succeeded | failed | partial)` while allowing underlying storage to keep its existing status vocabulary.
- `BulkOperationRun.status` mapping: `pending→queued`, `running→running`, `completed→succeeded`, `completed_with_errors→partial`, `failed/aborted→failed`.
- `RestoreRun.status` will be mapped the same way (e.g., `pending→queued`, `running→running`) as part of implementation.
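
The `BulkOperationRun.status` mapping above can be expressed as a plain lookup. This is a minimal, framework-neutral sketch in Python rather than the application's actual code; the names `BULK_OPERATION_STATUS_MAP`, `to_ui_status`, and `is_terminal` are hypothetical, and raising on unknown values is an assumption about how unmapped statuses should be handled.

```python
# Storage-level status -> standardized UI/plan status.
# Entries mirror the BulkOperationRun mapping listed in this section.
BULK_OPERATION_STATUS_MAP = {
    "pending": "queued",
    "running": "running",
    "completed": "succeeded",
    "completed_with_errors": "partial",
    "failed": "failed",
    "aborted": "failed",
}

# Terminal states of the standardized lifecycle.
TERMINAL_STATUSES = {"succeeded", "failed", "partial"}


def to_ui_status(storage_status: str) -> str:
    """Translate a stored status into the standardized lifecycle vocabulary."""
    try:
        return BULK_OPERATION_STATUS_MAP[storage_status]
    except KeyError:
        # Assumption: fail loudly instead of guessing a bucket for unknown values.
        raise ValueError(f"unmapped run status: {storage_status!r}") from None


def is_terminal(storage_status: str) -> bool:
    """True once the run has reached succeeded, failed, or partial."""
    return to_ui_status(storage_status) in TERMINAL_STATUSES
```

Keeping the translation in one place means the underlying tables keep their vocabulary while the UI and plan only ever see the standardized one.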
**Rationale:**
- Keeps the spec's lifecycle consistent without forcing an immediate cross-table refactor.
**Alternatives considered:**
- **Rename and normalize all run statuses across all run tables.**
  - Rejected (Phase 1): touches many workflows and tests.
### 3) Idempotency & de-duplication
**Decision:** Enforce de-duplication for *active* runs via a deterministic key and a DB query gate, with an optional lock for race reduction.
- Dedupe key format: `tenant_id + operation_type + target_object_id` (plus a stable hash of relevant payload if needed).
- Behavior: if an identical run is `queued`/`running`, reuse it and return/link to it; allow a new run only after the existing one reaches a terminal status.
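
The key format and reuse behavior can be sketched as follows. This is an illustration under stated assumptions, not the implementation: an in-memory list stands in for the DB query gate, the `:` separator and SHA-256 payload hash are arbitrary choices, and `Run`, `dedupe_key`, and `find_or_create_run` are hypothetical names. The optional lock is omitted.

```python
import hashlib
import json
from dataclasses import dataclass

# A run in one of these states blocks creation of an identical run.
ACTIVE_STATUSES = {"queued", "running"}


@dataclass
class Run:
    dedupe_key: str
    status: str


def dedupe_key(tenant_id, operation_type, target_object_id, payload=None) -> str:
    """Build tenant_id + operation_type + target_object_id (+ payload hash)."""
    parts = [str(tenant_id), str(operation_type), str(target_object_id)]
    if payload is not None:
        # sort_keys makes the hash independent of dict key order.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        parts.append(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    return ":".join(parts)


def find_or_create_run(runs: list, key: str):
    """Return (run, created). Reuse an active run with the same key if one exists."""
    for run in runs:
        if run.dedupe_key == key and run.status in ACTIVE_STATUSES:
            return run, False
    run = Run(dedupe_key=key, status="queued")
    runs.append(run)
    return run, True
```

In a real database the lookup and insert are separate steps, so two concurrent requests could both pass the gate; that window is what the optional lock is meant to narrow.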
**Rationale:**
- Matches the constitution (“Operations / Run Observability Standard”) and aligns with existing patterns (inventory selection hash + schedule locks).
**Alternatives considered:**
- **Cache-only locks** (`Cache::lock(...)`) without persisted keys.
  - Rejected: harder to reason about after restarts; less observable.
### 4) Restore preview must be asynchronous
**Decision:** Move restore preview generation (“Generate preview” in the wizard) into a queued job which persists preview outputs to the run record.
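
The shape of this flow, sketched with Python's standard library as a stand-in for the application's queue: the request path only records the run and enqueues work, and a worker later persists the preview (or the error) on the run record. The `runs` dict, `jobs` queue, `request_preview`, and `worker` are all hypothetical names for illustration.

```python
import queue

runs = {}             # run_id -> run record (stand-in for the run table)
jobs = queue.Queue()  # stand-in for the application's job queue


def request_preview(run_id, generate):
    """Interactive request path: record the run, enqueue, return immediately."""
    runs[run_id] = {"status": "queued", "preview": None, "error": None}
    jobs.put((run_id, generate))
    return run_id


def worker():
    """Background path: execute queued previews and persist their outcome."""
    while True:
        item = jobs.get()
        if item is None:  # sentinel: stop the worker
            jobs.task_done()
            break
        run_id, generate = item
        run = runs[run_id]
        run["status"] = "running"
        try:
            run["preview"] = generate()  # the slow part (e.g., remote calls)
            run["status"] = "succeeded"
        except Exception as exc:
            run["error"] = str(exc)
            run["status"] = "failed"
        finally:
            jobs.task_done()
```

Because the output lives on the run record, the wizard can render the preview from stored state whenever the user returns, instead of holding a request open.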
**Rationale:**
- Preview can require Graph calls and normalization work; it should never block an interactive request.
**Alternatives considered:**
- **Keep preview synchronous** and increase timeouts.
  - Rejected: risks timeouts, degrades UX, and violates FR-001.
### 5) Notifications for progress visibility
**Decision:** Use DB notifications for state transitions (queued/running/terminal) and keep a Run detail view as the primary progress surface in Phase 1.
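
A minimal sketch of emitting one notification per state transition, assuming a list as a stand-in for the notifications table. The record fields and the `transition` helper are hypothetical; the status names follow the standardized lifecycle from decision 2.

```python
LIFECYCLE = {"queued", "running", "succeeded", "failed", "partial"}
TERMINAL = {"succeeded", "failed", "partial"}


def transition(run: dict, new_status: str, notifications: list) -> bool:
    """Move a run to new_status and record a notification if the status changed."""
    if new_status not in LIFECYCLE:
        raise ValueError(f"unknown status: {new_status!r}")
    if run["status"] == new_status:
        return False  # no change, so no duplicate notification
    run["status"] = new_status
    notifications.append(
        {
            "run_id": run["id"],
            "status": new_status,
            "terminal": new_status in TERMINAL,
        }
    )
    return True
```

Persisted notifications give the user a record of each transition that does not depend on an open page, while the Run detail view reads the run's current status directly.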
**Rationale:**
- Inventory sync + backup schedule runs already use this pattern.
- Survives page reloads and doesn't require the user to keep the page open.
**Alternatives considered:**
- **Frontend polling only** (no DB notifications).
  - Rejected: weaker UX and weaker observability.
## Clarifications resolved
- **SC-003 includes “canceled”** while Phase 1 explicitly has “no cancel”.
  - Resolution for Phase 1 planning: treat “canceled” as out-of-scope (Phase 2+) and map “aborted” (if present) into the `failed` bucket for SC accounting.