## Summary

This PR implements Spec 049 – Backup/Restore Job Orchestration: all critical Backup/Restore execution paths are job-only, idempotent, tenant-scoped, and observable via run records + DB notifications (Phase 1). The UI no longer performs heavy Graph work inside request/Filament actions for these flows.

## Why

We want predictable UX and operations at MSP scale:

- no timeouts / long-running requests
- reproducible run state + per-item results
- safe error persistence (no secrets / no token leakage)
- strict tenant isolation + auditability for write paths

## What changed

### Foundational (Runs + Idempotency + Observability)

- Added a shared RunIdempotency helper (dedupe while queued/running).
- Added a read-only BulkOperationRuns surface (list + view) for status/progress.
- Added DB notifications for run status changes (with “View run” link).

### US1 – Policy “Capture snapshot” is job-only

- Policy detail “Capture snapshot” now:
  - creates/reuses a run (dedupe key: tenant + policy.capture_snapshot + policy DB id)
  - dispatches a queued job
  - returns immediately with notification + link to run detail
- Graph capture work moved fully into the job; request path stays Graph-free.

### US3 – Restore runs orchestration is job-only + safe

- Live restore execution is queued and updates RestoreRun status/progress.
- Per-item outcomes are persisted deterministically (per internal DB record).
- Audit logging is written for live restore.
- Preview/dry-run is enforced as read-only (no writes).

### Tenant isolation / authorization (non-negotiable)

- Run list/view/start are tenant-scoped and policy-guarded (cross-tenant access => 403, not 404).
- Explicit Pest tests cover cross-tenant denial and start authorization.

## Tests / Verification

- `./vendor/bin/pint --dirty`
- Targeted suite (examples):
  - policy capture snapshot queued + idempotency tests
  - restore orchestration + audit logging + preview read-only tests
  - run authorization / tenant isolation tests

## Notes / Scope boundaries

- Phase 1 UX = DB notifications + run detail page. A global “progress widget” is tracked as Phase 2 and not required for merge.
- Resilience/backoff is tracked in tasks but can be iterated further after merge.

## Review focus

- Dedupe behavior for queued/running runs (reuse vs create-new)
- Tenant scoping & policy gates for all run surfaces
- Restore safety: audit event + preview no-writes

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #56

# Research: Backup/Restore Job Orchestration (049)

This document resolves Phase 0 open questions and records design choices.

## Decisions

### 1) Run Record storage strategy

**Decision:** Reuse existing run-record primitives instead of introducing a brand-new “unified run” subsystem in Phase 1.

- Restore + re-run restore + dry-run/preview: use the existing `restore_runs` table / `App\Models\RestoreRun`.
- Backup set capture-like operations (e.g., “add policies and capture”): reuse `bulk_operation_runs` / `App\Models\BulkOperationRun` (already used for long-running background work like bulk exports) and (if needed) extend it to satisfy FR-002 fields.

**Rationale:**

- The codebase already has multiple proven “run tables” (`restore_runs`, `inventory_sync_runs`, `backup_schedule_runs`, `bulk_operation_runs`).
- Minimizes migration risk and avoids broad refactors.
- Lets Phase 1 focus on eliminating inline heavy work while keeping UX consistent.

**Alternatives considered:**

- **Create a new generic `operation_runs` + `operation_run_items` data model** for all queued automation.
  - Rejected (Phase 1): higher migration + backfill cost; high coordination risk across many features.

### 2) Status lifecycle mapping

**Decision:** Standardize at the *UI + plan* level on `queued → running → (succeeded | failed | partial)` while allowing underlying storage to keep its existing status vocabulary.

- `BulkOperationRun.status` mapping: `pending→queued`, `running→running`, `completed→succeeded`, `completed_with_errors→partial`, `failed/aborted→failed`.
- `RestoreRun.status` mapping will be aligned (e.g., `pending→queued`, `running→running`, etc.) as part of implementation.

**Rationale:**

- Keeps the spec’s lifecycle consistent without forcing an immediate cross-table refactor.

**Alternatives considered:**

- **Rename and normalize all run statuses across all run tables.**
  - Rejected (Phase 1): touches many workflows and tests.

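The `BulkOperationRun.status` mapping from the decision above can be expressed as a small lookup. This is a language-agnostic sketch written in Python for brevity (the project itself is PHP/Laravel); the names `BULK_OPERATION_RUN_STATUS_MAP` and `to_lifecycle_status` are hypothetical and not taken from the codebase.

```python
# Hypothetical sketch: maps stored BulkOperationRun statuses onto the
# UI/plan-level lifecycle. Names are illustrative, not from the codebase.
BULK_OPERATION_RUN_STATUS_MAP = {
    "pending": "queued",
    "running": "running",
    "completed": "succeeded",
    "completed_with_errors": "partial",
    "failed": "failed",
    "aborted": "failed",  # "aborted" is folded into the failed bucket
}


def to_lifecycle_status(storage_status: str) -> str:
    """Translate a stored status into the lifecycle vocabulary.

    Raises ValueError for unknown values so that a new storage status
    cannot silently appear in the UI without a mapping decision.
    """
    try:
        return BULK_OPERATION_RUN_STATUS_MAP[storage_status]
    except KeyError:
        raise ValueError(f"Unmapped storage status: {storage_status!r}") from None
```

Failing loudly on an unmapped value is one possible design choice; a fallback bucket would also work, at the cost of hiding vocabulary drift in the underlying tables.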
### 3) Idempotency & de-duplication

**Decision:** Enforce de-duplication for *active* runs via a deterministic key and a DB query gate, with an optional lock for race reduction.

- Dedupe key format: `tenant_id + operation_type + target_object_id` (plus a stable hash of relevant payload if needed).
- Behavior: if an identical run is `queued`/`running`, reuse it and return/link to it; allow a new run only after the existing one reaches a terminal state.

**Rationale:**

- Matches the constitution (“Automation must be Idempotent & Observable”) and aligns with existing patterns (inventory selection hash + schedule locks).

**Alternatives considered:**

- **Cache-only locks** (`Cache::lock(...)`) without persisted keys.
  - Rejected: harder to reason about after restarts; less observable.

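The key format and reuse behavior described in the decision above might look roughly like the following. It is a minimal in-memory sketch in Python, not the project's RunIdempotency helper: `dedupe_key`, `Run`, and `RunStore` are hypothetical names, the `:` separator and SHA-256 payload hash are assumptions, and the list stands in for the DB query gate.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Lifecycle statuses that count as "active" for de-duplication.
ACTIVE_STATUSES = {"queued", "running"}


def dedupe_key(tenant_id, operation_type, target_object_id, payload=None) -> str:
    """Build a deterministic key; the payload hash is only added when given."""
    parts = [str(tenant_id), operation_type, str(target_object_id)]
    if payload is not None:
        # Sorted keys make the hash independent of dict ordering.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        parts.append(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    return ":".join(parts)


@dataclass
class Run:
    key: str
    status: str = "queued"


@dataclass
class RunStore:
    """In-memory stand-in for the run table (assumption for illustration)."""
    runs: List[Run] = field(default_factory=list)

    def find_active(self, key: str) -> Optional[Run]:
        for run in self.runs:
            if run.key == key and run.status in ACTIVE_STATUSES:
                return run
        return None

    def start_or_reuse(self, key: str) -> Tuple[Run, bool]:
        """Return (run, created). Reuses an active run with the same key."""
        existing = self.find_active(key)
        if existing is not None:
            return existing, False
        run = Run(key=key)
        self.runs.append(run)
        return run, True
```

A second `start_or_reuse` call with the same key returns the first run until that run leaves `queued`/`running`. The sketch does not address concurrent callers; that gap is what the optional lock in the decision is meant to narrow.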
### 4) Restore preview must be asynchronous

**Decision:** Move restore preview generation (“Generate preview” in the wizard) into a queued job which persists preview outputs to the run record.

**Rationale:**

- Preview can require Graph calls and normalization work; it should never block an interactive request.

**Alternatives considered:**

- **Keep preview synchronous** and increase timeouts.
  - Rejected: timeouts, poor UX, and violates FR-001.

### 5) Notifications for progress visibility

**Decision:** Use DB notifications for state transitions (queued/running/terminal) and keep a Run detail view as the primary progress surface in Phase 1.

**Rationale:**

- Inventory sync + backup schedule runs already use this pattern.
- Survives page reloads and doesn’t require the user to keep the page open.

**Alternatives considered:**

- **Frontend polling only** (no DB notifications).
  - Rejected: weaker UX and weaker observability.

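As a rough illustration of the decision above, a persisted notification could be written once per state transition, each carrying a link to the run detail view. This Python sketch is hypothetical throughout: `RunNotification`, `NotificationTable`, and the `/runs/{id}` URL shape are assumptions, and the list stands in for the notifications table.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# UI/plan-level lifecycle statuses that may trigger a notification.
LIFECYCLE_STATUSES = ("queued", "running", "succeeded", "failed", "partial")


@dataclass
class RunNotification:
    user_id: int
    run_id: int
    status: str
    action_label: str
    action_url: str


@dataclass
class NotificationTable:
    """In-memory stand-in for persisted DB notifications (assumption)."""
    rows: List[RunNotification] = field(default_factory=list)

    def record_transition(
        self,
        user_id: int,
        run_id: int,
        old_status: Optional[str],
        new_status: str,
    ) -> Optional[RunNotification]:
        """Persist one notification per real transition; skip no-op updates."""
        if new_status not in LIFECYCLE_STATUSES:
            raise ValueError(f"Unknown lifecycle status: {new_status!r}")
        if new_status == old_status:
            return None
        row = RunNotification(
            user_id=user_id,
            run_id=run_id,
            status=new_status,
            action_label="View run",
            action_url=f"/runs/{run_id}",  # hypothetical route shape
        )
        self.rows.append(row)
        return row
```

Because the rows are stored rather than pushed to an open page, they remain available after a reload, which is the property the rationale relies on.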
## Clarifications resolved

- **SC-003 includes “canceled”** while Phase 1 explicitly has “no cancel”.
  - Resolution for Phase 1 planning: treat “canceled” as out-of-scope (Phase 2+) and map “aborted” (if present) into the `failed` bucket for SC accounting.