TenantAtlas/specs/049-backup-restore-job-orchestration/research.md
ahmido bcf4996a1e feat/049-backup-restore-job-orchestration (#56)
Summary

This PR implements Spec 049 – Backup/Restore Job Orchestration: all critical Backup/Restore execution paths are job-only, idempotent, tenant-scoped, and observable via run records + DB notifications (Phase 1). The UI no longer performs heavy Graph work inside request/Filament actions for these flows.

Why

We want predictable UX and operations at MSP scale:
	•	no timeouts / long-running requests
	•	reproducible run state + per-item results
	•	safe error persistence (no secrets / no token leakage)
	•	strict tenant isolation + auditability for write paths

What changed

Foundational (Runs + Idempotency + Observability)
	•	Added a shared RunIdempotency helper (dedupe while queued/running).
	•	Added a read-only BulkOperationRuns surface (list + view) for status/progress.
	•	Added DB notifications for run status changes (with “View run” link).

US1 – Policy “Capture snapshot” is job-only
	•	Policy detail “Capture snapshot” now:
		•	creates/reuses a run (dedupe key: tenant + policy.capture_snapshot + policy DB id)
		•	dispatches a queued job
		•	returns immediately with notification + link to run detail
	•	Graph capture work moved fully into the job; request path stays Graph-free.

US3 – Restore runs orchestration is job-only + safe
	•	Live restore execution is queued and updates RestoreRun status/progress.
	•	Per-item outcomes are persisted deterministically (per internal DB record).
	•	Audit logging is written for live restore.
	•	Preview/dry-run is enforced as read-only (no writes).

Tenant isolation / authorization (non-negotiable)
	•	Run list/view/start are tenant-scoped and policy-guarded (cross-tenant access => 403, not 404).
	•	Explicit Pest tests cover cross-tenant denial and start authorization.

Tests / Verification
	•	./vendor/bin/pint --dirty
	•	Targeted suite (examples):
		•	policy capture snapshot queued + idempotency tests
		•	restore orchestration + audit logging + preview read-only tests
		•	run authorization / tenant isolation tests

Notes / Scope boundaries
	•	Phase 1 UX = DB notifications + run detail page. A global “progress widget” is tracked as Phase 2 and not required for merge.
	•	Resilience/backoff is tracked in tasks but can be iterated further after merge.

Review focus
	•	Dedupe behavior for queued/running runs (reuse vs create-new)
	•	Tenant scoping & policy gates for all run surfaces
	•	Restore safety: audit event + preview no-writes

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #56
2026-01-11 15:59:06 +00:00


# Research: Backup/Restore Job Orchestration (049)
This document resolves Phase 0 open questions and records design choices.
## Decisions
### 1) Run Record storage strategy
**Decision:** Reuse existing run-record primitives instead of introducing a brand-new “unified run” subsystem in Phase 1.
- Restore + re-run restore + dry-run/preview: use the existing `restore_runs` table / `App\Models\RestoreRun`.
- Backup set capture-like operations (e.g., “add policies and capture”): reuse `bulk_operation_runs` / `App\Models\BulkOperationRun` (already used for long-running background work like bulk exports) and (if needed) extend it to satisfy FR-002 fields.

**Rationale:**
- The codebase already has multiple proven “run tables” (`restore_runs`, `inventory_sync_runs`, `backup_schedule_runs`, `bulk_operation_runs`).
- Minimizes migration risk and avoids broad refactors.
- Lets Phase 1 focus on eliminating inline heavy work while keeping UX consistent.

**Alternatives considered:**
- **Create a new generic `operation_runs` + `operation_run_items` data model** for all queued automation.
  - Rejected (Phase 1): higher migration + backfill cost; high coordination risk across many features.
### 2) Status lifecycle mapping
**Decision:** Standardize at the *UI + plan* level on `queued → running → (succeeded | failed | partial)` while allowing underlying storage to keep its existing status vocabulary.
- `BulkOperationRun.status` mapping: `pending→queued`, `running→running`, `completed→succeeded`, `completed_with_errors→partial`, `failed/aborted→failed`.
- `RestoreRun.status` mapping will be aligned in the same way (e.g., `pending→queued`, `running→running`) as part of implementation.

**Rationale:**
- Keeps the spec's lifecycle consistent without forcing an immediate cross-table refactor.

**Alternatives considered:**
- **Rename and normalize all run statuses across all run tables.**
  - Rejected (Phase 1): touches many workflows and tests.
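
The `BulkOperationRun.status` mapping above can be sketched as a plain lookup table. This is a language-agnostic illustration in Python, not the project's Laravel code; the names `UiStatus`, `to_ui_status`, and `is_terminal` are hypothetical, and only the status strings come from the mapping listed in this decision.

```python
from enum import Enum


class UiStatus(str, Enum):
    """UI/plan-level lifecycle: queued -> running -> (succeeded | failed | partial)."""
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    PARTIAL = "partial"


# Stored values on the left are the BulkOperationRun statuses named in this document.
BULK_OPERATION_STATUS_MAP = {
    "pending": UiStatus.QUEUED,
    "running": UiStatus.RUNNING,
    "completed": UiStatus.SUCCEEDED,
    "completed_with_errors": UiStatus.PARTIAL,
    "failed": UiStatus.FAILED,
    "aborted": UiStatus.FAILED,
}

TERMINAL = {UiStatus.SUCCEEDED, UiStatus.FAILED, UiStatus.PARTIAL}


def to_ui_status(stored: str) -> UiStatus:
    """Translate a stored status; fail loudly on unknown values instead of guessing."""
    try:
        return BULK_OPERATION_STATUS_MAP[stored]
    except KeyError:
        raise ValueError(f"unmapped BulkOperationRun status: {stored!r}") from None


def is_terminal(stored: str) -> bool:
    """True once the run has reached succeeded, failed, or partial."""
    return to_ui_status(stored) in TERMINAL
```

Keeping the translation in one place means the storage vocabulary can stay as it is while lists, detail views, and notifications all show the same lifecycle.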
### 3) Idempotency & de-duplication
**Decision:** Enforce de-duplication for *active* runs via a deterministic key and a DB query gate, with an optional lock for race reduction.
- Dedupe key format: `tenant_id + operation_type + target_object_id` (plus a stable hash of relevant payload if needed).
- Behavior: if an identical run is `queued`/`running`, reuse it and return/link to it; allow a new run only after the existing run reaches a terminal state.

**Rationale:**
- Matches the constitution (“Automation must be Idempotent & Observable”) and aligns with existing patterns (inventory selection hash + schedule locks).

**Alternatives considered:**
- **Cache-only locks** (`Cache::lock(...)`) without persisted keys.
  - Rejected: harder to reason about after restarts; less observable.
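
A minimal sketch of the key format and the reuse-while-active gate described above, written in Python for illustration only. `dedupe_key`, `Run`, and `RunStore` are hypothetical names, the in-memory list stands in for the run table, and the `:` separator and SHA-256 payload hash are assumptions rather than the project's chosen encoding.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Optional

ACTIVE_STATUSES = {"queued", "running"}


def dedupe_key(tenant_id: int, operation_type: str, target_object_id: str,
               payload: Optional[dict] = None) -> str:
    """Build tenant_id + operation_type + target_object_id, plus an optional payload hash."""
    parts = [str(tenant_id), operation_type, str(target_object_id)]
    if payload is not None:
        # Canonical JSON so that key order in the payload does not change the hash.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        parts.append(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    return ":".join(parts)


@dataclass
class Run:
    id: int
    key: str
    status: str = "queued"


@dataclass
class RunStore:
    """In-memory stand-in for the run table; a real gate would be a DB query."""
    runs: list = field(default_factory=list)

    def find_active(self, key: str) -> Optional[Run]:
        for run in self.runs:
            if run.key == key and run.status in ACTIVE_STATUSES:
                return run
        return None

    def start_or_reuse(self, key: str) -> tuple:
        """Return (run, created). Reuse an active run, otherwise create a new one."""
        existing = self.find_active(key)
        if existing is not None:
            return existing, False
        run = Run(id=len(self.runs) + 1, key=key)
        self.runs.append(run)
        return run, True
```

The check-then-create step here is not atomic, which is where the optional lock mentioned in the decision would come in: two simultaneous requests could both miss the active run unless the gate is serialized.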
### 4) Restore preview must be asynchronous
**Decision:** Move restore preview generation (“Generate preview” in the wizard) into a queued job which persists preview outputs to the run record.

**Rationale:**
- Preview can require Graph calls and normalization work; it should never block an interactive request.

**Alternatives considered:**
- **Keep preview synchronous** and increase timeouts.
  - Rejected: still risks timeouts, gives poor UX, and violates FR-001.
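
The split between the request path and the worker path can be sketched as follows. This is an in-process Python illustration under stated assumptions, not the Laravel job: `PreviewRun`, `PreviewQueue`, and the injected `generate` callable are hypothetical, and recording only the exception class name is one conservative way to avoid persisting secrets, not necessarily the project's approach.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PreviewRun:
    id: int
    status: str = "queued"
    preview: Optional[dict] = None
    error: Optional[str] = None


class PreviewQueue:
    """Toy queue: the request path only enqueues, the worker does the heavy work."""

    def __init__(self, generate: Callable[[int], dict]) -> None:
        self._generate = generate  # stands in for Graph calls + normalization
        self._pending: deque = deque()

    def request_preview(self, run: PreviewRun) -> PreviewRun:
        """Request path: enqueue and return immediately; never calls `generate`."""
        self._pending.append(run)
        return run

    def work_one(self) -> bool:
        """Worker path: process one queued run; return False when the queue is empty."""
        if not self._pending:
            return False
        run = self._pending.popleft()
        run.status = "running"
        try:
            run.preview = self._generate(run.id)  # persisted on the run record
            run.status = "succeeded"
        except Exception as exc:
            # Keep only the error class; the message may contain tokens or secrets.
            run.error = type(exc).__name__
            run.status = "failed"
        return True
```

The wizard would then read `status` and `preview` from the run record instead of waiting on the request that started the preview.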
### 5) Notifications for progress visibility
**Decision:** Use DB notifications for state transitions (queued/running/terminal) and keep a Run detail view as the primary progress surface in Phase 1.

**Rationale:**
- Inventory sync + backup schedule runs already use this pattern.
- Survives page reloads and doesn't require the user to keep the page open.

**Alternatives considered:**
- **Frontend polling only** (no DB notifications).
  - Rejected: weaker UX and weaker observability.
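
One way to picture "a notification per state transition" is a run object that writes a persisted notification whenever its status changes. The sketch below is illustrative Python, not the Filament/Laravel notification API; `Notification`, `NotifyingRun`, the `inbox` list (standing in for the notifications table), and the `/runs/{id}` link format are all hypothetical.

```python
from dataclasses import dataclass

# Allowed moves in the queued -> running -> terminal lifecycle.
ALLOWED = {
    "queued": {"running"},
    "running": {"succeeded", "failed", "partial"},
}


@dataclass
class Notification:
    user_id: int
    run_id: int
    status: str
    link: str  # "View run" target; the path format is an assumption


@dataclass
class NotifyingRun:
    id: int
    user_id: int
    inbox: list  # stand-in for the DB notifications table
    status: str = "queued"

    def __post_init__(self) -> None:
        self._notify()  # the initial "queued" state is also announced

    def transition(self, new_status: str) -> None:
        """Move to a new status and record a notification; reject invalid moves."""
        if new_status not in ALLOWED.get(self.status, set()):
            raise ValueError(f"invalid transition {self.status} -> {new_status}")
        self.status = new_status
        self._notify()

    def _notify(self) -> None:
        self.inbox.append(
            Notification(self.user_id, self.id, self.status, f"/runs/{self.id}")
        )
```

Because each notification is a stored row that links back to the run, the user can close the page and still find the outcome later, which is the property frontend polling alone does not provide.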
## Clarifications resolved
- **SC-003 includes “canceled”** while Phase 1 explicitly has “no cancel”.
  - Resolution for Phase 1 planning: treat “canceled” as out-of-scope (Phase 2+) and map “aborted” (if present) into the `failed` bucket for SC accounting.