TenantAtlas/specs/049-backup-restore-job-orchestration/research.md
ahmido bcf4996a1e feat/049-backup-restore-job-orchestration (#56)
Summary

This PR implements Spec 049 – Backup/Restore Job Orchestration: all critical Backup/Restore execution paths are job-only, idempotent, tenant-scoped, and observable via run records + DB notifications (Phase 1). The UI no longer performs heavy Graph work inside request/Filament actions for these flows.

Why

We want predictable UX and operations at MSP scale:
	•	no timeouts / long-running requests
	•	reproducible run state + per-item results
	•	safe error persistence (no secrets / no token leakage)
	•	strict tenant isolation + auditability for write paths

What changed

Foundational (Runs + Idempotency + Observability)
	•	Added a shared RunIdempotency helper (dedupe while queued/running).
	•	Added a read-only BulkOperationRuns surface (list + view) for status/progress.
	•	Added DB notifications for run status changes (with “View run” link).

US1 – Policy “Capture snapshot” is job-only
	•	Policy detail “Capture snapshot” now:
		•	creates/reuses a run (dedupe key: tenant + policy.capture_snapshot + policy DB id)
		•	dispatches a queued job
		•	returns immediately with notification + link to run detail
	•	Graph capture work moved fully into the job; request path stays Graph-free.

US3 – Restore runs orchestration is job-only + safe
	•	Live restore execution is queued and updates RestoreRun status/progress.
	•	Per-item outcomes are persisted deterministically (per internal DB record).
	•	Audit logging is written for live restore.
	•	Preview/dry-run is enforced as read-only (no writes).

Tenant isolation / authorization (non-negotiable)
	•	Run list/view/start are tenant-scoped and policy-guarded (cross-tenant access => 403, not 404).
	•	Explicit Pest tests cover cross-tenant denial and start authorization.

Tests / Verification
	•	./vendor/bin/pint --dirty
	•	Targeted suite (examples):
		•	policy capture snapshot queued + idempotency tests
		•	restore orchestration + audit logging + preview read-only tests
		•	run authorization / tenant isolation tests

Notes / Scope boundaries
	•	Phase 1 UX = DB notifications + run detail page. A global “progress widget” is tracked as Phase 2 and not required for merge.
	•	Resilience/backoff is tracked in tasks but can be iterated further after merge.

Review focus
	•	Dedupe behavior for queued/running runs (reuse vs create-new)
	•	Tenant scoping & policy gates for all run surfaces
	•	Restore safety: audit event + preview no-writes

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #56
2026-01-11 15:59:06 +00:00


Research: Backup/Restore Job Orchestration (049)

This document resolves Phase 0 open questions and records design choices.

Decisions

1) Run Record storage strategy

Decision: Reuse existing run-record primitives instead of introducing a brand-new “unified run” subsystem in Phase 1.

  • Restore + re-run restore + dry-run/preview: use the existing restore_runs table / App\Models\RestoreRun.
  • Backup set capture-like operations (e.g., “add policies and capture”): reuse bulk_operation_runs / App\Models\BulkOperationRun (already used for long-running background work like bulk exports) and (if needed) extend it to satisfy FR-002 fields.

Rationale:

  • The codebase already has multiple proven “run tables” (restore_runs, inventory_sync_runs, backup_schedule_runs, bulk_operation_runs).
  • Minimizes migration risk and avoids broad refactors.
  • Lets Phase 1 focus on eliminating inline heavy work while keeping UX consistent.

Alternatives considered:

  • Create a new generic operation_runs + operation_run_items data model for all queued automation.
    • Rejected (Phase 1): higher migration + backfill cost; high coordination risk across many features.

2) Status lifecycle mapping

Decision: Standardize at the UI + plan level on queued → running → (succeeded | failed | partial) while allowing underlying storage to keep its existing status vocabulary.

  • BulkOperationRun.status mapping: pending→queued, running→running, completed→succeeded, completed_with_errors→partial, failed/aborted→failed.
  • RestoreRun.status mapping will be aligned (e.g., pending→queued, running→running, etc.) as part of implementation.
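
The mapping in the bullets above can be sketched as a single lookup. The storage-level status names come from the bullets; the function name is illustrative, not the codebase's API:

```php
<?php

// Illustrative sketch: map a storage-level BulkOperationRun status onto the
// spec-level lifecycle (queued -> running -> succeeded | failed | partial).
// Storage statuses are taken from the mapping above; the function name is assumed.
function toLifecycleStatus(string $storageStatus): string
{
    return match ($storageStatus) {
        'pending' => 'queued',
        'running' => 'running',
        'completed' => 'succeeded',
        'completed_with_errors' => 'partial',
        'failed', 'aborted' => 'failed',
        default => throw new InvalidArgumentException("Unknown status: {$storageStatus}"),
    };
}
```

Keeping this as a pure translation layer means the stored vocabulary never leaks into UI code, and a later cross-table rename (the rejected alternative) would only delete this shim.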

Rationale:

  • Keeps the spec's lifecycle consistent without forcing an immediate cross-table refactor.

Alternatives considered:

  • Rename and normalize all run statuses across all run tables.
    • Rejected (Phase 1): touches many workflows and tests.

3) Idempotency & de-duplication

Decision: Enforce de-duplication for active runs via a deterministic key and a DB query gate, with an optional lock for race reduction.

  • Dedupe key format: tenant_id + operation_type + target_object_id (plus a stable hash of relevant payload if needed).
  • Behavior: if an identical run is queued/running, reuse it and return/link to it; allow a new run only after the previous run reaches a terminal state.
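
A minimal sketch of the key construction. The key parts (tenant + operation type + target id, plus an optional payload hash) come from the decision above; the function name and exact separator are assumptions:

```php
<?php

// Illustrative sketch of the deterministic dedupe key:
// tenant_id + operation_type + target_object_id, plus a stable hash of the
// relevant payload when the same target can be run with different inputs.
// Function name and format are assumptions, not the codebase's API.
function dedupeKey(int $tenantId, string $operationType, int $targetObjectId, array $payload = []): string
{
    $parts = [$tenantId, $operationType, $targetObjectId];

    if ($payload !== []) {
        ksort($payload); // stable key ordering so identical payloads hash identically
        $parts[] = hash('sha256', json_encode($payload));
    }

    return implode(':', $parts);
}
```

The DB query gate would then look up queued/running runs by this key before creating a new record, with the optional `Cache::lock()` only narrowing the race window, not acting as the source of truth.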

Rationale:

  • Matches the constitution (“Automation must be Idempotent & Observable”) and aligns with existing patterns (inventory selection hash + schedule locks).

Alternatives considered:

  • Cache-only locks (Cache::lock(...)) without persisted keys.
    • Rejected: harder to reason about after restarts; less observable.

4) Restore preview must be asynchronous

Decision: Move restore preview generation (“Generate preview” in the wizard) into a queued job which persists preview outputs to the run record.

Rationale:

  • Preview can require Graph calls and normalization work; it should never block an interactive request.
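
A standalone sketch of the queued preview job's lifecycle. In the codebase this would be a Laravel `ShouldQueue` job operating on `App\Models\RestoreRun`; the classes below are simplified stand-ins (names assumed) so the queued flow and persisted outputs are visible end to end:

```php
<?php

// Stand-in for the run record; the real model persists to restore_runs.
class FakeRestoreRun
{
    public string $status = 'queued';
    public array $preview = [];
}

// Stand-in for the queued job: runs off the request path, updates status,
// and persists preview outputs (or a sanitized error) to the run record.
class GenerateRestorePreviewJob
{
    public function __construct(private FakeRestoreRun $run) {}

    public function handle(callable $buildPreview): void
    {
        $this->run->status = 'running';

        try {
            // Graph calls + normalization happen here, never in the request.
            $this->run->preview = $buildPreview();
            $this->run->status = 'succeeded';
        } catch (Throwable $e) {
            // Persist a sanitized message only -- no secrets, no token leakage.
            $this->run->preview = ['error' => $e->getMessage()];
            $this->run->status = 'failed';
        }
    }
}
```

The request path just creates the run and dispatches the job; the wizard then reads the persisted preview from the run record, which also makes it read-only by construction.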

Alternatives considered:

  • Keep preview synchronous and increase timeouts.
    • Rejected: timeouts, poor UX, and violates FR-001.

5) Notifications for progress visibility

Decision: Use DB notifications for state transitions (queued/running/terminal) and keep a Run detail view as the primary progress surface in Phase 1.

Rationale:

  • Inventory sync + backup schedule runs already use this pattern.
  • Survives page reloads and doesn't require the user to keep the page open.

Alternatives considered:

  • Frontend polling only (no DB notifications).
    • Rejected: weaker UX and weaker observability.

Clarifications resolved

  • SC-003 includes “canceled” while Phase 1 explicitly has “no cancel”.
    • Resolution for Phase 1 planning: treat “canceled” as out-of-scope (Phase 2+) and map “aborted” (if present) into the failed bucket for SC accounting.