TenantAtlas/specs/049-backup-restore-job-orchestration/spec.md
ahmido bcf4996a1e feat/049-backup-restore-job-orchestration (#56)
Summary

This PR implements Spec 049 – Backup/Restore Job Orchestration: all critical Backup/Restore execution paths are job-only, idempotent, tenant-scoped, and observable via run records + DB notifications (Phase 1). The UI no longer performs heavy Graph work inside request/Filament actions for these flows.

Why

We want predictable UX and operations at MSP scale:
	•	no timeouts / long-running requests
	•	reproducible run state + per-item results
	•	safe error persistence (no secrets / no token leakage)
	•	strict tenant isolation + auditability for write paths

What changed

Foundational (Runs + Idempotency + Observability)
	•	Added a shared RunIdempotency helper (dedupe while queued/running).
	•	Added a read-only BulkOperationRuns surface (list + view) for status/progress.
	•	Added DB notifications for run status changes (with “View run” link).

US1 – Policy “Capture snapshot” is job-only
	•	Policy detail “Capture snapshot” now:
		•	creates/reuses a run (dedupe key: tenant + policy.capture_snapshot + policy DB id)
		•	dispatches a queued job
		•	returns immediately with notification + link to run detail
	•	Graph capture work moved fully into the job; request path stays Graph-free.
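
The create/reuse-then-dispatch flow above can be sketched as follows. This is a minimal illustration in Python, not the Laravel code from this PR: `RunStore`, `Run`, and `start_capture_snapshot` are hypothetical names, and the in-memory lists stand in for the run table and the job queue. A real implementation would also need the lookup and the insert to be atomic (for example via a lock or a unique constraint), which this sketch leaves out.

```python
from dataclasses import dataclass, field
from itertools import count

# Terminal states taken from the spec's status lifecycle.
TERMINAL = {"succeeded", "failed", "partial"}


@dataclass
class Run:
    id: int
    dedupe_key: str
    status: str = "queued"


@dataclass
class RunStore:
    """In-memory stand-in for the run table and the job queue (hypothetical)."""
    runs: list = field(default_factory=list)
    queue: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def find_active(self, key):
        # An "active" run is one that is still queued or running.
        for run in self.runs:
            if run.dedupe_key == key and run.status not in TERMINAL:
                return run
        return None


def start_capture_snapshot(store, tenant_id, policy_id):
    """Request-path handler: create or reuse a run, enqueue, return at once."""
    key = f"{tenant_id}:policy.capture_snapshot:{policy_id}"
    existing = store.find_active(key)
    if existing is not None:
        return existing, False  # reused: nothing new is dispatched
    run = Run(id=next(store._ids), dedupe_key=key)
    store.runs.append(run)
    store.queue.append(run.id)  # the Graph work happens later, in the job
    return run, True
```

Calling `start_capture_snapshot` twice for the same tenant and policy returns the same run and enqueues only one job. A new run is created only after the first one reaches a terminal state.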

US3 – Restore runs orchestration is job-only + safe
	•	Live restore execution is queued and updates RestoreRun status/progress.
	•	Per-item outcomes are persisted deterministically (per internal DB record).
	•	Audit logging is written for live restore.
	•	Preview/dry-run is enforced as read-only (no writes).

Tenant isolation / authorization (non-negotiable)
	•	Run list/view/start are tenant-scoped and policy-guarded (cross-tenant access => 403, not 404).
	•	Explicit Pest tests cover cross-tenant denial and start authorization.
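
As a rough illustration of the cross-tenant rule (deny with 403 rather than hide with 404), the check might look like the sketch below. It is written in Python for brevity. `User`, `Run`, `Forbidden`, and `authorize_view` are hypothetical names, not the policy classes in this PR.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Maps to an HTTP 403 response in this sketch."""
    status_code = 403


@dataclass(frozen=True)
class User:
    id: int
    tenant_ids: frozenset  # tenants the user belongs to


@dataclass(frozen=True)
class Run:
    id: int
    tenant_id: int


def authorize_view(user, run):
    # The run exists, but the caller is not a member of its tenant:
    # deny explicitly (403) instead of pretending it is missing (404).
    if run.tenant_id not in user.tenant_ids:
        raise Forbidden(f"user {user.id} may not view run {run.id}")
    return run
```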

Tests / Verification
	•	./vendor/bin/pint --dirty
	•	Targeted suite (examples):
		•	policy capture snapshot queued + idempotency tests
		•	restore orchestration + audit logging + preview read-only tests
		•	run authorization / tenant isolation tests

Notes / Scope boundaries
	•	Phase 1 UX = DB notifications + run detail page. A global “progress widget” is tracked as Phase 2 and not required for merge.
	•	Resilience/backoff is tracked in tasks but can be iterated further after merge.

Review focus
	•	Dedupe behavior for queued/running runs (reuse vs create-new)
	•	Tenant scoping & policy gates for all run surfaces
	•	Restore safety: audit event + preview no-writes

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #56
2026-01-11 15:59:06 +00:00

Feature Specification: Backup/Restore Job Orchestration (049)

Feature Branch: feat/049-backup-restore-job-orchestration
Created: 2026-01-11
Status: Draft
Input: Ensure Backup/Restore “start/execute” actions never run inline in an interactive request; they run via background processing with run records and visible progress.

Purpose

All Backup/Restore “Start/Execute” actions run exclusively via background processing with Run Records and visible progress. This prevents timeouts, double-click duplication, throttling issues, and improves reliability at MSP scale.

Non-Goals (Phase 1)

  • No new directory/group inventory or name resolution features (separate initiative)
  • No changes to external service contracts unless required for orchestration safety
  • No new promotion feature (e.g., DEV→PROD) (separate initiative)

Clarifications

Session 2026-01-11

  • Q: For FR-004 Idempotency, what should happen when the admin starts the same operation again for the same tenant + target while one is still queued/running? → A: Reuse the existing run if an identical one is queued/running; allow a new run only after the existing run reaches a terminal state.
  • Q: For FR-002 status lifecycle, do we support canceling runs in Phase 1? → A: No cancel in Phase 1.
  • Q: For FR-003 Progress visibility, which UI surfaces are required in Phase 1? → A: Phase 1 requires Run detail progress (counts/status) + DB notifications; Phase 2 adds a required global progress widget for all run types.
  • Q: For FR-004, how should we define the “target object” used for de-duplication? → A: Dedupe key uses (tenant + operation type + target object id).
  • Q: For FR-005 per-item outcome persistence, what is the item granularity for counts + item results in Phase 1? → A: Per internal DB record (e.g., restore/backup item rows).

User Scenarios & Testing (mandatory)

User Story 1 - Capture snapshot runs in background (Priority: P1)

An admin can start a “capture snapshot” operation without the UI hanging or timing out, and can see progress plus the final result.

Why this priority: Snapshot capture is a core workflow and a common source of long-running requests.

Independent Test: Starting a snapshot capture immediately returns to the UI with a queued Run Record that later transitions to a terminal state (success/failed/partial) and can be inspected.

Acceptance Scenarios:

  1. Given an admin has access to a tenant, When they start “capture snapshot”, Then the UI confirms it was queued and shows a link to the Run Record.
  2. Given a capture snapshot run is executing, When the admin views the run, Then they see progress (items done vs total) and any safe error summaries.

User Story 2 - Backup set create/capture runs in background (Priority: P2)

An admin can create a backup set and optionally start a capture/sync operation without the request doing heavy work.

Why this priority: Creating backup sets is frequent and should not be coupled to long-running capture logic.

Independent Test: Creating a backup set returns quickly and any capture/sync work appears as a run with progress.

Acceptance Scenarios:

  1. Given an admin creates a backup set with capture enabled, When they submit, Then the backup set is created and a capture run is queued.

User Story 3 - Restore runs in background with per-item results (Priority: P1)

An admin can start a “restore to Intune” or “re-run restore” operation as a background run and later inspect item-level outcomes and errors.

Why this priority: Restore is high-impact and must be resilient, observable, and safe under retries.

Independent Test: Starting restore creates a Run Record and item results that remain accessible even if the external service is unavailable.

Acceptance Scenarios:

  1. Given an admin starts a restore, When they confirm the action, Then the UI queues a run and returns immediately (no long-running request).
  2. Given a restore run finishes with mixed outcomes, When the admin views the run details, Then they see succeeded/failed counts and a safe error summary per failed item.
  3. Given an admin executes a live restore, When the run is queued/executed, Then an auditable event is recorded that links to the run.

User Story 4 - Dry-run/preview runs in background (Priority: P2)

An admin can run a dry-run/preview without UI timeouts, and the preview results are persisted and shown in the UI.

Why this priority: Preview supports safe change management and must remain usable even when the external service is slow or down.

Independent Test: Starting preview immediately creates a run; once finished, preview outputs are visible and reusable.

Acceptance Scenarios:

  1. Given an admin starts a preview run, When the run completes, Then the UI shows preview results without requiring re-execution.
  2. Given an admin starts a preview/dry-run, When the run executes, Then no write/change is performed against the external system.
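
The spec does not say how the no-write rule in scenario 2 is enforced. One possible approach, sketched here in Python, is to wrap the outbound client so that any non-read request fails fast during a preview. `GuardedClient`, `WriteInPreviewError`, and the injected `send` callable are hypothetical. The sketch also assumes every write to the external system uses a method other than GET or HEAD.

```python
class WriteInPreviewError(RuntimeError):
    """Raised when a preview/dry-run attempts a write."""


# Assumption: only these HTTP methods are side-effect free.
READ_ONLY_METHODS = {"GET", "HEAD"}


class GuardedClient:
    def __init__(self, send, dry_run):
        self._send = send        # callable(method, path, body) doing the real I/O
        self._dry_run = dry_run  # True for preview runs

    def request(self, method, path, body=None):
        method = method.upper()
        if self._dry_run and method not in READ_ONLY_METHODS:
            # Block before any network call is made.
            raise WriteInPreviewError(f"{method} {path} blocked in preview")
        return self._send(method, path, body)
```

A test for scenario 2 could then inject a fake `send`, run the preview, and assert that only read calls were recorded.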

Edge Cases

  • Double-clicking an action rapidly
  • Retrying while an identical run is already queued or running
  • External service is unavailable (e.g., throttling or outage)
  • A run gets stuck or exceeds expected duration
  • Permissions change after a run was queued

Requirements (mandatory)

Constitution alignment (required): If this feature introduces any Microsoft Graph calls or any write/change behavior, the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.

Functional Requirements

  • FR-001 Job-only execution: The system MUST execute the following operations via background processing and MUST NOT perform heavy work inline during the interactive request:

    • Capture snapshot
    • Backup set create with capture/sync (when capture is triggered)
    • Restore to Intune
    • Re-run restore
    • Restore dry-run/preview
  • FR-002 Run Records: Each operation start MUST create (or deterministically re-use) a Run Record before the work begins, containing:

    • Tenant identity
    • Initiator identity (user reference or audit reference)
    • Operation type and optional target object reference
    • Status lifecycle: queued → running → (succeeded | failed | partial)
    • Started/finished timestamps
    • Item counts: total / succeeded / failed
    • Safe error code and safe error context (no secrets)
  • FR-003 Progress visibility: While a run is executing, the system MUST:

    • Emit in-app notifications for key state transitions (queued/running/completed/failed)
    • Provide a Run detail view that shows progress (status + item counts)

    Phase 1 does not require a global progress widget. Phase 2 MUST add a global progress widget, required for all run types.

  • FR-004 Idempotency & concurrency control: The system MUST prevent uncontrolled duplicate execution due to double-clicks/retries by enforcing a deterministic de-duplication rule keyed by (tenant + operation type + target object) or (tenant + run id). When an identical run is already queued/running, the UI MUST show “already queued/running” and link to the existing run.

    Clarification: For an identical start attempt while a run is queued or running, the system MUST re-use the existing Run Record and MUST NOT create a new run. A new run MAY be started only after the existing run reaches a terminal state.

    Clarification: For Phase 1, the default de-duplication key is (tenant + operation type + target object id).

  • FR-005 Deterministic outcome persistence: The system MUST persist per-item outcomes for operations that act on multiple items, including status and a safe error summary, so results can be viewed later without relying on logs.

    Clarification: In Phase 1, “item” refers to the internal DB record being acted on (e.g., a restore/backup item row). Counts (total/succeeded/failed) MUST be derived from these persisted item results.

  • FR-006 Tenant isolation & authorization: Run visibility and execution MUST be tenant-scoped. Only authorized admins can start operations, and users MUST NOT be able to view or start runs across tenants.

  • FR-007 Safety rules: Preview/dry-run MUST be safe (no writes). Live restore MUST remain guarded with explicit confirmation and an auditable trail consistent with existing safety practices.

  • FR-008 Resilience (Post-MVP / Phase 2): The system MUST handle external service throttling/outages gracefully, including retries with backoff when appropriate, and MUST end runs in a clear terminal state (failed/partial) rather than silently failing.

    Note: MVP/Phase 1 relies on existing retry behavior where present; standardized backoff + jitter hardening is scheduled post-MVP.

  • FR-009 Safe logging & data minimization: The system MUST NOT store secrets/tokens in Run Records, notifications, or error contexts. Error context MUST be limited to a defined, safe set of fields.
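
FR-009's "defined, safe set of fields" can be read as an allow-list applied before anything is persisted. A minimal Python sketch follows. The field names in `SAFE_FIELDS` and the 200-character cap are assumptions for illustration, not values taken from this spec.

```python
# Hypothetical allow-list; the real set of safe fields is defined elsewhere.
SAFE_FIELDS = ("error_code", "http_status", "item_id", "operation")


def safe_error_context(raw, max_len=200):
    """Keep only allow-listed scalar fields; drop everything else."""
    safe = {}
    for key in SAFE_FIELDS:
        value = raw.get(key)
        # Nested structures (headers, payloads) are dropped even under a safe key.
        if not isinstance(value, (str, int)):
            continue
        safe[key] = value[:max_len] if isinstance(value, str) else value
    return safe
```

Because unknown keys are dropped by default, a token or header added to the raw error later would not reach the Run Record unless someone explicitly adds it to the list.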

Acceptance Checks

  • Starting any in-scope operation returns quickly with a queued Run Record link.
  • A Run Record always exists before background work begins, and every run eventually reaches a terminal state.
  • Phase 1 does not support canceling runs.
  • Progress and state changes are visible via Run detail view and in-app notifications.
  • Phase 2 adds a global progress widget for all run types.
  • Duplicate start attempts for the same tenant + operation + target do not create uncontrolled duplicate execution.
  • Duplicate start attempts for the same tenant + operation + target while a run is queued/running re-use the existing run and link to it.
  • Item-level outcomes and safe error summaries are viewable after completion.
  • Run counts reflect persisted internal item results.
  • Preview/dry-run never performs writes.
  • Unauthorized users cannot start runs for a tenant they do not belong to.
  • Users cannot list/view run records across tenants.
  • Live restore creates an auditable event linked to the run.

Key Entities (include if feature involves data)

  • Run Record: A tenant-scoped record representing one started operation and its lifecycle, progress, and summary outcome.
  • Run Item Result: A tenant-scoped record representing the outcome for a single item processed as part of a Run Record.
  • Notification Event: A tenant-scoped event surfaced to the admin UI to communicate run state changes.
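
To make the relationship between the first two entities concrete, here is a small Python sketch in which counts are derived from persisted item results (per FR-005) rather than stored separately. The class and field names are hypothetical. The mapping of mixed outcomes to "partial" is an assumption based on the status lifecycle in FR-002.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunItemResult:
    item_id: int                         # internal DB record acted on
    status: str                          # "succeeded" or "failed"
    error_summary: Optional[str] = None  # safe summary only, no secrets


@dataclass
class RunRecord:
    tenant_id: int
    operation: str
    total: int                           # number of items the run will process
    items: list = field(default_factory=list)

    def counts(self):
        succeeded = sum(1 for i in self.items if i.status == "succeeded")
        failed = sum(1 for i in self.items if i.status == "failed")
        return {"total": self.total, "succeeded": succeeded, "failed": failed}

    def terminal_status(self):
        c = self.counts()
        if c["succeeded"] + c["failed"] < c["total"]:
            return None                  # still in progress
        if c["failed"] == 0:
            return "succeeded"
        if c["succeeded"] == 0:
            return "failed"
        return "partial"                 # assumption: mixed outcomes
```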

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: For 95% of operation starts, the UI confirms “queued” within 2 seconds.
  • SC-002: Double-clicking an operation start results in at most one queued/running run for the same tenant + operation + target.
  • SC-003: 99% of runs end in a clear terminal state (succeeded/failed/partial) with a human-readable summary.
  • SC-004: Admins can locate the latest run status for an operation in under 30 seconds without requiring access to system logs.

Note: “canceled” is reserved for Phase 2+ (Phase 1 has no cancel support).

Assumptions

  • This feature builds on the UI safety constraints from 048: admin pages must remain usable even when the external service API is unavailable.
  • Run Records and item results are retained long enough to support operational troubleshooting and audits, with retention managed as a separate policy.