ahmido bcf4996a1e feat/049-backup-restore-job-orchestration (#56)

Summary

This PR implements Spec 049 – Backup/Restore Job Orchestration: all critical Backup/Restore execution paths are job-only, idempotent, tenant-scoped, and observable via run records + DB notifications (Phase 1). The UI no longer performs heavy Graph work inside request/Filament actions for these flows.

Why

We want predictable UX and operations at MSP scale:
• no timeouts / long-running requests
• reproducible run state + per-item results
• safe error persistence (no secrets / no token leakage)
• strict tenant isolation + auditability for write paths

What changed

Foundational (Runs + Idempotency + Observability)
• Added a shared RunIdempotency helper (dedupe while queued/running).
• Added a read-only BulkOperationRuns surface (list + view) for status/progress.
• Added DB notifications for run status changes (with “View run” link).
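
The dedupe rule behind the idempotency helper can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the project's actual `RunIdempotency` code: `Run`, `RunStore`, and `find_or_create` are hypothetical names, and the in-memory list stands in for the runs table.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Statuses in which an identical start attempt must reuse the existing run.
ACTIVE_STATUSES = {"queued", "running"}


@dataclass
class Run:
    id: int
    tenant_id: str
    operation: str
    target_id: str
    status: str = "queued"

    @property
    def dedupe_key(self) -> Tuple[str, str, str]:
        # Dedupe key: tenant + operation type + target object id.
        return (self.tenant_id, self.operation, self.target_id)


class RunStore:
    """Hypothetical in-memory stand-in for the runs table."""

    def __init__(self) -> None:
        self.runs: List[Run] = []

    def find_or_create(self, tenant_id: str, operation: str, target_id: str) -> Tuple[Run, bool]:
        """Return (run, created). Reuse an active identical run, else create one."""
        key = (tenant_id, operation, target_id)
        for run in self.runs:
            if run.dedupe_key == key and run.status in ACTIVE_STATUSES:
                return run, False
        run = Run(len(self.runs) + 1, tenant_id, operation, target_id)
        self.runs.append(run)
        return run, True
```

A real implementation would presumably also need the lookup and insert to be atomic (for example via a database lock or unique constraint), which this sketch does not attempt.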

US1 – Policy “Capture snapshot” is job-only
• Policy detail “Capture snapshot” now:
  • creates/reuses a run (dedupe key: tenant + policy.capture_snapshot + policy DB id)
  • dispatches a queued job
  • returns immediately with notification + link to run detail
• Graph capture work moved fully into the job; request path stays Graph-free.
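
The split between the request path and the job can be illustrated like this. It is a rough Python sketch under stated assumptions: `FakeGraphClient`, `start_capture`, and `work_one` are hypothetical names, a `deque` stands in for the real queue, and run reuse (dedupe) is left out to keep the focus on where Graph calls happen.

```python
from collections import deque


class FakeGraphClient:
    """Hypothetical stand-in for the Graph client; counts outbound calls."""

    def __init__(self):
        self.calls = 0

    def fetch_policy(self, policy_id):
        self.calls += 1
        return {"id": policy_id, "settings": {}}


job_queue = deque()  # stand-in for the real job queue
runs = {}            # run id -> run record (stand-in for the runs table)


def start_capture(tenant_id, policy_id):
    """Request path: record the run, enqueue, return. No Graph calls here."""
    run_id = len(runs) + 1
    runs[run_id] = {
        "tenant_id": tenant_id,
        "policy_id": policy_id,
        "status": "queued",
        "snapshot": None,
    }
    job_queue.append(run_id)
    return {"run_id": run_id, "message": "Snapshot capture queued"}


def work_one(graph):
    """Worker path: the only place that talks to Graph."""
    run_id = job_queue.popleft()
    run = runs[run_id]
    run["status"] = "running"
    try:
        run["snapshot"] = graph.fetch_policy(run["policy_id"])
        run["status"] = "succeeded"
    except Exception:
        # End in a clear terminal state instead of failing silently.
        run["status"] = "failed"
```

The request handler only touches local state, so its latency does not depend on the external service.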

US3 – Restore runs orchestration is job-only + safe
• Live restore execution is queued and updates RestoreRun status/progress.
• Per-item outcomes are persisted deterministically (per internal DB record).
• Audit logging is written for live restore.
• Preview/dry-run is enforced as read-only (no writes).

Tenant isolation / authorization (non-negotiable)
• Run list/view/start are tenant-scoped and policy-guarded (cross-tenant access => 403, not 404).
• Explicit Pest tests cover cross-tenant denial and start authorization.
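
The tenant-scoping rule can be sketched as below. This is an illustrative Python sketch, not the project's policy classes: `User`, `Run`, `Forbidden`, and the three functions are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


class Forbidden(Exception):
    """Maps to HTTP 403 (cross-tenant access is denied, not hidden as 404)."""
    status_code = 403


@dataclass(frozen=True)
class User:
    id: int
    tenant_ids: FrozenSet[str]
    can_start_runs: bool = False


@dataclass(frozen=True)
class Run:
    id: int
    tenant_id: str


def visible_runs(user: User, runs: List[Run]) -> List[Run]:
    # List surface: only runs of tenants the user belongs to.
    return [r for r in runs if r.tenant_id in user.tenant_ids]


def authorize_view(user: User, run: Run) -> None:
    if run.tenant_id not in user.tenant_ids:
        raise Forbidden("run belongs to another tenant")


def authorize_start(user: User, tenant_id: str) -> None:
    # Starting requires tenant membership and an explicit permission.
    if tenant_id not in user.tenant_ids or not user.can_start_runs:
        raise Forbidden("not allowed to start runs for this tenant")
```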

Tests / Verification
• ./vendor/bin/pint --dirty
• Targeted suite (examples):
  • policy capture snapshot queued + idempotency tests
  • restore orchestration + audit logging + preview read-only tests
  • run authorization / tenant isolation tests

Notes / Scope boundaries
• Phase 1 UX = DB notifications + run detail page. A global “progress widget” is tracked as Phase 2 and not required for merge.
• Resilience/backoff is tracked in tasks but can be iterated further after merge.
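
For reviewers thinking ahead to the resilience follow-up, one common shape for retries is exponential backoff with full jitter. The sketch below is an assumption about what that hardening could look like, not something this PR implements: `Throttled`, `call_with_retry`, and the default values are all hypothetical.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical error raised when the external service throttles a call."""


def call_with_retry(fn, max_attempts=4, base=1.0, cap=60.0, sleep=time.sleep, rng=None):
    """Call fn(), retrying on Throttled with exponential backoff and full jitter.

    The delay before retry n (0-based) is drawn uniformly from
    [0, min(cap, base * 2**n)]. The last failure is re-raised so the caller
    can mark the run as failed.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled:
            if attempt == max_attempts - 1:
                raise
            sleep(rng.uniform(0, min(cap, base * 2 ** attempt)))
```

Passing `sleep` and `rng` as parameters keeps the helper testable without real waiting.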

Review focus
• Dedupe behavior for queued/running runs (reuse vs create-new)
• Tenant scoping & policy gates for all run surfaces
• Restore safety: audit event + preview no-writes

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #56

2026-01-11 15:59:06 +00:00


Feature Specification: Backup/Restore Job Orchestration (049)

Feature Branch: feat/049-backup-restore-job-orchestration
Created: 2026-01-11
Status: Draft
Input: Ensure Backup/Restore “start/execute” actions never run inline in an interactive request; they run via background processing with run records and visible progress.

Purpose

All Backup/Restore “Start/Execute” actions run exclusively via background processing with Run Records and visible progress. This prevents timeouts, double-click duplication, throttling issues, and improves reliability at MSP scale.

Non-Goals (Phase 1)

No new directory/group inventory or name resolution features (separate initiative)
No changes to external service contracts unless required for orchestration safety
No new promotion feature (e.g., DEV→PROD) (separate initiative)

Clarifications

Session 2026-01-11

Q: For FR-004 Idempotency, what should happen when the admin starts the same operation again for the same tenant + target while one is still queued/running? → A: Reuse existing run if identical is queued/running; allow a new run only after terminal.
Q: For FR-002 status lifecycle, do we support canceling runs in Phase 1? → A: No cancel in Phase 1.
Q: For FR-003 Progress visibility, which UI surfaces are required in Phase 1? → A: Phase 1 requires Run detail progress (counts/status) + DB notifications; Phase 2 adds a required global progress widget for all run types.
Q: For FR-004, how should we define the “target object” used for de-duplication? → A: Dedupe key uses (tenant + operation type + target object id).
Q: For FR-005 per-item outcome persistence, what is the item granularity for counts + item results in Phase 1? → A: Per internal DB record (e.g., restore/backup item rows).

User Scenarios & Testing (mandatory)

User Story 1 - Capture snapshot runs in background (Priority: P1)

An admin can start a “capture snapshot” operation without the UI hanging or timing out, and can see progress plus the final result.

Why this priority: Snapshot capture is a core workflow and a common source of long-running requests.

Independent Test: Starting a snapshot capture immediately returns to the UI with a queued Run Record that later transitions to a terminal state (success/failed/partial) and can be inspected.

Acceptance Scenarios:

Given an admin has access to a tenant, When they start “capture snapshot”, Then the UI confirms it was queued and shows a link to the Run Record.
Given a capture snapshot run is executing, When the admin views the run, Then they see progress (items done vs total) and any safe error summaries.

User Story 2 - Backup set create/capture runs in background (Priority: P2)

An admin can create a backup set and optionally start a capture/sync operation without the request doing heavy work.

Why this priority: Creating backup sets is frequent and should not be coupled to long-running capture logic.

Independent Test: Creating a backup set returns quickly and any capture/sync work appears as a run with progress.

Acceptance Scenarios:

Given an admin creates a backup set with capture enabled, When they submit, Then the backup set is created and a capture run is queued.

User Story 3 - Restore runs in background with per-item results (Priority: P1)

An admin can start a “restore to Intune” or “re-run restore” operation as a background run and later inspect item-level outcomes and errors.

Why this priority: Restore is high-impact and must be resilient, observable, and safe under retries.

Independent Test: Starting restore creates a Run Record and item results that remain accessible even if the external service is unavailable.

Acceptance Scenarios:

Given an admin starts a restore, When they confirm the action, Then the UI queues a run and returns immediately (no long-running request).
Given a restore run finishes with mixed outcomes, When the admin views the run details, Then they see succeeded/failed counts and a safe error summary per failed item.
Given an admin executes a live restore, When the run is queued/executed, Then an auditable event is recorded that links to the run.
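
The mixed-outcome scenario can be illustrated with a small summary function that derives counts and a terminal status from persisted item results. This is a Python sketch under assumptions: the `summarize` name, the dict shape of an item result, and the mapping rule (no failures gives succeeded, no successes gives failed, otherwise partial) are illustrative and not confirmed by the spec. An empty item list is treated as succeeded here, which is also an assumption.

```python
from collections import Counter


def summarize(item_results):
    """Derive run counts and a terminal status from persisted item results.

    Each item result is assumed to look like:
    {"item_id": ..., "status": "succeeded" | "failed", "error": str | None}
    """
    counts = Counter(r["status"] for r in item_results)
    succeeded = counts["succeeded"]
    failed = counts["failed"]

    if failed == 0:
        status = "succeeded"
    elif succeeded == 0:
        status = "failed"
    else:
        status = "partial"

    return {
        "status": status,
        "total": len(item_results),
        "succeeded": succeeded,
        "failed": failed,
        # Safe error summary per failed item, keyed by item id.
        "errors": {r["item_id"]: r["error"] for r in item_results if r["status"] == "failed"},
    }
```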

User Story 4 - Dry-run/preview runs in background (Priority: P2)

An admin can run a dry-run/preview without UI timeouts, and the preview results are persisted and shown in the UI.

Why this priority: Preview supports safe change management and must remain usable even when the external service is slow or down.

Independent Test: Starting preview immediately creates a run; once finished, preview outputs are visible and reusable.

Acceptance Scenarios:

Given an admin starts a preview run, When the run completes, Then the UI shows preview results without requiring re-execution.
Given an admin starts a preview/dry-run, When the run executes, Then no write/change is performed against the external system.
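
One way to make the no-writes scenario enforceable rather than conventional is to hand the preview job a client wrapper that rejects mutating calls. The sketch below is illustrative only: `ReadOnlyClient`, `RecordingClient`, and `WriteAttemptedInPreview` are hypothetical names, and treating POST/PUT/PATCH/DELETE as the write set is an assumption.

```python
class WriteAttemptedInPreview(Exception):
    """Raised when a preview/dry-run tries to perform a write."""


class ReadOnlyClient:
    """Wraps a client and blocks mutating HTTP methods (hypothetical guard)."""

    WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

    def __init__(self, inner):
        self._inner = inner

    def request(self, method, path, body=None):
        if method.upper() in self.WRITE_METHODS:
            raise WriteAttemptedInPreview(f"{method} {path} blocked in preview")
        return self._inner.request(method, path, body)


class RecordingClient:
    """Test double that records every request that reaches it."""

    def __init__(self):
        self.log = []

    def request(self, method, path, body=None):
        self.log.append((method, path))
        return {"ok": True}
```

A test can then assert that only read requests ever reached the underlying client during a preview run.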

Edge Cases

Double-clicking an action rapidly
Retrying while an identical run is already queued or running
External service is unavailable (e.g., throttling or outage)
A run gets stuck or exceeds expected duration
Permissions change after a run was queued

Requirements (mandatory)

Constitution alignment (required): If this feature introduces any Microsoft Graph calls or any write/change behavior, the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.

Functional Requirements

FR-001 Job-only execution: The system MUST execute the following operations via background processing and MUST NOT perform heavy work inline during the interactive request:
- Capture snapshot
- Backup set create with capture/sync (when capture is triggered)
- Restore to Intune
- Re-run restore
- Restore dry-run/preview
FR-002 Run Records: Each operation start MUST create (or deterministically re-use) a Run Record before the work begins, containing:
- Tenant identity
- Initiator identity (user reference or audit reference)
- Operation type and optional target object reference
- Status lifecycle: queued → running → (succeeded | failed | partial)
- Started/finished timestamps
- Item counts: total / succeeded / failed
- Safe error code and safe error context (no secrets)
FR-003 Progress visibility: While a run is executing, the system MUST:
- Emit in-app notifications for key state transitions (queued/running/completed/failed)
- Provide a Run detail view that shows progress (status + item counts)
Phase 1 does not require a global progress widget. Phase 2 MUST add a global progress widget, required for all run types.
FR-004 Idempotency & concurrency control: The system MUST prevent uncontrolled duplicate execution due to double-clicks/retries by enforcing a deterministic de-duplication rule keyed by (tenant + operation type + target object) or (tenant + run id). When an identical run is already queued/running, the UI MUST show “already queued/running” and link to the existing run.

Clarification: For an identical start attempt while a run is queued or running, the system MUST re-use the existing Run Record and MUST NOT create a new run. A new run MAY be started only after the existing run reaches a terminal state.

Clarification: For Phase 1, the default de-duplication key is (tenant + operation type + target object id).
FR-005 Deterministic outcome persistence: The system MUST persist per-item outcomes for operations that act on multiple items, including status and a safe error summary, so results can be viewed later without relying on logs.

Clarification: In Phase 1, “item” refers to the internal DB record being acted on (e.g., a restore/backup item row). Counts (total/succeeded/failed) MUST be derived from these persisted item results.
FR-006 Tenant isolation & authorization: Run visibility and execution MUST be tenant-scoped. Only authorized admins can start operations, and users MUST NOT be able to view or start runs across tenants.
FR-007 Safety rules: Preview/dry-run MUST be safe (no writes). Live restore MUST remain guarded with explicit confirmation and an auditable trail consistent with existing safety practices.
FR-008 Resilience (Post-MVP / Phase 2): The system MUST handle external service throttling/outages gracefully, including retries with backoff when appropriate, and MUST end runs in a clear terminal state (failed/partial) rather than silently failing.

Note: MVP/Phase 1 relies on existing retry behavior where present; standardized backoff + jitter hardening is scheduled post-MVP.
FR-009 Safe logging & data minimization: The system MUST NOT store secrets/tokens in Run Records, notifications, or error contexts. Error context MUST be limited to a defined, safe set of fields.
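
FR-009's "defined, safe set of fields" can be read as an allowlist applied before anything is persisted. The following Python sketch assumes a field list and a length cap that the spec does not define: `SAFE_FIELDS`, `safe_error_context`, and `max_len` are hypothetical.

```python
# Hypothetical allowlist; the spec only requires that such a set is defined.
SAFE_FIELDS = ("error_code", "http_status", "item_id", "operation")


def safe_error_context(raw, max_len=200):
    """Keep only allowlisted fields and truncate string values.

    Anything not on the allowlist (tokens, headers, request bodies) is
    dropped rather than masked, so new sensitive keys cannot leak by default.
    """
    safe = {}
    for key in SAFE_FIELDS:
        if key in raw:
            value = raw[key]
            safe[key] = value if isinstance(value, (int, bool)) else str(value)[:max_len]
    return safe
```

Dropping unknown keys, rather than trying to recognize secrets, keeps the default behavior safe when upstream error payloads change.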

Acceptance Checks

Starting any in-scope operation returns quickly with a queued Run Record link.
A Run Record always exists before background work begins, and every run reaches a terminal state.
Phase 1 does not support canceling runs.
Progress and state changes are visible via Run detail view and in-app notifications.
Phase 2 adds a global progress widget for all run types.
Duplicate start attempts for the same tenant + operation + target do not create uncontrolled duplicate execution.
Duplicate start attempts for the same tenant + operation + target while a run is queued/running re-use the existing run and link to it.
Item-level outcomes and safe error summaries are viewable after completion.
Run counts reflect persisted internal item results.
Preview/dry-run never performs writes.
Unauthorized users cannot start runs for a tenant they do not belong to.
Users cannot list/view run records across tenants.
Live restore creates an auditable event linked to the run.

Key Entities (include if feature involves data)

Run Record: A tenant-scoped record representing one started operation and its lifecycle, progress, and summary outcome.
Run Item Result: A tenant-scoped record representing the outcome for a single item processed as part of a Run Record.
Notification Event: A tenant-scoped event surfaced to the admin UI to communicate run state changes.
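
As an illustration of the Run Record entity and the FR-002 status lifecycle, the fields and allowed transitions could be modeled roughly like this. The Python sketch is an assumption about shape only: field names, `TRANSITIONS`, and `transition` are hypothetical, and no cancel state is modeled because Phase 1 has no cancel support.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Status lifecycle: queued -> running -> (succeeded | failed | partial).
TRANSITIONS = {
    "queued": {"running"},
    "running": {"succeeded", "failed", "partial"},
    "succeeded": set(),
    "failed": set(),
    "partial": set(),
}


@dataclass
class RunRecord:
    tenant_id: str
    initiator_id: str
    operation_type: str
    target_id: Optional[str] = None
    status: str = "queued"
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    total: int = 0
    succeeded: int = 0
    failed: int = 0
    error_code: Optional[str] = None  # safe error code only, never secrets

    @property
    def is_terminal(self) -> bool:
        return not TRANSITIONS[self.status]

    def transition(self, new_status: str) -> None:
        """Move to new_status, rejecting anything outside the lifecycle."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        now = datetime.now(timezone.utc)
        if new_status == "running":
            self.started_at = now
        else:
            self.finished_at = now
        self.status = new_status
```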

Success Criteria (mandatory)

Measurable Outcomes

SC-001: For 95% of operation starts, the UI confirms “queued” within 2 seconds.
SC-002: Double-clicking an operation start results in at most one queued/running run for the same tenant + operation + target.
SC-003: 99% of runs end in a clear terminal state (succeeded/failed/partial) with a human-readable summary.
SC-004: Admins can locate the latest run status for an operation in under 30 seconds without requiring access to system logs.

Note: “canceled” is reserved for Phase 2+ (Phase 1 has no cancel support).

Assumptions

This feature builds on the UI safety constraints from 048: admin pages must remain usable even when the external service API is unavailable.
Run Records and item results are retained long enough to support operational troubleshooting and audits, with retention managed as a separate policy.
