TenantAtlas/specs/049-backup-restore-job-orchestration/tasks.md
ahmido bcf4996a1e feat/049-backup-restore-job-orchestration (#56)
Summary

This PR implements Spec 049 – Backup/Restore Job Orchestration: all critical Backup/Restore execution paths are job-only, idempotent, tenant-scoped, and observable via run records + DB notifications (Phase 1). The UI no longer performs heavy Graph work inside request/Filament actions for these flows.

Why

We want predictable UX and operations at MSP scale:
	•	no timeouts / long-running requests
	•	reproducible run state + per-item results
	•	safe error persistence (no secrets / no token leakage)
	•	strict tenant isolation + auditability for write paths

What changed

Foundational (Runs + Idempotency + Observability)
	•	Added a shared RunIdempotency helper (dedupe while queued/running).
	•	Added a read-only BulkOperationRuns surface (list + view) for status/progress.
	•	Added DB notifications for run status changes (with “View run” link).

US1 – Policy “Capture snapshot” is job-only
	•	Policy detail “Capture snapshot” now:
		•	creates/reuses a run (dedupe key: tenant + policy.capture_snapshot + policy DB id)
		•	dispatches a queued job
		•	returns immediately with notification + link to run detail
	•	Graph capture work moved fully into the job; request path stays Graph-free.

US3 – Restore runs orchestration is job-only + safe
	•	Live restore execution is queued and updates RestoreRun status/progress.
	•	Per-item outcomes are persisted deterministically (per internal DB record).
	•	Audit logging is written for live restore.
	•	Preview/dry-run is enforced as read-only (no writes).

Tenant isolation / authorization (non-negotiable)
	•	Run list/view/start are tenant-scoped and policy-guarded (cross-tenant access => 403, not 404).
	•	Explicit Pest tests cover cross-tenant denial and start authorization.

Tests / Verification
	•	./vendor/bin/pint --dirty
	•	Targeted suite (examples):
		•	policy capture snapshot queued + idempotency tests
		•	restore orchestration + audit logging + preview read-only tests
		•	run authorization / tenant isolation tests

Notes / Scope boundaries
	•	Phase 1 UX = DB notifications + run detail page. A global “progress widget” is tracked as Phase 2 and not required for merge.
	•	Resilience/backoff is tracked in tasks but can be iterated further after merge.

Review focus
	•	Dedupe behavior for queued/running runs (reuse vs create-new)
	•	Tenant scoping & policy gates for all run surfaces
	•	Restore safety: audit event + preview no-writes

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #56
2026-01-11 15:59:06 +00:00


Tasks: Backup/Restore Job Orchestration (049)

Input: Design documents from specs/049-backup-restore-job-orchestration/

Prerequisites: plan.md (required), spec.md (required), research.md, data-model.md, contracts/, quickstart.md

Tests: REQUIRED (Pest) for these runtime behavior changes.

MVP scope: Strictly limited to T001–T016 (US1 only). The Phase 7 global progress widget (T037) is Phase 2 and explicitly NOT part of the MVP.

Phase 1: Setup (Shared Infrastructure)

  • T001 Verify queue + DB notifications prerequisites in config/queue.php and database/migrations/notifications (add missing migration if needed)
  • T002 Confirm existing run tables and status enums used by RestoreRun in app/Support/RestoreRunStatus.php and database/migrations/2025_12_10_000150_create_restore_runs_table.php
  • T003 [P] Add quickstart sanity commands for this feature in specs/049-backup-restore-job-orchestration/quickstart.md

Phase 2: Foundational (Blocking Prerequisites)

⚠️ CRITICAL: No user story work should begin until this phase is complete.

  • T004 Add idempotency support to bulk_operation_runs via database/migrations/2026_01_11_120001_add_idempotency_key_to_bulk_operation_runs_table.php
  • T005 Add idempotency support to restore_runs via database/migrations/2026_01_11_120002_add_idempotency_key_to_restore_runs_table.php
  • T006 [P] Add casts/fillables for idempotency + timestamps in app/Models/BulkOperationRun.php and app/Models/RestoreRun.php
  • T007 Implement idempotency key helpers in app/Support/RunIdempotency.php (build key, find active run, enforce reuse)
  • T008 [P] Add a read-only Filament resource to inspect run details for BulkOperationRun in app/Filament/Resources/BulkOperationRunResource.php
  • T009 [P] Add notification for run status transitions in app/Notifications/RunStatusChangedNotification.php (DB channel)
  • T010 Add unit tests for RunIdempotency helpers in tests/Unit/RunIdempotencyTest.php
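
The three helper responsibilities named in T007 (build key, find active run, enforce reuse) can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the project's PHP code: `build_key`, `RunStore`, and `create_or_reuse` are hypothetical names, the in-memory list stands in for the run table, and SHA-256 over canonical JSON is an assumed key format.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Optional

# Dedupe only applies while a run is still queued or running.
ACTIVE_STATUSES = {"queued", "running"}


@dataclass
class Run:
    tenant_id: int
    idempotency_key: str
    status: str = "queued"


def build_key(tenant_id: int, operation: str, target_id: int,
              scope: Optional[dict] = None) -> str:
    """Hash tenant + operation + target (+ optional scope) into a stable key."""
    payload = {
        "tenant": tenant_id,
        "operation": operation,
        "target": target_id,
        "scope": scope or {},
    }
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class RunStore:
    """In-memory stand-in for the run table (assumption for this sketch)."""
    runs: list = field(default_factory=list)

    def find_active(self, key: str) -> Optional[Run]:
        for run in self.runs:
            if run.idempotency_key == key and run.status in ACTIVE_STATUSES:
                return run
        return None

    def create_or_reuse(self, tenant_id: int, operation: str, target_id: int,
                        scope: Optional[dict] = None):
        """Return (run, created). Reuses an active run with the same key."""
        key = build_key(tenant_id, operation, target_id, scope)
        existing = self.find_active(key)
        if existing is not None:
            return existing, False
        run = Run(tenant_id=tenant_id, idempotency_key=key)
        self.runs.append(run)
        return run, True
```

The check-then-insert above is not atomic. Against a real database the same logic would presumably need a lock or a uniqueness constraint to stay safe under concurrent requests.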

CRITICAL (must-fix before implementing any new run flows): Tenant isolation + authorization

  • T042 Add tenant-scoped authorization for run list/view/start across all run flows (BulkOperationRun + RestoreRun) using policies/resources and ensure every query is tenant-scoped (e.g., app/Filament/Resources/BulkOperationRunResource.php, app/Filament/Resources/RestoreRunResource.php, and each start action/page that creates runs)
  • T043 [P] Add Pest feature tests that run list/view are tenant-scoped (cannot list/view another tenant's runs) in tests/Feature/RunAuthorizationTenantIsolationTest.php
  • T044 [P] Add Pest feature tests that unaffiliated users cannot start runs (capture snapshot / restore execute / preview / backup set capture) in tests/Feature/RunStartAuthorizationTest.php
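
The behavior T042–T044 ask for (tenant-scoped listing, and an explicit denial rather than a "not found" on cross-tenant view) can be sketched as below. This is an illustrative Python model, not the project's policy classes: `User`, `Run`, `Forbidden`, `list_runs`, and `view_run` are hypothetical names, and tenant membership is assumed to be a simple set of tenant ids.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Stands in for an HTTP 403 response in this sketch."""


@dataclass(frozen=True)
class User:
    id: int
    tenant_ids: frozenset  # tenants the user is affiliated with


@dataclass(frozen=True)
class Run:
    id: int
    tenant_id: int


def list_runs(user: User, all_runs: list) -> list:
    """Scope the query itself: other tenants' runs never enter the result."""
    return [run for run in all_runs if run.tenant_id in user.tenant_ids]


def view_run(user: User, run: Run) -> Run:
    """Deny cross-tenant access explicitly (403) instead of hiding it as 404."""
    if run.tenant_id not in user.tenant_ids:
        raise Forbidden(f"user {user.id} may not view run {run.id}")
    return run
```

A start action would apply the same membership check before creating or reusing a run.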

Checkpoint: Foundation ready (idempotency + run detail view + notifications).


Phase 3: User Story 1 - Capture snapshot runs in background (Priority: P1) 🎯 MVP

Goal: Capturing a policy snapshot never blocks the UI; it creates/reuses a run record and processes in a queued job with visible progress.

Independent Test: Trigger “Capture snapshot” on a policy; the request returns quickly and a BulkOperationRun transitions queued → running → succeeded|failed|partial, with details viewable.
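
The status flow in the test above (queued → running → succeeded|failed|partial) can be expressed as a small transition table. This is a Python sketch for illustration only, not the project's PHP: the names are hypothetical, and the rule in `finish_status` (any mix of successes and failures yields `partial`, zero failures yields `succeeded`) is an assumption about how the terminal status is chosen.

```python
TERMINAL = frozenset({"succeeded", "failed", "partial"})

# Only the transitions named in the flow are allowed; terminal states have none.
ALLOWED_TRANSITIONS = {
    "queued": frozenset({"running"}),
    "running": TERMINAL,
}


class InvalidTransition(Exception):
    pass


def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not in the table."""
    allowed = ALLOWED_TRANSITIONS.get(current, frozenset())
    if new not in allowed:
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new


def finish_status(succeeded: int, failed: int) -> str:
    """Pick a terminal status from item counts (assumed rule)."""
    if failed == 0:
        return "succeeded"
    if succeeded == 0:
        return "failed"
    return "partial"
```

For example, a run with 3 successful items and 1 failed item would move from `running` to `transition("running", finish_status(3, 1))`, which is `partial`.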

Tests (write first)

  • T011 [P] [US1] Add Pest feature test that capture snapshot queues a job (no inline capture) in tests/Feature/PolicyCaptureSnapshotQueuedTest.php
  • T012 [P] [US1] Add Pest feature test that double-click reuses the active run (idempotency) in tests/Feature/PolicyCaptureSnapshotIdempotencyTest.php

Implementation

  • T013 [US1] Create queued job to capture one policy snapshot in app/Jobs/CapturePolicySnapshotJob.php (updates BulkOperationRun counts + failures)
  • T014 [US1] Update UI action to create/reuse run and dispatch job in app/Filament/Resources/PolicyResource/Pages/ViewPolicy.php
  • T015 [P] [US1] Add linking from UI notifications to BulkOperationRunResource view page in app/Filament/Resources/BulkOperationRunResource.php
  • T016 [US1] Ensure failures are safe/minimized (no secrets) when recording run failures in app/Services/BulkOperationService.php

Checkpoint: User Story 1 is independently usable and testable.


Phase 4: User Story 3 - Restore runs in background with per-item results (Priority: P1)

Goal: Restore execution and re-run restore operate exclusively via queued jobs, with persisted per-item outcomes and safe error summaries visible in the run detail UI.

Independent Test: Starting restore creates/reuses a RestoreRun in queued state, queues execution, and later shows item outcomes without relying on logs.

Tests (write first)

  • T017 [P] [US3] Add Pest feature test that restore execution reuses active run for identical (tenant+backup_set+scope) starts in tests/Feature/RestoreRunIdempotencyTest.php
  • T018 [P] [US3] Extend existing restore job test to assert per-item outcome persistence in tests/Feature/ExecuteRestoreRunJobTest.php
  • T045 [P] [US3] Add Pest feature test that live restore writes an audit event (run-id linked) in tests/Feature/RestoreAuditLoggingTest.php

Implementation

  • T019 [US3] Implement idempotency key computation for restore runs (tenant + operation + target + scope hash) in app/Support/RunIdempotency.php
  • T020 [US3] Update restore run creation/execute flow to reuse active runs (no duplicates) in app/Filament/Resources/RestoreRunResource.php
  • T021 [US3] Update app/Jobs/ExecuteRestoreRunJob.php to set started/finished timestamps and emit DB notifications (queued/running/terminal)
  • T022 [US3] Persist deterministic per-item outcomes into restore_runs.results (keyed by backup_item_id) in app/Services/Intune/RestoreService.php
  • T023 [US3] Derive total/succeeded/failed counts from persisted results and surface in RestoreRunResource view/table in app/Filament/Resources/RestoreRunResource.php
  • T046 [US3] Ensure live restore execution emits an auditable event linked to the run (e.g., audit_logs FK or structured audit record) in app/Jobs/ExecuteRestoreRunJob.php and/or app/Services/Intune/RestoreService.php
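
T022 and T023 fit together: if each outcome is stored under its backup_item_id, a retried item overwrites its earlier entry and the counts can always be recomputed from the stored results. The Python sketch below illustrates the idea and is not the project's PHP: `record_outcome`, `derive_counts`, and the shape of each result entry are hypothetical.

```python
from typing import Optional


def record_outcome(results: dict, backup_item_id: int, status: str,
                   error_code: Optional[str] = None) -> None:
    """Store one item outcome, keyed by backup_item_id.

    Keys are strings because the results would be persisted as JSON.
    Writing the same item twice overwrites the earlier entry, so a retry
    cannot produce duplicate rows.
    """
    results[str(backup_item_id)] = {"status": status, "error_code": error_code}


def derive_counts(results: dict) -> dict:
    """Recompute totals from persisted results instead of keeping counters."""
    succeeded = sum(1 for r in results.values() if r["status"] == "succeeded")
    failed = sum(1 for r in results.values() if r["status"] == "failed")
    return {"total": len(results), "succeeded": succeeded, "failed": failed}
```

Deriving the counts on read avoids a separate counter drifting out of sync with the per-item results when a job is retried.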

Checkpoint: Restore runs are job-only, idempotent, and observable with item outcomes.


Phase 5: User Story 2 - Backup set create/capture runs in background (Priority: P2)

Goal: Creating a backup set and adding policies to a backup set does not perform Graph-heavy snapshot capture inline; capture occurs in jobs with a run record.

Independent Test: Creating a backup set returns quickly and produces a BulkOperationRun showing progress; adding policies via the picker also queues work.

Tests (write first)

  • T024 [P] [US2] Add Pest feature test that backup set create does not run capture inline and instead queues a job in tests/Feature/BackupSetCreateCaptureQueuedTest.php
  • T025 [P] [US2] Add Pest feature test that “Add selected” in policy picker queues background work in tests/Feature/BackupSetPolicyPickerQueuesCaptureTest.php

Implementation

  • T026 [US2] Refactor capture work out of BackupService::createBackupSet into separate methods in app/Services/Intune/BackupService.php
  • T027 [US2] Create queued job to capture backup set items in app/Jobs/CaptureBackupSetJob.php (uses BackupService; updates BulkOperationRun)
  • T028 [US2] Update backup set create flow to create backup_set record quickly and dispatch CaptureBackupSetJob in app/Filament/Resources/BackupSetResource.php
  • T029 [US2] Create queued job to add policies to a backup set (and capture foundations if requested) in app/Jobs/AddPoliciesToBackupSetJob.php
  • T030 [US2] Update bulk action in app/Livewire/BackupSetPolicyPickerTable.php to create/reuse BulkOperationRun and dispatch AddPoliciesToBackupSetJob

Checkpoint: Backup set capture workloads are job-only and observable.


Phase 6: User Story 4 - Dry-run/preview runs in background (Priority: P2)

Goal: Restore preview generation is queued, persisted, and viewable without re-execution.

Independent Test: Clicking “Generate preview” returns quickly; a queued RestoreRun performs the diff generation asynchronously and persists preview output that the UI can display.

Tests (write first)

  • T031 [P] [US4] Add Pest feature test that preview generation queues a job (no inline RestoreDiffGenerator call) in tests/Feature/RestorePreviewQueuedTest.php
  • T032 [P] [US4] Add Pest feature test that preview results persist and are reusable in tests/Feature/RestorePreviewPersistenceTest.php
  • T047 [P] [US4] Add Pest feature test that preview/dry-run never performs writes (must be read-only) in tests/Feature/RestorePreviewReadOnlySafetyTest.php

Implementation

  • T033 [US4] Create queued job to generate preview diffs and persist to restore_runs.preview + metadata in app/Jobs/GenerateRestorePreviewJob.php
  • T034 [US4] Update preview action in app/Filament/Resources/RestoreRunResource.php to create/reuse a dry-run RestoreRun and dispatch GenerateRestorePreviewJob
  • T035 [US4] Update restore run view component to read preview from the persisted run record in resources/views/filament/forms/components/restore-run-preview.blade.php
  • T036 [US4] Emit DB notifications for preview queued/running/completed/failed transitions in app/Jobs/GenerateRestorePreviewJob.php
  • T048 [US4] Enforce preview/dry-run read-only behavior: block write-capable operations and record a safe failure if a write would occur (in app/Jobs/GenerateRestorePreviewJob.php and/or restore diff generation service)
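
The guard described in T048 can be sketched as a wrapper around the HTTP client used during preview. This Python illustration is not the project's PHP: `ReadOnlyClient`, `run_preview`, and the `preview.write_blocked` code are hypothetical, and it assumes the HTTP method is an adequate proxy for "write-capable", which may not hold for every endpoint.

```python
class ReadOnlyViolation(Exception):
    """Raised when a preview tries to use a write-capable HTTP method."""


class ReadOnlyClient:
    """Wraps a client and forwards only methods assumed to be read-only."""

    SAFE_METHODS = frozenset({"GET", "HEAD"})

    def __init__(self, inner):
        self._inner = inner

    def request(self, method: str, url: str, **kwargs):
        if method.upper() not in self.SAFE_METHODS:
            # Block before anything reaches the wrapped client.
            raise ReadOnlyViolation(f"{method.upper()} blocked in preview")
        return self._inner.request(method, url, **kwargs)


def run_preview(run: dict, client, build_diff) -> dict:
    """Build the preview through the guard; record a safe failure on a write."""
    guarded = ReadOnlyClient(client)
    try:
        run["preview"] = build_diff(guarded)
        run["status"] = "succeeded"
    except ReadOnlyViolation:
        run["status"] = "failed"
        run["failure"] = {"code": "preview.write_blocked"}
    return run
```

Because the guard raises before delegating, a blocked call never reaches the wrapped client, which is the property a read-only safety test such as T047 would assert.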

Checkpoint: Preview is asynchronous, persisted, and visible.


Phase 7: Global Progress Widget (UX Phase 2, All Run Types)

  • T037 [P] Add a global progress widget for restore runs (Phase 2 requirement) by extending app/Livewire/BulkOperationProgress.php or adding a dedicated Livewire component in app/Livewire/RestoreRunProgress.php

Phase 8: Polish & Cross-Cutting Concerns

  • T038 Ensure Graph throttling/backoff behavior is applied inside queued jobs (429/503) in app/Services/Intune/PolicySnapshotService.php and app/Services/Intune/RestoreService.php
  • T039 [P] Add/extend run status notification formatting to include safe error codes/contexts in app/Notifications/RunStatusChangedNotification.php
  • T040 Run formatter on modified files: vendor/bin/pint --dirty
  • T041 Run targeted tests for affected areas: tests/Feature/Restore tests/Feature/BackupSet tests/Feature/Policy (use php artisan test with filters)
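
For T038, the retry behavior on 429/503 can be sketched as a loop that honors a numeric Retry-After header and otherwise falls back to capped exponential backoff. This is a Python illustration, not the project's PHP: `call_with_backoff`, its parameters, and the `(status, headers, body)` response shape are assumptions, and only the seconds form of Retry-After is handled.

```python
import time

RETRYABLE = frozenset({429, 503})


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send() must return a (status, headers, body) tuple.
    Returns the (status, body) of the last response.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status not in RETRYABLE or attempt == max_attempts:
            return status, body
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            # Prefer the server's hint when it is given in seconds.
            delay = min(max_delay, float(retry_after))
        else:
            # Otherwise double the delay on each attempt, up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(delay)
```

Inside a queued job it may be preferable to release the job back to the queue with a delay instead of sleeping in the worker; the delay calculation would stay the same.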

Dependencies & Execution Order

Story order

  • Phase 1 → Phase 2 must complete first.
  • After Phase 2:
    • US1 and US3 can proceed in parallel.
    • US4 can proceed in parallel but may be easiest after US3 (shared RestoreRun patterns).
    • US2 can proceed independently after Phase 2.

Dependency graph

  • Setup → Foundational → { US1, US2, US3, US4 } → Polish
  • Setup → Foundational → { US1, US2, US3, US4 } → Phase 2 Global Widget → Polish
  • Suggested minimal MVP: Setup → Foundational → US1

Parallel execution examples

US1

  • In parallel: T011 (queues test), T012 (idempotency test)
  • In parallel: T013 (job), T014 (UI action update) after foundational tasks

US2

  • In parallel: T024 (create queues test), T025 (picker queues test)
  • In parallel: T027 (job) and T029 (job) after BackupService refactor task T026

US3

  • In parallel: T017 (idempotency test), T018 (job behavior test)
  • In parallel: T021 (job notifications) and T023 (UI view enhancements) once results format is defined

US4

  • In parallel: T031 (queues test), T032 (persistence test)
  • In parallel: T033 (job) and T035 (view reads persisted preview) once run persistence shape is agreed

Implementation strategy

  • MVP (fastest value): deliver US1 first (policy snapshot capture becomes queued + idempotent + observable).
  • Next: US3 + US4 to fully de-risk restore execution and preview.
  • Then: US2 to eliminate inline Graph work from backup set flows.

Format validation

All tasks above follow the required checklist format: - [ ] T### [P?] [US#?] Description with file path