ahmido bcf4996a1e feat/049-backup-restore-job-orchestration (#56)

Summary

This PR implements Spec 049 – Backup/Restore Job Orchestration: all critical Backup/Restore execution paths are job-only, idempotent, tenant-scoped, and observable via run records + DB notifications (Phase 1). The UI no longer performs heavy Graph work inside request/Filament actions for these flows.

Why

We want predictable UX and operations at MSP scale:
• no timeouts / long-running requests
• reproducible run state + per-item results
• safe error persistence (no secrets / no token leakage)
• strict tenant isolation + auditability for write paths

What changed

Foundational (Runs + Idempotency + Observability)
• Added a shared RunIdempotency helper (dedupe while queued/running).
• Added a read-only BulkOperationRuns surface (list + view) for status/progress.
• Added DB notifications for run status changes (with “View run” link).
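
The "dedupe while queued/running" behavior can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the project's actual `RunIdempotency` helper (which lives in the Laravel codebase). The `Run` and `RunStore` names, the in-memory list, and the choice of `queued`/`running` as the only blocking statuses are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumption: only these statuses count as "active" and block a new run.
ACTIVE_STATUSES = {"queued", "running"}


@dataclass
class Run:
    id: int
    idempotency_key: str
    status: str = "queued"


@dataclass
class RunStore:
    """Hypothetical in-memory stand-in for the run table."""
    runs: List[Run] = field(default_factory=list)

    def find_active_or_create(self, key: str) -> Tuple[Run, bool]:
        """Return (run, created): reuse an active run with the same key, else create one."""
        for run in self.runs:
            if run.idempotency_key == key and run.status in ACTIVE_STATUSES:
                return run, False
        run = Run(id=len(self.runs) + 1, idempotency_key=key)
        self.runs.append(run)
        return run, True
```

Once a run reaches a terminal status it no longer blocks the key, so the same action can be started again. In a real system the lookup and insert would need a database-level guard against concurrent requests; this sketch ignores that.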

US1 – Policy “Capture snapshot” is job-only
• Policy detail “Capture snapshot” now:
  • creates/reuses a run (dedupe key: tenant + policy.capture_snapshot + policy DB id)
  • dispatches a queued job
  • returns immediately with notification + link to run detail
• Graph capture work moved fully into the job; request path stays Graph-free.
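
One way to derive a dedupe key from the three parts named above (tenant, operation name, policy DB id) is to hash a canonical string. The sketch below is illustrative Python; the function name `build_idempotency_key`, the separator format, and the use of SHA-256 are assumptions, not the PR's actual key format.

```python
import hashlib


def build_idempotency_key(tenant_id: int, operation: str, target_id: int) -> str:
    """Build a stable key from tenant + operation + internal record id (hypothetical format)."""
    raw = f"tenant:{tenant_id}|op:{operation}|target:{target_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Same inputs always give the same key; a different tenant gives a different key.
key = build_idempotency_key(7, "policy.capture_snapshot", 42)
```

Labeling each part (`tenant:`, `op:`, `target:`) keeps different field combinations from collapsing into the same string.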

US3 – Restore runs orchestration is job-only + safe
• Live restore execution is queued and updates RestoreRun status/progress.
• Per-item outcomes are persisted deterministically (per internal DB record).
• Audit logging is written for live restore.
• Preview/dry-run is enforced as read-only (no writes).
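
The "preview is read-only" rule can be enforced at a single choke point that every write passes through. The following Python sketch shows the idea under stated assumptions: `RestoreWriter` and `ReadOnlyViolation` are hypothetical names, and the real enforcement in the Laravel code may look different.

```python
class ReadOnlyViolation(RuntimeError):
    """Raised when a preview/dry-run attempts a write."""


class RestoreWriter:
    """Hypothetical write gateway: all restore writes go through apply()."""

    def __init__(self, is_dry_run: bool) -> None:
        self.is_dry_run = is_dry_run
        self.applied = []  # records (item_id, payload) for live runs only

    def apply(self, item_id: str, payload: dict) -> None:
        # Refuse before any side effect happens, so a dry-run cannot write.
        if self.is_dry_run:
            raise ReadOnlyViolation(f"dry-run may not write item {item_id}")
        self.applied.append((item_id, payload))
```

Failing loudly makes an accidental write during preview show up in tests instead of being silently skipped.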

Tenant isolation / authorization (non-negotiable)
• Run list/view/start are tenant-scoped and policy-guarded (cross-tenant access => 403, not 404).
• Explicit Pest tests cover cross-tenant denial and start authorization.
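
The cross-tenant rule (403, not 404) reduces to comparing the actor's tenant with the run's tenant before anything else is returned. The sketch below is illustrative Python; the project itself uses Laravel policies and Pest tests, and the `User`, `RunRecord`, and `authorize_view` names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: int
    tenant_id: int


@dataclass(frozen=True)
class RunRecord:
    id: int
    tenant_id: int


def authorize_view(user: User, run: RunRecord) -> int:
    """Return an HTTP-style status: 200 for same-tenant access, 403 otherwise."""
    return 200 if user.tenant_id == run.tenant_id else 403
```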

Tests / Verification
• ./vendor/bin/pint --dirty
• Targeted suite (examples):
  • policy capture snapshot queued + idempotency tests
  • restore orchestration + audit logging + preview read-only tests
  • run authorization / tenant isolation tests

Notes / Scope boundaries
• Phase 1 UX = DB notifications + run detail page. A global “progress widget” is tracked as Phase 2 and not required for merge.
• Resilience/backoff is tracked in tasks but can be iterated further after merge.

Review focus
• Dedupe behavior for queued/running runs (reuse vs create-new)
• Tenant scoping & policy gates for all run surfaces
• Restore safety: audit event + preview no-writes

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #56

2026-01-11 15:59:06 +00:00


Data Model: Backup/Restore Job Orchestration (049)

This feature relies on existing “run record” models/tables and (optionally) extends them to meet the orchestration requirements.

Entities

1) RestoreRun (`restore_runs`)

Purpose: Run record for restore executions and dry-run/preview workflows.

Model: App\Models\RestoreRun

Key fields (existing):

id (PK)
tenant_id (FK → tenants)
backup_set_id (FK → backup_sets)
requested_by (string|null)
is_dry_run (bool)
status (string)
requested_items (json|null)
preview (json|null) — persisted preview output
results (json|null) — persisted execution output (may include per-item outcomes)
failure_reason (text|null)
started_at / completed_at (timestamp|null)
metadata (json|null)

Relationships:

RestoreRun belongsTo Tenant
RestoreRun belongsTo BackupSet

State transitions (target):

queued → running → succeeded|failed|partial
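
The target transitions can be written as a small lookup table with a guard function. This is a Python sketch under one assumption: only the transitions listed above are legal, so a direct `queued` to `failed` jump is rejected. The names `ALLOWED_TRANSITIONS` and `transition` are hypothetical.

```python
# Each status maps to the set of statuses it may move to; terminal states map to nothing.
ALLOWED_TRANSITIONS = {
    "queued": {"running"},
    "running": {"succeeded", "failed", "partial"},
    "succeeded": set(),
    "failed": set(),
    "partial": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new status if the move is legal, else raise ValueError."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```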

Validation constraints (creation/dispatch):

tenant-scoped access required
backup_set_id must belong to tenant
preview/dry-run must not perform writes (per the constitution: Read/Write Separation)

2) BulkOperationRun (`bulk_operation_runs`)

Purpose: Run record for background operations that process many internal items, including capture-like actions on backup sets.

Model: App\Models\BulkOperationRun

Key fields (existing):

id (PK)
tenant_id (FK → tenants)
user_id (FK → users)
resource (string) — e.g. policy, backup_set
action (string) — e.g. export, add_policies
status (string) — pending, running, completed, completed_with_errors, failed, aborted
total_items, processed_items, succeeded, failed, skipped
item_ids (jsonb)
failures (jsonb|null) — safe per-item error summaries
audit_log_id (FK → audit_logs|null)

Relationships:

BulkOperationRun belongsTo Tenant
BulkOperationRun belongsTo User

Recommended additions (to satisfy FR-002/FR-004 cleanly):

idempotency_key (string, indexed; uniqueness enforced for active statuses via partial index)
started_at / finished_at (timestampTz)
error_code (string|null)
error_context (jsonb|null)
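
The "uniqueness enforced for active statuses via partial index" idea can be tried out with SQLite, which supports partial unique indexes. The sketch below is illustrative Python using the standard `sqlite3` module. The project's database is presumably PostgreSQL (the `jsonb` columns suggest it), and the index name, the reduced column set, and the choice of `pending`/`running` as the active statuses are assumptions.

```python
import sqlite3
from typing import Optional


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE bulk_operation_runs (
               id INTEGER PRIMARY KEY,
               tenant_id INTEGER NOT NULL,
               status TEXT NOT NULL,
               idempotency_key TEXT
           )"""
    )
    # Partial unique index: the key must be unique only among active rows.
    conn.execute(
        """CREATE UNIQUE INDEX bulk_runs_active_key
           ON bulk_operation_runs (idempotency_key)
           WHERE status IN ('pending', 'running')"""
    )
    return conn


def start_run(conn: sqlite3.Connection, tenant_id: int, key: str) -> Optional[int]:
    """Insert a pending run; return its id, or None if an active run already holds the key."""
    try:
        cur = conn.execute(
            "INSERT INTO bulk_operation_runs (tenant_id, status, idempotency_key) "
            "VALUES (?, 'pending', ?)",
            (tenant_id, key),
        )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return None
```

Because the index covers only active rows, finished runs can accumulate with the same key without blocking a fresh start.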

State transitions (target):

queued → running → succeeded|failed|partial
- pending maps to queued
- completed_with_errors maps to partial
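
The mapping from existing status values to the target vocabulary can be kept in one table. In this Python sketch only the `pending` and `completed_with_errors` entries come from the text above; the identity mappings for `running` and `failed` and the `completed` to `succeeded` mapping are assumptions. `aborted` is deliberately left unmapped because no target equivalent is stated. The names `LEGACY_TO_TARGET` and `to_target_status` are hypothetical.

```python
LEGACY_TO_TARGET = {
    "pending": "queued",                 # stated above
    "completed_with_errors": "partial",  # stated above
    "running": "running",                # assumption: unchanged
    "failed": "failed",                  # assumption: unchanged
    "completed": "succeeded",            # assumption: not stated in the source
}


def to_target_status(legacy: str) -> str:
    """Translate an existing status value; raise ValueError for unmapped values."""
    try:
        return LEGACY_TO_TARGET[legacy]
    except KeyError:
        raise ValueError(f"no target status defined for {legacy!r}") from None
```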

3) Notification Event (DB notifications)

Purpose: Persist state transitions and completion notices for the initiating user.

Storage: Laravel Notifications (DB channel).

Payload shape (target):

tenant_id
run_type (restore_run / bulk_operation_run)
run_id
status (queued/running/succeeded/failed/partial)
counts (optional)
safe_error_code + safe_error_context (optional)
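
A payload builder for this shape might look like the following Python sketch. The field names follow the list above; the function name `build_run_notification`, the validation rules, and the decision to omit optional keys when they are unset are assumptions.

```python
from typing import Any, Dict, Optional

RUN_TYPES = {"restore_run", "bulk_operation_run"}
STATUSES = {"queued", "running", "succeeded", "failed", "partial"}


def build_run_notification(
    tenant_id: int,
    run_type: str,
    run_id: int,
    status: str,
    counts: Optional[Dict[str, int]] = None,
    safe_error_code: Optional[str] = None,
    safe_error_context: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the notification payload; optional fields are left out when not provided."""
    if run_type not in RUN_TYPES:
        raise ValueError(f"unknown run_type: {run_type}")
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    payload: Dict[str, Any] = {
        "tenant_id": tenant_id,
        "run_type": run_type,
        "run_id": run_id,
        "status": status,
    }
    if counts is not None:
        payload["counts"] = dict(counts)
    if safe_error_code is not None:
        payload["safe_error_code"] = safe_error_code
    if safe_error_context is not None:
        payload["safe_error_context"] = dict(safe_error_context)
    return payload
```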

Notes on “per-item outcomes” (FR-005)

For restore workflows, per-item outcomes can initially be stored in restore_runs.results as a structured JSON array/object keyed by internal item identifiers.
For bulk operations, per-item outcomes are already persisted as bulk_operation_runs.failures plus the counter columns.
If Phase 1 turns out to need relational per-item tables for querying/filtering, introduce a dedicated “run item results” table per run type; deferring this to Phase 2 or later is preferred.
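
To make the JSON option concrete, here is one possible shape for `restore_runs.results` keyed by internal item identifiers, with a helper that derives counters and an overall status. This is a Python sketch; the entry fields (`status`, `error_code`), the function name `summarize_results`, and the rule for deriving `partial` are assumptions, not the project's schema.

```python
from collections import Counter
from typing import Dict


def summarize_results(results: Dict[str, dict]) -> dict:
    """Count per-item outcomes and derive an overall run status (assumed rule)."""
    counts = Counter(entry["status"] for entry in results.values())
    succeeded = counts.get("succeeded", 0)
    failed = counts.get("failed", 0)
    if failed == 0:
        status = "succeeded"  # note: an empty result set also lands here
    elif succeeded == 0:
        status = "failed"
    else:
        status = "partial"
    return {"status": status, "succeeded": succeeded, "failed": failed, "total": len(results)}


# Hypothetical example: keys are internal DB ids, values are safe per-item outcomes.
example = {
    "101": {"status": "succeeded"},
    "102": {"status": "failed", "error_code": "graph_conflict"},
}
summary = summarize_results(example)
```

Keying by internal id keeps outcomes deterministic per record, which matches the "per internal DB record" wording in the PR description.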
