spec: 053 unify runs monitoring

2026-01-16 16:04:17 +01:00 · 2026-01-16 16:04:17 +01:00 · d9554010ac
commit d9554010ac
parent c60d16ffba
9 changed files with 869 additions and 0 deletions
--- a/specs/053-unify-runs-monitoring/checklists/requirements.md
+++ b/specs/053-unify-runs-monitoring/checklists/requirements.md
@ -0,0 +1,35 @@
+# Specification Quality Checklist: Unified Operations Runs + Monitoring Hub (053)
+
+**Purpose**: Validate specification completeness and quality before proceeding to planning  
+**Created**: 2026-01-15  
+**Feature**: [spec.md](../spec.md)
+
+## Content Quality
+
+- [x] No implementation details (languages, frameworks, APIs)
+- [x] Focused on user value and business needs
+- [x] Written for non-technical stakeholders
+- [x] All mandatory sections completed
+
+## Requirement Completeness
+
+- [x] No [NEEDS CLARIFICATION] markers remain
+- [x] Requirements are testable and unambiguous
+- [x] Success criteria are measurable
+- [x] Success criteria are technology-agnostic (no implementation details)
+- [x] All acceptance scenarios are defined
+- [x] Edge cases are identified
+- [x] Scope is clearly bounded
+- [x] Dependencies and assumptions identified
+
+## Feature Readiness
+
+- [x] All functional requirements have clear acceptance criteria
+- [x] User scenarios cover primary flows
+- [x] Feature meets measurable outcomes defined in Success Criteria
+- [x] No implementation details leak into specification
+
+## Notes
+
+- Items marked incomplete require spec updates before `/speckit.clarify` or `/speckit.plan`
+- Validation run on 2026-01-15; no issues found.
--- a/specs/053-unify-runs-monitoring/checklists/writing.md
+++ b/specs/053-unify-runs-monitoring/checklists/writing.md
@ -0,0 +1,78 @@
+# Requirements Writing Checklist: Unified Operations Runs + Monitoring Hub (053)
+
+**Purpose**: “Unit tests for English” to validate the clarity, completeness, and internal consistency of `spec.md` before implementation planning  
+**Created**: 2026-01-16  
+**Feature**: [spec.md](../spec.md)
+
+**Note**: These items validate what the requirements *say* (or don’t say). They are not implementation/QA tests.
+
+## Requirement Completeness
+
+- [x] CHK001 Are all required run attributes explicitly enumerated (type, status, timestamps, initiator, label)? [Completeness, Spec §FR-001] — Evidence: Spec §FR-001 “...run type, scope/target (when applicable), status, timestamps (created/started/finished), initiator (user or system), and a human-readable label...”
+- [x] CHK002 Are list requirements complete for Monitoring/Operations (default sort, default time window, and filterable fields/values)? [Completeness, Spec §FR-002] — Evidence: Spec §FR-002 “...sorted most-recent-first by default, defaulting to... (last 30 days), and supporting filtering by run type, status, and time range...”
+- [x] CHK003 Are run detail requirements complete, including which summary counts are required and when they apply (total/succeeded/failed/skipped)? [Completeness, Spec §FR-003] — Evidence: Spec §FR-003 “For itemized operations... counts MUST include `total`, `succeeded`, `failed`, and `skipped` (if applicable).”
+- [x] CHK004 Are Phase 1 included operations explicitly listed, and are all other candidate operations explicitly deferred? [Completeness, Spec §Scope & Assumptions] — Evidence: Spec §Scope & Assumptions “Phase 1 supported operations are: Drift generation; Backup Set “Add Policies”. All other candidate operations are explicitly deferred to Phase 2+...”
+- [x] CHK005 Are “related artifact” link requirements defined per Phase 1 operation (drift findings link; backup set context link)? [Completeness, Spec §FR-012, Spec §FR-013] — Evidence: Spec §FR-012 “...run detail view MUST provide a link to that artifact.” + Spec §FR-013 “...link back to the related backup set context.”
+- [x] CHK006 Are notification requirements complete across lifecycle events, including who receives them (initiator only vs tenant operators)? [Completeness, Spec §FR-011] — Evidence: Spec §FR-011 “Phase 1 notifications MUST be delivered to the initiating user...”; Spec §User Story 2 scenarios 4–5 define queued + completion notifications.
+
+## Requirement Clarity
+
+- [x] CHK007 Is “supported long-running operation” bounded and unambiguous for Phase 1 (what is included/excluded)? [Clarity, Spec §Scope & Assumptions] — Evidence: Spec §Scope & Assumptions “Phase 1 supported operations are: Drift generation; Backup Set “Add Policies”...”
+- [x] CHK008 Is the run status set closed or open, and are optional statuses (e.g., canceled/aborted) explicitly addressed as included or excluded? [Clarity, Spec §FR-004] — Evidence: Spec §FR-004 “Phase 1 status set: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`... Cancellation/abort outcomes are deferred to Phase 2.”
+- [x] CHK009 Is the dedupe definition of “identical run” unambiguous (scope components, time window, and whether initiator matters)? [Clarity, Spec §FR-006] — Evidence: Spec §FR-006 “Identical means... same tenant... same run type... same scope/target... same effective inputs... The initiator MUST NOT be part of the identity...”
+- [x] CHK010 Are “sanitized” and “minimized” failure detail requirements defined with explicit redaction expectations and allowed/forbidden content? [Clarity, Spec §FR-010] — Evidence: Spec §FR-010 “MUST NOT include secrets, credentials, tokens, PII, or full external payload dumps...” and defines allowed per-item references.
+- [x] CHK011 Is “human-readable label” defined (format, stability, and whether localization is required) or explicitly deferred? [Clarity, Spec §FR-001] — Evidence: Spec §FR-001 “label MUST be a stable operator-facing description combining run type and scope/target (English-only in Phase 1; localization deferred to Phase 2).”
+- [x] CHK012 Is the “View run” link requirement specific about its destination and the surfaces where it must appear (immediate confirmation vs lifecycle notifications)? [Clarity, Spec §FR-005, Spec §FR-011] — Evidence: Spec §FR-005 “...‘View run’ link that opens the run detail view...” + Spec §FR-011 “Notifications MUST include a ‘View run’ link that opens the run detail view.”
+
+## Requirement Consistency
+
+- [x] CHK013 Is status vocabulary consistent across the spec (e.g., “completed” vs “succeeded/partially succeeded/failed”)? [Consistency, Spec §FR-004, Spec §SC-001] — Evidence: Spec §FR-004 and Spec §SC-001 use the same status set: `queued` / `running` / `succeeded` / `partially succeeded` / `failed`.
+- [x] CHK014 Are view-only constraints consistent across Clarifications, acceptance scenarios, and FR-014 (no manage controls in hub)? [Consistency, Spec §Clarifications, Spec §User Story 1, Spec §FR-014] — Evidence: Spec §Clarifications “Monitoring/Operations is view-only in Phase 1...” + Spec §User Story 1 scenario 5 “...view-only...” + Spec §FR-014 “...MUST be view-only...”
+- [x] CHK015 Are permissions consistent across scenarios and FR-009 (who can view vs start), including `Readonly` restrictions? [Consistency, Spec §Scope & Assumptions, Spec §User Story 2, Spec §FR-009] — Evidence: Spec §Scope & Assumptions roles bullet defines `Readonly` view-only; Spec §User Story 2 scenario 2 denies start; Spec §FR-009 forbids `Readonly` start/manage.
+- [x] CHK016 Are dedupe semantics consistent between user story scenarios, FR-006, and SC-003? [Consistency, Spec §User Story 3, Spec §FR-006, Spec §SC-003] — Evidence: Spec §User Story 3 scenario 2 (reuses existing run) + Spec §FR-006 (reuse queued/running) + Spec §SC-003 (no more than one active run).
+
+## Acceptance Criteria Quality
+
+- [x] CHK017 Does each user story have acceptance scenarios specific enough to validate without unstated assumptions (filters, links, permissions)? [Measurability, Spec §User Scenarios & Testing] — Evidence: Spec §User Story 1 scenarios specify default sort/window + filters + cross-tenant denial; Spec §User Story 2 scenarios specify notifications + background-unavailable behavior; Spec §User Story 3 scenarios specify findings link + failure summary.
+- [x] CHK018 Are success criteria measurable with defined measurement methods/proxies (e.g., how “under 30 seconds” is assessed)? [Measurability, Spec §Success Criteria] — Evidence: Spec §SC-001 “measured via timed operator walkthroughs...” + Spec §SC-002/SC-003/SC-004 include measurement notes.
+- [x] CHK019 Is “within 2 seconds under normal conditions” defined with explicit conditions/examples so it can be measured consistently? [Clarity, Spec §SC-002] — Evidence: Spec §SC-002 defines “normal conditions” (no active service degradation; typical tenant dataset sizes; excludes maintenance/outage windows).
+- [x] CHK020 Are “99% of repeated-start attempts” and “95% of cases” scoped precisely (population, timeframe, and measurement approach)? [Clarity, Spec §SC-003, Spec §SC-004] — Evidence: Spec §SC-003 and §SC-004 specify Phase 1 population, rolling 30-day window, and measurement approaches.
+
+## Scenario Coverage
+
+- [x] CHK021 Are requirements/scenarios present for system-initiated runs (no interactive initiator) and how they appear in Monitoring/Operations? [Coverage, Spec §Scope & Assumptions, Spec §User Story 1] — Evidence: Spec §Scope & Assumptions “System-initiated runs may exist... initiator shown as ‘System’.” + Spec §User Story 1 scenario 9.
+- [x] CHK022 Are scenarios present for cross-tenant access attempts for both list and detail views (and expected denial behavior)? [Coverage, Spec §User Story 1, Spec §FR-008] — Evidence: Spec §User Story 1 scenario 7 “...access is denied and no run data is disclosed.” + Spec §FR-008 “...MUST NOT disclose run existence or details.”
+- [x] CHK023 Do scenarios cover related-artifact behavior across outcomes (results link on success; safe failure summary on failure)? [Coverage, Spec §User Story 3, Spec §FR-012] — Evidence: Spec §User Story 3 scenario 3 (findings link on success) + scenario 4 (safe failure summary on failure) + Spec §FR-012.
+- [x] CHK024 Is the “permissions changed mid-run” scenario defined with explicit expected outcomes (viewability and notifications)? [Completeness, Spec §Edge Cases] — Evidence: Spec §Edge Cases “visibility is evaluated at time of access; and completion notifications are delivered only if the recipient remains authorized...”
+
+## Edge Case Coverage
+
+- [x] CHK025 Are Edge Cases written as explicit expected behaviors (not only open questions), or explicitly deferred with ownership/timing? [Completeness, Spec §Edge Cases] — Evidence: Spec §Edge Cases bullets are written as “If... MUST...” statements (no open placeholders).
+- [x] CHK026 Is drift eligibility defined (minimum data needed) and the user-facing outcome when eligibility is not met? [Completeness, Spec §User Story 3, Spec §Edge Cases] — Evidence: Spec §User Story 3 scenario 5 “...not enough eligible data...” + Spec §Edge Cases references failing/refusing with reason code (e.g., `insufficient_data`) and actionable message.
+- [x] CHK027 Are “very large scope” partial completion expectations defined (counts semantics; when partial vs failed applies)? [Clarity, Spec §User Story 1, Spec §FR-004] — Evidence: Spec §User Story 1 scenario 8 defines partial vs failed; Spec §FR-004 defines status semantics; Spec §Edge Cases addresses large scopes.
+- [x] CHK028 Are failure-display requirements defined for sensitive underlying errors (what is shown vs redacted)? [Clarity, Spec §Edge Cases, Spec §FR-010] — Evidence: Spec §Edge Cases “...only stable reason codes and short sanitized messages are shown...” + Spec §FR-010 “MUST NOT include secrets... tokens, PII, or full external payload dumps.”
+
+## Non-Functional Requirements
+
+- [x] CHK029 Are confidentiality constraints for stored/displayed failures explicit and framed as hard requirements (no secrets/tokens/full payload dumps)? [Non-Functional, Security, Spec §FR-010] — Evidence: Spec §FR-010 “MUST NOT include secrets, credentials, tokens, PII, or full external payload dumps.”
+- [x] CHK030 Are reliability expectations defined for background execution and notification delivery, including user-visible behavior when background execution is unavailable? [Completeness, Spec §Scope & Assumptions, Spec §FR-005, Spec §FR-011] — Evidence: Spec §FR-005 defines behavior when background execution is unavailable; Spec §FR-011 “If a notification cannot be delivered, Monitoring/Operations remains the source of truth...”
+- [x] CHK031 Are scalability expectations defined for Monitoring/Operations usage (expected run volume, retention horizon, and list size constraints) or explicitly deferred? [Completeness, Spec §Scope & Assumptions] — Evidence: Spec §Scope & Assumptions “Run retention horizon and scale targets... are deferred to Phase 2 (Owner: Product)...”
+- [x] CHK032 Are auditability requirements explicit beyond “initiator metadata” (what constitutes the audit trail for start/outcome/view)? [Completeness, Spec §Scope & Assumptions] — Evidence: Spec §Scope & Assumptions “Auditability for Phase 1 is achieved via the run record... and lifecycle notifications. Auditing ‘who viewed which run’ is deferred to Phase 2...”
+
+## Dependencies & Assumptions
+
+- [x] CHK033 Are Phase 1 adoption assumptions (which operations adopt the unified run model now vs later) stable, complete, and traceable to requirements? [Assumption, Spec §Scope & Assumptions, Spec §FR-002, Spec §FR-007, Spec §FR-013] — Evidence: Spec §Scope & Assumptions lists Phase 1 supported ops; Spec §FR-002 requires filtering for them; Spec §FR-007 and §FR-013 require run tracking for them.
+- [x] CHK034 Are environment dependencies (background execution) mapped to requirement-level behavior when unmet (operator guidance, degraded mode)? [Completeness, Spec §Scope & Assumptions, Spec §FR-005] — Evidence: Spec §Scope & Assumptions describes dependency; Spec §FR-005 defines clear error + no misleading “queued” confirmation when unavailable.
+- [x] CHK035 Is tenant isolation defined beyond “forbidden cross-tenant” with enough specificity to guide acceptance and review (e.g., scoping expectations)? [Clarity, Spec §FR-008, Spec §User Story 1] — Evidence: Spec §FR-008 “Tenant scoping MUST be applied before any filtering... MUST NOT disclose run existence or details.” + Spec §User Story 1 scenario 7.
+
+## Ambiguities & Conflicts
+
+- [x] CHK036 Are the roles `Owner`/`Manager`/`Operator`/`Readonly` defined or referenced so permission requirements are interpretable to reviewers? [Clarity, Spec §Scope & Assumptions] — Evidence: Spec §Scope & Assumptions roles bullet defines view/start capabilities per role.
+- [x] CHK037 Is “Monitoring/Operations” named consistently and distinguished from per-feature start surfaces (avoid competing synonyms)? [Consistency, Spec §User Story 1, Spec §Scope & Assumptions, Spec §FR-014] — Evidence: Spec consistently uses “Monitoring/Operations” for the hub and states start controls remain in feature areas (Spec §Scope & Assumptions; Spec §FR-014).
+- [x] CHK038 Is “run type” terminology consistent between Key Entities and filtering/permissions language (avoid conflicting synonyms like type/resource/action)? [Consistency, Spec §Key Entities, Spec §FR-002, Spec §FR-009] — Evidence: Spec §Key Entities defines “Run Type”; Spec §FR-002 and §FR-009 use “run type” for filtering/permissions.
+- [x] CHK039 Is traceability complete: does every “MUST” requirement have at least one corresponding scenario and/or measurable success criterion (or an explicit rationale if not)? [Traceability, Spec §Functional Requirements, Spec §User Scenarios & Testing, Spec §Success Criteria] — Evidence: US1 scenarios cover FR-002/003/004/008/014; US2 scenarios cover FR-005/011/013; US3 scenarios cover FR-006/007/012; Success Criteria (SC-001–SC-005) provide measurable outcomes for the overall feature.
+
+## Notes
+
+- Mark items as completed: `[x]`
+- Capture findings inline under the relevant item(s) (add links/quotes from `spec.md` as needed)
--- a/specs/053-unify-runs-monitoring/contracts/admin-pages.openapi.yaml
+++ b/specs/053-unify-runs-monitoring/contracts/admin-pages.openapi.yaml
@ -0,0 +1,54 @@
+openapi: 3.0.3
+info:
+  title: TenantPilot Admin Monitoring Contracts (Feature 053)
+  version: 0.1.0
+  description: |
+    Minimal page-render contracts for the Monitoring/Operations hub.
+
+    These pages must render from the database only (no external tenant calls)
+    and display only sanitized failure detail (no secrets/tokens/raw payload dumps).
+
+servers:
+  - url: /
+
+paths:
+  /admin/t/{tenantExternalId}/bulk-operation-runs:
+    get:
+      operationId: monitoringOperationsIndex
+      summary: Monitoring → Operations (tenant-scoped)
+      parameters:
+        - name: tenantExternalId
+          in: path
+          required: true
+          schema:
+            type: string
+      responses:
+        '200':
+          description: Page renders successfully.
+        '302':
+          description: Redirect to login when unauthenticated.
+
+  /admin/t/{tenantExternalId}/bulk-operation-runs/{bulkOperationRunId}:
+    get:
+      operationId: monitoringOperationsView
+      summary: Operation run detail (tenant-scoped)
+      parameters:
+        - name: tenantExternalId
+          in: path
+          required: true
+          schema:
+            type: string
+        - name: bulkOperationRunId
+          in: path
+          required: true
+          schema:
+            type: integer
+      responses:
+        '200':
+          description: Page renders successfully.
+        '302':
+          description: Redirect to login when unauthenticated.
+        '403':
+          description: Forbidden when attempting cross-tenant access.
+
+components: {}
--- a/specs/053-unify-runs-monitoring/data-model.md
+++ b/specs/053-unify-runs-monitoring/data-model.md
@ -0,0 +1,107 @@
+# Data Model: Unified Operations Runs + Monitoring Hub (053)
+
+This feature primarily standardizes and surfaces existing run records for long-running operations, and links operators to the underlying business artifacts (e.g., drift findings).
+
+## Entities
+
+## 1) BulkOperationRun (`bulk_operation_runs`)
+
+**Purpose:** Canonical tenant-scoped run record for long-running operations (Phase 1).
+
+**Model:** `App\Models\BulkOperationRun`
+
+**Key fields (existing):**
+- `id` (PK)
+- `tenant_id` (FK → tenants)
+- `user_id` (FK → users)
+- `resource` (string) — e.g. `drift`, `backup_set`
+- `action` (string) — e.g. `generate`, `add_policies`
+- `idempotency_key` (string|null)
+- `status` (string) — `pending`, `running`, `completed`, `completed_with_errors`, `failed`, `aborted`
+- counters: `total_items`, `processed_items`, `succeeded`, `failed`, `skipped`
+- `item_ids` (jsonb) — stable identifiers for the items/scope of the run
+  - Example (`drift.generate`): `{ scope_key, baseline_run_id, current_run_id }`
+  - Example (`backup_set.add_policies`): `{ backup_set_id, policy_ids, options }`
+- `failures` (jsonb|null) — sanitized failure details (including per-item failures for itemized operations)
+- `audit_log_id` (FK → audit_logs|null)
+- `created_at`, `updated_at`
+
+**Relationships:**
+- `BulkOperationRun belongsTo Tenant`
+- `BulkOperationRun belongsTo User`
+- `BulkOperationRun belongsTo AuditLog` (nullable)
+
+**Uniqueness / idempotency:**
+- Active-run uniqueness enforced via a partial unique index on `(tenant_id, idempotency_key)` for active statuses.
+- Idempotency keys are deterministic and stable per tenant + operation type + scope.
+
+**State transitions (storage):**
+- `pending → running → completed | completed_with_errors | failed | aborted`
+
+**Status mapping (operator UI semantics):**
+- `pending` → `queued`
+- `running` → `running`
+- `completed` → `succeeded`
+- `completed_with_errors` → `partially succeeded`
+- `failed`/`aborted` → `failed`
+
+**Failure entry shape (sanitized):**
+- `reason_code` (string, stable) + `reason` (short sanitized message)
+- for itemized runs: `item_id` per failure entry (and optional `type=skipped` for non-failure outcomes)
+
+---
+
+## 2) Finding (`findings`) — Drift results
+
+**Purpose:** Persisted analytic findings; drift findings are the primary “related artifact” for Drift generation runs.
+
+**Model:** `App\Models\Finding`
+
+**Key fields (existing, drift-related):**
+- `id` (PK)
+- `tenant_id` (FK → tenants)
+- `finding_type` (`drift`)
+- `scope_key` (string)
+- `baseline_run_id` (FK → inventory_sync_runs|null)
+- `current_run_id` (FK → inventory_sync_runs|null)
+- `fingerprint` (string; deterministic; unique per tenant)
+- `subject_type`, `subject_external_id`
+- `status` (`new|acknowledged`)
+- `evidence_jsonb` (jsonb; sanitized allowlist)
+- `created_at`, `updated_at`
+
+**Relationships:**
+- `Finding belongsTo Tenant`
+- `Finding belongsTo InventorySyncRun` via `baseline_run_id` and `current_run_id` (nullable)
+
+**Notes:**
+- Phase 1 can link operators from the drift run to findings through scope/baseline/current identifiers without introducing a new DB foreign key.
+- If later needed, introduce an explicit link (e.g., `findings.bulk_operation_run_id`) to make navigation and reporting easier.
+
+---
+
+## 3) InventorySyncRun (`inventory_sync_runs`) — Drift inputs
+
+**Purpose:** “Last observed” inventory run records used as baseline/current inputs for drift comparisons.
+
+**Model:** `App\Models\InventorySyncRun`
+
+**Relevant fields (existing):**
+- `tenant_id`
+- `status`
+- `selection_hash` (used as `scope_key`)
+- `finished_at`
+
+---
+
+## 4) Notification Event (DB notifications)
+
+**Purpose:** Persist run lifecycle notifications (queued/completed) linking operators to the run detail page.
+
+**Storage:** Laravel Notifications (DB channel).
+
+**Payload (target):**
+- tenant identifier
+- run identifier + type (`bulk_operation_run`)
+- status bucket (queued/running/succeeded/partial/failed)
+- summary counts and a safe error summary (when applicable)
--- a/specs/053-unify-runs-monitoring/plan.md
+++ b/specs/053-unify-runs-monitoring/plan.md
@ -0,0 +1,150 @@
+# Implementation Plan: Unified Operations Runs + Monitoring Hub (053)
+
+**Branch**: `feat/053-unify-runs-monitoring` | **Date**: 2026-01-16 | **Spec**: `/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/spec.md` ([spec.md](spec.md))
+**Input**: Feature specification from `/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/spec.md`
+
+**Note**: This plan is filled in by the `/speckit.plan` command. See `/Users/ahmeddarrazi/Documents/projects/TenantAtlas/.specify/scripts/` for helper scripts.
+
+## Summary
+
+Unify how long-running operations are tracked and monitored by using the existing tenant-scoped run record (`BulkOperationRun`) as the canonical “operation run”, surfacing it in a single Monitoring/Operations hub (view-only, tenant-scoped, role-aware), and standardizing status semantics, notifications, failure detail minimization, and idempotent de-duplication.
+
+Phase 1 adoption scope (per clarifications): Drift generation + Backup Set “Add Policies”.
+
+## Technical Context
+
+**Language/Version**: PHP 8.4.15 (Laravel 12)  
+**Primary Dependencies**: Filament v4, Livewire v3  
+**Storage**: PostgreSQL (JSONB for run `item_ids` and `failures`)  
+**Testing**: Pest v4 (PHPUnit v12)  
+**Target Platform**: Web application (Sail-first locally, Dokploy-first deploy)  
+**Project Type**: web  
+**Performance Goals**: Monitoring/Operations index renders in <1s for the most recent ~250 runs; start actions confirm and provide a “View run” link within 2 seconds (aligns with SC-002).  
+**Constraints**: Monitoring pages are DB-only and view-only; strict tenant isolation; no secrets/tokens stored; run failures use stable reason codes + short sanitized messages; itemized runs store per-item failures (sanitized).  
+**Scale/Scope**: Tenant-scoped run history across multiple modules; Phase 1 covers drift generation + backup set “add policies”, with more run producers added in later phases.
+
+## Constitution Check
+
+*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
+
+- Inventory-first: PASS. Monitoring uses persisted run records; drift generation is based on inventory sync “last observed” state and stores findings (not raw snapshots).
+- Read/write separation: PASS. Monitoring/Operations is view-only (no start/rerun/cancel/delete). Write actions remain in their feature UIs and already use queued jobs + auditability.
+- Graph contract path: PASS (no new Graph calls introduced by Monitoring). Existing Graph calls remain behind existing abstractions and must not occur during Monitoring page render.
+- Deterministic capabilities: N/A for this feature (no new capability resolver). Existing idempotency key builder remains deterministic.
+- Tenant isolation: PASS. Run list/view/start remain tenant-scoped; cross-tenant access is forbidden (403).
+- Automation: PASS. Active-run de-duplication uses deterministic idempotency keys + partial unique indexes; runs remain observable with status + counts + safe errors.
+- Data minimization: PASS. Failures are sanitized/minimized; no secrets/tokens/raw external payload dumps stored or displayed.
+
+**Gate status (pre-Phase 0)**: PASS (no violations).
+
+**Gate status (post-Phase 1)**: PASS (design artifacts present: research.md, data-model.md, contracts/*, quickstart.md).
+
+## Project Structure
+
+### Documentation (this feature)
+
+```text
+/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/
+├── plan.md                     # This file (/speckit.plan command output)
+├── spec.md                     # Feature specification (input)
+├── checklists/
+│   └── requirements.md         # Spec quality checklist
+├── research.md                 # Phase 0 output
+├── data-model.md               # Phase 1 output
+├── quickstart.md               # Phase 1 output
+├── contracts/                  # Phase 1 output
+└── tasks.md                    # Phase 2 output (/speckit.tasks command - NOT created by /speckit.plan)
+```
+
+### Source Code (repository root)
+```text
+app/
+├── Filament/
+│   ├── Pages/
+│   └── Resources/
+├── Jobs/
+├── Models/
+├── Policies/
+├── Services/
+└── Support/
+
+config/
+
+database/
+└── migrations/
+
+routes/
+
+tests/
+├── Feature/
+└── Unit/
+```
+
+**Structure Decision**: Laravel web application. Implement Monitoring/Operations primarily via Filament (Resources/Pages) and reuse existing run-record primitives (`bulk_operation_runs`) with tenant-scoped policies and Pest tests.
+
+## Key Implementation Decisions (Pinned)
+
+- **Phase 1 scope**: Monitoring/Operations hub + Drift generation + Backup Set “Add Policies”.
+- **Monitoring permissions**: `Owner`, `Manager`, `Operator`, `Readonly` can view. `Readonly` is strictly view-only.
+- **Monitoring guardrail**: Monitoring/Operations is view-only in Phase 1 (no start/rerun/cancel/delete).
+- **Status semantics**: UI-level statuses are consistent and testable:
+  - `partially succeeded` = at least one success and at least one failure
+  - `failed` = zero successes (or the run could not proceed)
+- **Failure detail**: Stable reason codes + short sanitized messages; itemized operations include per-item failures (sanitized).
+
+## Execution Model
+
+### Run record primitive
+
+- Canonical run record: `App\Models\BulkOperationRun` (tenant-scoped) for Phase 1.
+- Producers in Phase 1:
+  - Drift generation: `resource=drift`, `action=generate`
+  - Backup Set “Add Policies”: `resource=backup_set`, `action=add_policies` (or existing canonical action naming)
+
+### Status mapping (storage ↔ UI semantics)
+
+The UI MUST present consistent meanings while allowing storage to keep existing vocabulary:
+
+- `pending` → `queued`
+- `running` → `running`
+- `completed` → `succeeded`
+- `completed_with_errors` → `partially succeeded`
+- `failed` / `aborted` → `failed`
+
+### Idempotency & de-duplication
+
+- Deterministic idempotency key per tenant + operation type + scope via `App\Support\RunIdempotency`.
+- Active-run reuse: if an identical run is `pending` or `running`, reuse it (return the existing run and link to it).
+- Race reduction: rely on the existing partial unique index for active runs and handle collisions by “find existing and reuse”.
+
+### Notifications
+
+- Use DB notifications for “queued” and “completed” lifecycle events, linking to the run detail page.
+- Notifications and persisted run failures must remain sanitized (no secrets/tokens/raw payloads).
+
+### Monitoring/Operations hub
+
+- Central list + filters for the active tenant:
+  - filter by `resource`/`action`, status bucket (queued/running/succeeded/partial/failed), and time range
+  - drill-down to run detail (status + counts + sanitized failures + item identifiers)
+- View-only: no hub actions to start, rerun, cancel, or delete runs.
+
+## Definition of Done (ends at Phase 2 planning)
+
+### Phase 2 (MVP implementation readiness)
+
+- Monitoring/Operations navigation exists and lists tenant runs with the required filters and drill-down.
+- Role guardrail enforced: `Readonly` can view list + detail but has no action controls.
+- Status bucket semantics are consistent and testable (including partial vs failed).
+- Drift generation and Backup Set “Add Policies” runs are visible and linkable from their feature entry points and from Monitoring/Operations.
+- Design artifacts exist and are referenced by this plan:
+  - `/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/research.md`
+  - `/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/data-model.md`
+  - `/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/contracts/`
+  - `/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/quickstart.md`
+
+## Complexity Tracking
+
+> **Fill ONLY if Constitution Check has violations that must be justified**
+
+N/A (no constitution violations anticipated)
--- a/specs/053-unify-runs-monitoring/quickstart.md
+++ b/specs/053-unify-runs-monitoring/quickstart.md
@ -0,0 +1,27 @@
+# Quickstart: Unified Operations Runs + Monitoring Hub (053)
+
+## Goal
+
+Provide a single Monitoring/Operations area (view-only) to observe tenant-scoped long-running runs with consistent status semantics, safe failure visibility, and links to related artifacts. Phase 1 scope includes Drift generation and Backup Set “Add Policies”.
+
+## Local development
+
+- Bring Sail up: `./vendor/bin/sail up -d`
+- Run migrations: `./vendor/bin/sail artisan migrate`
+- Run a queue worker (separate terminal): `./vendor/bin/sail artisan queue:work`
+
+## Testing
+
+Run the most relevant tests first:
+
+- Tenant scoping for runs: `./vendor/bin/sail artisan test tests/Feature/RunAuthorizationTenantIsolationTest.php`
+- Drift generation run dispatch: `./vendor/bin/sail artisan test tests/Feature/Drift/DriftGenerationDispatchTest.php`
+- Drift job notifications + failure details: `./vendor/bin/sail artisan test tests/Feature/Drift/GenerateDriftFindingsJobNotificationTest.php`
+- Backup Set “Add Policies” job orchestration: `./vendor/bin/sail artisan test tests/Feature/BackupSets/AddPoliciesToBackupSetJobTest.php`
+- Status bucket semantics: `./vendor/bin/sail artisan test tests/Unit/BulkOperationRunStatusBucketTest.php`
+
+## Operational notes
+
+- Monitoring pages must remain DB-only (no external tenant calls on render).
+- Run failures must stay sanitized and must not contain secrets/tokens.
+- Ensure queue workers are running in staging/production so runs complete outside interactive sessions.
--- a/specs/053-unify-runs-monitoring/research.md
+++ b/specs/053-unify-runs-monitoring/research.md
@ -0,0 +1,106 @@
+# Research: Unified Operations Runs + Monitoring Hub (053)
+
+This document resolves Phase 0 open questions and records design choices for Feature 053.
+
+## Decisions
+
+### 1) Canonical run record (Phase 1)
+
+**Decision:** Reuse the existing `bulk_operation_runs` / `App\Models\BulkOperationRun` as the canonical “operation run” record for Phase 1.
+
+**Rationale:**
+- The codebase already uses `BulkOperationRun` for long-running background work (including Drift generation and Backup Set “Add Policies”).
+- It already supports tenant scoping, initiator attribution, counts, and safe failure persistence.
+- Avoids a high-risk cross-feature migration before we have proven consistent semantics across modules.
+
+**Alternatives considered:**
+- **Create a new generic `operation_runs` (+ optional `operation_run_items`) model** and migrate all producers to it.
+  - Rejected (Phase 1): higher schema + refactor cost, higher coordination risk, and would slow down delivering the Monitoring hub.
+
+### 2) Monitoring/Operations hub surface
+
+**Decision:** Implement the Monitoring/Operations hub by evolving the existing Filament `BulkOperationRunResource` (navigation group/label + filters), rather than creating a new custom monitoring page in Phase 1.
+
+**Rationale:**
+- The resource already provides a tenant-scoped list and a run detail view.
+- Small changes deliver high value quickly and reduce risk.
+
+**Alternatives considered:**
+- **New “Monitoring → Operations” Filament Page + bespoke table/detail.**
+  - Rejected (Phase 1): duplicates existing capabilities and increases maintenance.
+
+### 3) View-only guardrail and viewer roles
+
+**Decision:** Monitoring/Operations is view-only in Phase 1 and is visible to tenant roles `Owner`, `Manager`, `Operator`, and `Readonly`. Start/re-run controls remain in the respective feature UIs.
+
+**Rationale:**
+- Adding run management actions implies introducing cancellation semantics, locks, permission matrices, and race handling across producers.
+- View-only delivers the primary value (transparency + auditability) without expanding scope.
+
+**Alternatives considered:**
+- **Add `Rerun` / `Cancel` actions in the hub.**
+  - Rejected (Phase 1): scope expansion into “run management”.
+- **Restrict viewing to non-Readonly roles.**
+  - Rejected: increases “what happened?” support loops; viewing is safe when sanitized.
+
+### 4) Status semantics and mapping
+
+**Decision:** Standardize UI-level status semantics as `queued → running → (succeeded | partially succeeded | failed)` while allowing underlying storage to keep its current status vocabulary.
+
+- `partially succeeded` = at least one success and at least one failure.
+- `failed` = zero successes (or the run could not proceed).
+- `BulkOperationRun.status` mapping: `pending→queued`, `running→running`, `completed→succeeded`, `completed_with_errors→partially succeeded`, `failed/aborted→failed`.
+
+**Rationale:**
+- Keeps the operator-facing meaning consistent and testable without forcing a broad “rename statuses everywhere” refactor.
+
+**Alternatives considered:**
+- **Normalize all run status values across all run tables immediately.**
+  - Rejected (Phase 1): broad blast radius across many features and tests.
+
+### 5) Failure detail storage
+
+**Decision:** Persist stable reason codes and short sanitized messages for failures; itemized operations also store a sanitized per-item failures list.
+
+**Rationale:**
+- Operators and support should understand failures without reading server logs.
+- Per-item failures avoid rerunning large operations just to identify the affected item.
+
+**Alternatives considered:**
+- **Summary-only failure storage.**
+  - Rejected: loses actionable “which item failed?” detail for itemized runs.
+- **Logs-only (no persisted failure detail).**
+  - Rejected: weaker observability and not aligned with “safe, actionable failures”.
+
+### 6) Idempotency & de-duplication
+
+**Decision:** Use deterministic idempotency keys and active-run reuse as the primary dedupe mechanism:
+- Key builder: `App\Support\RunIdempotency::buildKey(...)` with stable, sorted context.
+- Active-run lookup: reuse when status is active (`pending`/`running`).
+- Race reduction: rely on the existing partial unique index for active runs and handle collisions by finding and reusing the existing run.
+
+**Rationale:**
+- Aligns with the constitution (“Automation must be Idempotent & Observable”).
+- Durable across restarts and observable in the database.
+
+**Alternatives considered:**
+- **Cache-only locks without persisted keys.**
+  - Rejected: less observable and easier to break across deploys/restarts.
+
+### 7) Phase 1 producer scope
+
+**Decision:** Phase 1 adopts the unified monitoring semantics for:
+- Drift generation (`drift.generate`)
+- Backup Set “Add Policies” (`backup_set.add_policies`)
+
+**Rationale:**
+- Both are already using `BulkOperationRun` and provide immediate value in the Monitoring hub.
+- Keeps Phase 1 bounded while proving the pattern across two modules.
+
+**Alternatives considered:**
+- **Include every long-running producer in one pass.**
+  - Rejected (Phase 1): larger blast radius and higher coordination cost.
+
+## Notes
+
+- Retention/purge policy for run history should follow existing platform retention controls (defer to planning if changes are required).
--- a/specs/053-unify-runs-monitoring/spec.md
+++ b/specs/053-unify-runs-monitoring/spec.md
@ -0,0 +1,138 @@
+# Feature Specification: Unified Operations Runs + Monitoring Hub (053)
+
+**Feature Branch**: `feat/053-unify-runs-monitoring`  
+**Created**: 2026-01-15  
+**Status**: Draft  
+**Input**: User description: "Unify long-running operations into a consistent, observable run model and provide a single Monitoring/Operations hub with tenant-scoped authorization, consistent status/count semantics, safe failure visibility, duplicate prevention, and Drift generation as a first-class tracked operation."
+
+## Clarifications
+
+### Session 2026-01-15
+
+- Q: Who can view runs in Monitoring/Operations? → A: `Owner`, `Manager`, `Operator`, and `Readonly` (view-only; `Readonly` sees sanitized details but no action controls).
+- Q: Which operations are included in Phase 1? → A: Monitoring/Operations hub + Drift generation + Backup Set “Add Policies”.
+- Q: Should Monitoring/Operations include run management actions (rerun/cancel/delete)? → A: No; Monitoring/Operations is view-only in Phase 1 (start/re-run remain in feature UIs).
+- Q: How is “partially succeeded” determined vs “failed”? → A: `partially succeeded` means both successes and failures occurred; `failed` means nothing succeeded (or the run could not proceed).
+- Q: What failure detail should be stored and shown for runs? → A: Stable reason codes + short sanitized messages; itemized operations also include per-item failures (sanitized).
+
+## User Scenarios & Testing *(mandatory)*
+
+### User Story 1 - Monitor operations in one place (Priority: P1)
+
+As an operator, I want a single Monitoring/Operations area where I can see what operations are queued/running/succeeded/partially succeeded/failed for my tenant and drill into details, so I can quickly answer “what is happening” without jumping between modules.
+
+**Why this priority**: Provides immediate operational visibility, reduces time-to-diagnose failures, and prevents “run sprawl” (different screens/status meanings per module).
+
+**Independent Test**: Seed multiple runs with different types/statuses and verify the hub lists them tenant-scoped, supports filtering, and a run detail view is available with status, timing, summary counts, and safe error summaries.
+
+**Acceptance Scenarios**:
+
+1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see runs for tenant A sorted most-recent-first and the list defaults to the last 30 days.
+2. **Given** the list contains multiple run types and statuses, **When** I filter by run type and status, **Then** I only see matching runs and can open a run detail page for any result.
+3. **Given** I need an older or narrower timeframe, **When** I set a custom time range, **Then** the list updates to match the selected range.
+4. **Given** I am a `Readonly` user in tenant A, **When** I open a run detail page, **Then** I can view status and sanitized failure information but I do not see controls to start, rerun, cancel, or delete runs.
+5. **Given** I can view Monitoring/Operations, **When** I use the hub, **Then** it is view-only (no controls to start, rerun, cancel, or delete runs).
+6. **Given** an itemized run has failures, **When** I open the run detail page, **Then** I can see which items failed with stable reason codes and short sanitized messages.
+7. **Given** I attempt to access a run from another tenant (via list or direct link), **When** I request it, **Then** access is denied and no run data is disclosed.
+8. **Given** a run has both successes and failures, **When** I view it, **Then** its outcome is “partially succeeded”; **And given** it has zero successes (or cannot proceed), **Then** its outcome is “failed”.
+9. **Given** a run was initiated by the system, **When** I view it, **Then** the initiator is shown as “System”.
+
+---
+
+### User Story 2 - Start long-running actions without waiting (Priority: P2)
+
+As an operator, when I trigger a long-running action, I want an immediate confirmation with a “View run” link so I can keep working while the operation completes in the background.
+
+**Why this priority**: Prevents timeouts and long waits, improves reliability under retries/double-clicks, and standardizes the start experience across modules.
+
+**Independent Test**: Start a supported operation (e.g., drift generation) and verify an operation run is created/reused, the UI returns quickly with a “View run” link, and the run progresses to a terminal status.
+
+**Acceptance Scenarios**:
+
+1. **Given** I have permission to start a supported operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run shows as queued or running.
+2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start a supported operation, **Then** the system denies the request and no new run is created.
+3. **Given** I start Backup Set “Add Policies”, **When** I submit the selection, **Then** I receive immediate confirmation with a “View run” link and the run is visible in Monitoring/Operations.
+4. **Given** I start a supported operation successfully, **When** it is queued, **Then** I receive a “Queued” notification with a “View run” link to the run detail view.
+5. **Given** my run reaches a terminal outcome, **When** that occurs, **Then** I receive a completion notification stating `succeeded`, `partially succeeded`, or `failed`, including summary counts when applicable.
+6. **Given** background execution is unavailable, **When** I attempt to start a supported operation, **Then** I receive a clear message that the operation cannot be queued and I do not receive a misleading “queued” confirmation.
+
+---
+
+### User Story 3 - Drift generation is observable like other operations (Priority: P3)
+
+As an operator, I want drift generation to behave like any other operation: it creates a run, shows progress and outcomes, and links to results so drift is easy to monitor and troubleshoot.
+
+**Why this priority**: Drift is an operator-facing workflow; making it observable avoids confusion and reduces support burden when it fails or takes time.
+
+**Independent Test**: Trigger drift generation for a defined scope, verify it appears in the monitoring hub, reaches a terminal status, and provides either a results link or a safe failure summary.
+
+**Acceptance Scenarios**:
+
+1. **Given** drift generation is available for a selected scope, **When** I trigger “Generate drift now”, **Then** I see a drift run in Monitoring and can open its details.
+2. **Given** drift generation is already queued/running for the same scope, **When** I trigger it again, **Then** the system reuses the existing run and does not start a duplicate.
+3. **Given** drift generation succeeds for a scope, **When** I open the run detail view, **Then** I see a link to the drift findings produced for that scope.
+4. **Given** drift generation fails, **When** I open the run detail view, **Then** I see a safe failure summary (reason code + short sanitized message) and no sensitive data.
+5. **Given** drift generation is requested but there is not enough eligible data to compare, **When** I trigger it, **Then** the system produces no findings and communicates a clear, actionable reason (for example, “insufficient data”).
+
+---
+
+### Edge Cases
+
+- If drift generation is requested for a scope without enough eligible data to compare, the system MUST refuse to start the operation or MUST complete the run as `failed` with a stable reason code (for example, `insufficient_data`) and a short, actionable message.
+- If repeated start attempts occur for the same tenant + run type + scope while an existing run is `queued` or `running`, the system MUST reuse the existing run and MUST NOT start duplicate background work.
+- If the initiator’s permissions change while a run is in progress, the run continues; visibility is evaluated at time of access; and completion notifications are delivered only if the recipient remains authorized to view the run.
+- Failure details MUST remain sanitized even when underlying errors contain sensitive data: only stable reason codes and short sanitized messages are shown; secrets/tokens/raw payload dumps are never shown.
+- For very large scopes, the system MAY summarize or truncate per-item failure listings in the UI, but it MUST preserve accurate summary counts and MUST indicate when failure details are truncated.
+
+## Requirements *(mandatory)*
+
+**Constitution alignment (required):** If this feature introduces any new external tenant reads/writes or any write/change behavior,
+the spec MUST describe safety gates (preview/confirmation/audit), tenant isolation, auditability, and tests. This feature is primarily about observability and standardization; monitoring views MUST be read-only and MUST NOT trigger external data collection.
+
+### Scope & Assumptions
+
+- This feature standardizes tracking and monitoring for long-running operations that already exist in the product.
+- **Phase 1 supported operations** are: Drift generation; Backup Set “Add Policies”. All other candidate operations are explicitly deferred to Phase 2+ adoption (for example: inventory sync, directory group sync, snapshot/backup capture, restore execution, cross-tenant comparison/promotion).
+- System-initiated runs may exist (for example, scheduled operations); they appear in Monitoring/Operations with initiator shown as “System”. Notification routing for system-initiated runs is deferred to Phase 2 (Owner: Product) to avoid unintended noise.
+- **Roles in this spec**: `Owner`, `Manager`, and `Operator` can view Monitoring/Operations and can start Phase 1 supported operations; `Readonly` can view Monitoring/Operations but cannot start or manage runs and must not see those controls.
+- Cross-tenant monitoring aggregation is out of scope; monitoring is tenant-scoped.
+- Advanced dashboards (charts, badges, progress widgets) are out of scope; focus is on consistent run tracking, filtering, and drill-down.
+- Run retention horizon and scale targets (volume per tenant, archiving/export) are deferred to Phase 2 (Owner: Product) to align with real usage data and storage constraints. Phase 1 focuses on operational clarity and defaults the list view to a recent time window.
+- This feature assumes background execution is enabled in each environment so long-running operations can complete outside of interactive user sessions; when it is unavailable, the system must communicate clearly and must not mislead operators into thinking work is queued.
+- Monitoring/Operations is view-only in Phase 1; run start/re-run controls remain in their respective feature areas.
+- Auditability for Phase 1 is achieved via the run record (initiator, timestamps, outcome, counts, safe failure summary) and lifecycle notifications. Auditing “who viewed which run” is deferred to Phase 2 (Owner: Product).
+
+### Functional Requirements
+
+- **FR-001**: System MUST track each supported long-running operation as a tenant-scoped run with: run type, scope/target (when applicable), status, timestamps (created/started/finished), initiator (user or system), and a human-readable label. The label MUST be a stable operator-facing description combining run type and scope/target (English-only in Phase 1; localization deferred to Phase 2).
+- **FR-002**: System MUST provide a Monitoring/Operations area that lists runs for the current tenant, sorted most-recent-first by default, defaulting to a recent time window (last 30 days), and supporting filtering by run type, status, and time range. Status filter values MUST include: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`. Run type filtering MUST include the Phase 1 supported operations (Drift generation; Backup Set “Add Policies”).
+- **FR-003**: System MUST provide a run detail view that shows status/outcome, timing, summary counts, and safe failure summaries. For itemized operations (operations that process a set of items), counts MUST include `total`, `succeeded`, `failed`, and `skipped` (if applicable).
+- **FR-004**: System MUST use consistent run status semantics across run types using the Phase 1 status set: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`. Status meanings MUST be unambiguous: `partially succeeded` indicates at least one success and at least one failure; `failed` indicates zero successes (or the run could not proceed). Cancellation/abort outcomes are deferred to Phase 2.
+- **FR-005**: When an operator starts a supported long-running operation, the system MUST provide immediate confirmation and a “View run” link that opens the run detail view without blocking on completion. If background execution is unavailable, the system MUST provide a clear error and MUST NOT present a misleading “queued” confirmation.
+- **FR-006**: System MUST avoid duplicate runs for the same tenant + run type + scope when an identical run is already `queued` or `running` by reusing the existing run. “Identical” means the same tenant, the same run type, the same scope/target, and the same effective inputs (for example: the same drift scope selection; the same backup set and selected policies). The initiator MUST NOT be part of the identity for duplicate prevention.
+- **FR-007**: Drift generation MUST be tracked as a run and MUST surface completion status and either a link to produced findings or an actionable, safe failure summary.
+- **FR-008**: Run list, run view, and run start actions MUST be tenant-scoped and forbidden cross-tenant. Tenant scoping MUST be applied before any filtering or lookup to prevent cross-tenant data leakage, and cross-tenant access attempts MUST NOT disclose run existence or details.
+- **FR-009**: Run visibility and run start actions MUST be permission-gated by run type (least privilege). By default, `Owner`, `Manager`, `Operator`, and `Readonly` can view runs, but `Readonly` MUST NOT be able to start or manage runs (and must not see those controls).
+- **FR-010**: Failure information stored and displayed for runs MUST be sanitized and minimized; it MUST NOT include secrets, credentials, tokens, PII, or full external payload dumps. Runs MUST store stable reason codes with short sanitized messages. Itemized operations MUST additionally store a sanitized per-item failures list that identifies the affected item using a non-sensitive reference (for example, an item name and/or ID that is safe to display) plus reason code and short message.
+- **FR-011**: System MUST provide consistent user notifications for run lifecycle events: when queued and when reaching a terminal outcome (`succeeded` / `partially succeeded` / `failed`). Notifications MUST include a “View run” link that opens the run detail view. Phase 1 notifications MUST be delivered to the initiating user; notification routing for system-initiated runs is deferred to Phase 2. If a notification cannot be delivered, Monitoring/Operations remains the source of truth for run status.
+- **FR-012**: Where a run produces or relates to a separate business artifact (for example, drift findings), the run detail view MUST provide a link to that artifact.
+- **FR-013**: Backup Set “Add Policies” MUST be tracked as a run and MUST surface terminal outcome status, summary counts, and a link back to the related backup set context.
+- **FR-014**: Monitoring/Operations MUST be view-only in Phase 1 and MUST NOT offer controls to start, rerun, cancel, or delete runs.
+
+### Key Entities *(include if feature involves data)*
+
+- **Run**: A tenant-scoped record representing the execution of a long-running operation, including status, timestamps, summary counts, and safe failure information.
+- **Failure Detail**: For a run, a stable reason code plus a short sanitized message; for itemized operations, a sanitized list of per-item failures to identify affected items without exposing sensitive data.
+- **Run Type**: A classification of what the run does (e.g., inventory sync, backup capture, restore, drift generation) used for filtering, permissions, and consistent status/count semantics.
+- **Run Scope/Target**: The object or scope the run applies to (e.g., a backup set, a policy, a drift scope), used for duplicate prevention and navigation.
+- **Related Artifact**: A separate business record produced by an operation (e.g., restore details, drift findings) that can be linked from the run.
+
+## Success Criteria *(mandatory)*
+
+### Measurable Outcomes
+
+- **SC-001**: An operator can determine the current state (`queued` / `running` / `succeeded` / `partially succeeded` / `failed`) of any recent operation in under 30 seconds using Monitoring/Operations (measured via timed operator walkthroughs using Monitoring/Operations only).
+- **SC-002**: Starting a Phase 1 supported long-running operation provides user-visible confirmation and a “View run” link within 2 seconds under normal conditions (normal conditions: no active service degradation and typical tenant dataset sizes; excludes maintenance/outage windows; measured via timed start flows during operator walkthroughs).
+- **SC-003**: For identical Phase 1 operation requests (same tenant + run type + scope + effective inputs), the system creates no more than one active run at a time (`queued` or `running`) in at least 99% of repeated-start attempts over a rolling 30-day window (measured by sampling repeated-start attempts and counting resulting active runs).
+- **SC-004**: For Phase 1 terminal runs, operators can identify a clear outcome (`succeeded` / `partially succeeded` / `failed`) and a short, non-sensitive failure reason (when applicable) using Monitoring/Operations without inspecting server logs in at least 95% of cases over a rolling 30-day window (measured via support/operator triage review of run records).
+- **SC-005**: Operator- or support-reported incidents caused by “unknown/stuck status” long-running operations decrease by at least 50% within one release cycle after Phase 1 adoption (measured via support ticket tagging/categorization).
--- a/specs/053-unify-runs-monitoring/tasks.md
+++ b/specs/053-unify-runs-monitoring/tasks.md
@ -0,0 +1,174 @@
+---
+
+description: "Task list for implementing Unified Operations Runs + Monitoring Hub (053)"
+---
+
+# Tasks: Unified Operations Runs + Monitoring Hub (053)
+
+**Input**: Design documents from `/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/053-unify-runs-monitoring/`  
+**Prerequisites**: plan.md (required), spec.md (required), research.md, data-model.md, contracts/, quickstart.md
+
+**Tests**: Not explicitly requested in spec.md. Add/adjust Pest tests as needed during implementation; validate with the existing test suite.
+
+**Organization**: Tasks are grouped by user story so each story can be implemented and validated independently.
+
+## Format: `- [ ] T### [P?] [US#?] Description with file path`
+
+- **[P]**: Can run in parallel (different files, no dependencies)
+- **[US#]**: User story mapping (US1/US2/US3). Setup/Foundational/Polish tasks have no story label.
+
+## Path Conventions (Laravel)
+
+- App code: `app/`
+- Filament admin: `app/Filament/`
+- Livewire: `app/Livewire/`
+- Jobs: `app/Jobs/`
+- DB: `database/migrations/`
+- Views: `resources/views/`
+- Tests (Pest): `tests/Feature/`, `tests/Unit/`
+
+---
+
+## Phase 1: Setup (Shared Infrastructure)
+
+**Purpose**: Confirm baseline assumptions and align documentation artifacts with the codebase.
+
+- [x] T001 [P] Confirm “Monitoring/Operations hub = evolve BulkOperationRunResource” decision remains correct and update notes if needed in specs/053-unify-runs-monitoring/research.md
+- [x] T002 [P] Verify Filament URLs match contracts (index/view) and update specs/053-unify-runs-monitoring/contracts/admin-pages.openapi.yaml if paths differ
+
+---
+
+## Phase 2: Foundational (Blocking Prerequisites)
+
+**Purpose**: Shared building blocks required by all user stories.
+
+**⚠️ CRITICAL**: No user story work should begin until this phase is complete.
+
+- [x] T003 Add `runType()` and `statusBucket()` accessors (queued/running/succeeded/partial/failed) to app/Models/BulkOperationRun.php
+- [x] T004 [P] Confirm `Readonly` users can view run list/detail tenant-scoped (and only view) by reviewing/updating app/Policies/BulkOperationRunPolicy.php
+
+**Checkpoint**: Foundation ready — Monitoring UI and run producers can reuse consistent status semantics.
+
+---
+
+## Phase 3: User Story 1 - Monitor operations in one place (Priority: P1) 🎯 MVP
+
+**Goal**: Provide a single Monitoring/Operations area to list and drill into tenant runs with consistent status semantics and safe failure visibility.
+
+**Independent Test**: Visit Monitoring → Operations for a tenant with runs; filter by type/status; open a run and confirm counts + sanitized failures are visible; verify `Readonly` sees view-only UI.
+
+### Implementation
+
+- [x] T005 [US1] Move Operations runs into “Monitoring” navigation and label it “Operations” in app/Filament/Resources/BulkOperationRunResource.php
+- [x] T006 [US1] Render status badges using `statusBucket()` (not raw status) in app/Filament/Resources/BulkOperationRunResource.php
+- [x] T007 [US1] Add filters for run type (`resource.action`) and status bucket in app/Filament/Resources/BulkOperationRunResource.php
+- [x] T008 [US1] Add time range filter (created_at from/to) in app/Filament/Resources/BulkOperationRunResource.php
+- [x] T009 [US1] Add a “Related” section on the run detail view linking to the relevant feature context (e.g., Backup Set for `backup_set.add_policies`) in app/Filament/Resources/BulkOperationRunResource.php
+
+**Checkpoint**: US1 complete — operators can monitor and drill into runs in one place.
+
+---
+
+## Phase 4: User Story 2 - Start long-running actions without waiting (Priority: P2)
+
+**Goal**: Starting a supported long-running operation is non-blocking and provides immediate confirmation + “View run” link; unauthorized users cannot start.
+
+**Independent Test**: Trigger Drift generation and Backup Set “Add Policies”; confirm immediate feedback with “View run” link; confirm `Readonly` cannot start drift generation and no run is created.
+
+### Implementation
+
+- [x] T010 [US2] Prevent drift generation from being started by `Readonly` users (blocked state + message) in app/Filament/Pages/DriftLanding.php
+- [x] T011 [US2] Emit a queued DB notification with “View run” link when Drift generation is queued in app/Filament/Pages/DriftLanding.php
+- [x] T012 [P] [US2] Emit Drift completion and failure DB notifications with “View run” link in app/Jobs/GenerateDriftFindingsJob.php
+
+**Checkpoint**: US2 complete — start UX is consistent and permission-gated.
+
+---
+
+## Phase 5: User Story 3 - Drift generation is observable like other operations (Priority: P3)
+
+**Goal**: Drift generation creates/reuses a run, surfaces safe failure details, and links operators to results.
+
+**Independent Test**: Trigger Drift generation; observe it in Monitoring → Operations; open the run and follow a link to Drift findings; simulate failure and confirm safe failure reason is visible on the run.
+
+### Implementation
+
+- [x] T013 [US3] Store Drift context (scope_key, baseline_run_id, current_run_id) inside the run payload so Monitoring can link to results in app/Filament/Pages/DriftLanding.php
+- [x] T014 [P] [US3] Record a sanitized failure entry (reason_code + short message) into `BulkOperationRun.failures` when Drift generation fails in app/Jobs/GenerateDriftFindingsJob.php
+- [x] T015 [US3] Add a “Drift findings” link for `drift.generate` runs in the run detail “Related” section in app/Filament/Resources/BulkOperationRunResource.php
+
+**Checkpoint**: US3 complete — drift runs are actionable and consistent with other operations.
+
+---
+
+## Phase 6: Polish & Cross-Cutting Concerns
+
+**Purpose**: Final alignment, validation, and guardrails.
+
+- [x] T016 [P] Update operator-facing notes and validation commands in specs/053-unify-runs-monitoring/quickstart.md (only if implementation changes)
+- [x] T017 [P] Update docs to match implementation if needed: specs/053-unify-runs-monitoring/spec.md and specs/053-unify-runs-monitoring/data-model.md
+- [x] T018 Run formatting on changed PHP files with `./vendor/bin/pint --dirty` (reference: specs/053-unify-runs-monitoring/quickstart.md)
+- [x] T019 Run targeted validation commands from specs/053-unify-runs-monitoring/quickstart.md (queue worker optional; run relevant Pest tests)
+- [x] T020 [P] Re-verify contracts match real URLs and access behavior in specs/053-unify-runs-monitoring/contracts/admin-pages.openapi.yaml
+
+---
+
+## Dependencies & Execution Order
+
+### Dependency Graph (User Stories)
+
+```text
+Phase 1 (Setup) ─┬─> Phase 2 (Foundational) ─┬─> US1 (P1)  ─┬─> Polish
+                  │                           ├─> US2 (P2)  │
+                  │                           └─> US3 (P3)  ┘
+                  └────────────────────────────────────────────
+```
+
+### User Story Dependencies
+
+- **US1** depends on Phase 2 (Foundational); independent of US2/US3.
+- **US2** depends on Phase 2 (Foundational); independent of US1/US3.
+- **US3** depends on Phase 2 (Foundational) and benefits from US1 (Monitoring visibility) but can be implemented independently.
+
+---
+
+## Parallel Execution Examples
+
+### US1 (Monitoring UI)
+
+```text
+After Phase 2 is complete, one developer can focus on:
+- app/Filament/Resources/BulkOperationRunResource.php (T005–T009)
+```
+
+### US2 (Start UX / Notifications)
+
+```text
+These can be done in parallel after Phase 2:
+- app/Filament/Pages/DriftLanding.php (T010–T011)
+- app/Jobs/GenerateDriftFindingsJob.php (T012)
+```
+
+### US3 (Drift observability)
+
+```text
+These can be done in parallel after Phase 2:
+- app/Filament/Pages/DriftLanding.php (T013)
+- app/Jobs/GenerateDriftFindingsJob.php (T014)
+```
+
+---
+
+## Implementation Strategy
+
+### MVP First (US1 Only)
+
+1. Complete Phase 1 + Phase 2
+2. Complete US1 (Phase 3) and validate Monitoring/Operations end-to-end
+3. Ship/demonstrate Monitoring value before expanding run producer behavior
+
+### Incremental Delivery
+
+1. US1 (Monitoring hub) → validates visibility/auditability
+2. US2 (start guardrails + notifications) → standardizes operator feedback
+3. US3 (drift linking + safe failure detail) → makes drift runs fully actionable