docs: unified operations runs specs and plan (054)

2026-01-16 19:06:30 +01:00 · 2026-01-16 19:06:30 +01:00 · 48b558db93
commit 48b558db93
parent 30ad57baab
11 changed files with 613 additions and 0 deletions
--- a/GEMINI.md
+++ b/GEMINI.md
@@ -669,3 +669,11 @@ ### Replaced Utilities
 | decoration-slice | box-decoration-slice |
 | decoration-clone | box-decoration-clone |
 </laravel-boost-guidelines>
+
+## Recent Changes
+- 054-unify-runs-suitewide: Added PHP 8.4 + Filament v4, Laravel v12, Livewire v3
+- 054-unify-runs-suitewide: Added PostgreSQL (`operation_runs` table + JSONB)
+- 054-unify-runs-suitewide: Added PHP 8.4 + Filament v4, Laravel v12, Livewire v3
+
+## Active Technologies
+- PostgreSQL (`operation_runs` table + JSONB) (054-unify-runs-suitewide)
--- a/specs/054-unify-runs-suitewide/checklists/requirements.md
+++ b/specs/054-unify-runs-suitewide/checklists/requirements.md
@@ -0,0 +1,30 @@
+# Requirements Checklist: Unified Operations Runs
+
+## Phase 1 Adoption Set
+- [x] `inventory.sync` (Inventory “Sync now”) covered in spec
+- [x] `policy.sync` (Policies “Sync now”) covered in spec
+- [x] `directory_groups.sync` (Directory → Groups “Sync groups”) covered in spec
+- [x] `drift.generate` (Drift “Generate drift now”) covered in spec
+- [x] `backup_set.add_policies` (Backup Sets “Add selected”) covered in spec
+- [x] `restore.execute` (adapter mode) covered in spec
+
+## Critical Clarifications (Pinned)
+- [x] Retention policy defined (90 days default)
+- [x] Transition strategy defined (Parallel write: Canonical + Legacy)
+- [x] Concurrency enforcement defined (Partial unique index on active runs)
+- [x] Initiator model defined (Nullable FK + Name Snapshot)
+- [x] Restore integration defined (Physical adapter row pointing to Restore Domain record)
+
+## Functional Requirements (Spec Coverage)
+- [x] FR-001 Canonical Operation Run schema defined (see `data-model.md`)
+- [x] FR-004 Monitoring List UI specified (filters/sort defined in Spec FR-004)
+- [x] FR-005 Monitoring Detail UI specified (content defined in Spec FR-005)
+- [x] FR-007 Start surfaces behavior specified (Spec FR-007)
+- [x] FR-009 Idempotency (Partial Unique Index) strategy defined (Spec FR-009, Plan)
+- [x] FR-015 Notifications for queued/terminal states specified (Spec FR-015)
+- [x] FR-016 Tenant isolation rules specified (Spec FR-016)
+
+## Non-Functional (Spec Coverage)
+- [x] SC-002 Start confirmation within 2s target defined (Spec SC-002)
+- [x] SC-003 Deduplication rate >= 99% strategy defined (Spec SC-003)
+- [x] SC-004 No secrets in failure logs rule defined (Spec SC-004)
--- a/specs/054-unify-runs-suitewide/contracts/admin-pages.openapi.yaml
+++ b/specs/054-unify-runs-suitewide/contracts/admin-pages.openapi.yaml
@@ -0,0 +1,55 @@
+openapi: 3.0.3
+info:
+  title: TenantPilot Admin Operations Contracts (Feature 054)
+  version: 0.1.0
+  description: |
+    Minimal page-render contracts for the Monitoring/Operations hub.
+
+    These pages must render from the database only (no external tenant calls)
+    and display only sanitized failure detail (no secrets/tokens/raw payload dumps).
+
+servers:
+  - url: /
+
+paths:
+  /admin/t/{tenantExternalId}/bulk-operation-runs:
+    get:
+      operationId: monitoringOperationsIndex
+      summary: Monitoring → Operations (tenant-scoped)
+      parameters:
+        - name: tenantExternalId
+          in: path
+          required: true
+          schema:
+            type: string
+      responses:
+        '200':
+          description: Page renders successfully.
+        '302':
+          description: Redirect to login when unauthenticated.
+
+  /admin/t/{tenantExternalId}/bulk-operation-runs/{bulkOperationRunId}:
+    get:
+      operationId: monitoringOperationsView
+      summary: Operation run detail (tenant-scoped)
+      parameters:
+        - name: tenantExternalId
+          in: path
+          required: true
+          schema:
+            type: string
+        - name: bulkOperationRunId
+          in: path
+          required: true
+          schema:
+            type: integer
+      responses:
+        '200':
+          description: Page renders successfully.
+        '302':
+          description: Redirect to login when unauthenticated.
+        '403':
+          description: Forbidden when attempting cross-tenant access.
+
+components: {}
+
--- a/specs/054-unify-runs-suitewide/contracts/routes.md
+++ b/specs/054-unify-runs-suitewide/contracts/routes.md
@@ -0,0 +1,18 @@
+# Routes & URLs
+
+## Monitoring UI
+
+### List Operations
+- **Route**: `tenant.monitoring.operations.index`
+- **URL**: `/tenants/{tenant}/monitoring/operations`
+- **Controller**: Livewire Component (`App\Livewire\Monitoring\OperationsList`)
+
+### View Operation
+- **Route**: `tenant.monitoring.operations.show`
+- **URL**: `/tenants/{tenant}/monitoring/operations/{run}`
+- **Controller**: Livewire Component (`App\Livewire\Monitoring\OperationsDetail`)
+
+## Deep Links
+- **Drift**: `/tenants/{tenant}/drift/history/{id}`
+- **Inventory**: `/tenants/{tenant}/inventory` (General, or specific timestamp if supported)
+- **Restore**: `/tenants/{tenant}/restore/{id}`
--- a/specs/054-unify-runs-suitewide/contracts/service_interface.md
+++ b/specs/054-unify-runs-suitewide/contracts/service_interface.md
@@ -0,0 +1,48 @@
+# Service Interface: Operation Runs
+
+## `App\Services\OperationRunService`
+
+### `ensureRun`
+Idempotently creates or retrieves an active run.
+
+```php
+public function ensureRun(
+    Tenant $tenant,
+    string $type,
+    array $inputs,
+    ?User $initiator = null
+): OperationRun
+```
+
+- **Logic**:
+  1. Compute `hash = sha256(tenant_id + type + sorted_json(inputs))`.
+  2. Try finding active run (`queued` or `running`) with this hash.
+  3. If found, return it.
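+  4. If not found, create a new `queued` run; if the insert hits the partial unique index (a concurrent start won the race), re-fetch the active run instead.
+  5. Return the run.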
+
+### `updateRun`
+Updates the status/outcome of a run.
+
+```php
+public function updateRun(
+    OperationRun $run,
+    string $status,
+    ?string $outcome = null,
+    array $summaryCounts = [],
+    array $failures = []
+): OperationRun
+```
+
+### `failRun`
+Helper to fail a run immediately.
+
+```php
+public function failRun(OperationRun $run, Throwable $e): OperationRun
+```
+
+## `App\Jobs\Middleware\TrackOperationRun`
+Middleware for jobs that automatically handles the `running` -> `completed` transition (with outcome `succeeded` or `failed`) when the job is bound to a run.
+
+## `App\Listeners\SyncRestoreRunToOperation`
+Listener for `RestoreRun` events to update the shadow `OperationRun`.
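Step 1 of `ensureRun` can be sketched as follows. This is a minimal illustration, assuming `sorted_json` means "recursively sort associative keys, then JSON-encode". The function names `canonicalizeInputs` and `runIdentityHash` and the `|` separator are hypothetical; the spec only fixes that the hash covers tenant, type, and effective inputs, and that the initiator is excluded.

```php
<?php
declare(strict_types=1);

/**
 * Recursively sort associative arrays by key so that logically equal
 * inputs serialize identically. Lists keep their order (assumption:
 * callers sort lists themselves when order is not meaningful).
 */
function canonicalizeInputs(array $inputs): array
{
    foreach ($inputs as $key => $value) {
        if (is_array($value)) {
            $inputs[$key] = canonicalizeInputs($value);
        }
    }
    if (!array_is_list($inputs)) {
        ksort($inputs);
    }
    return $inputs;
}

/**
 * Hypothetical helper: SHA-256 over tenant id, run type, and canonical inputs.
 * The initiator is deliberately not part of the identity.
 */
function runIdentityHash(int $tenantId, string $type, array $inputs): string
{
    $json = json_encode(canonicalizeInputs($inputs), JSON_THROW_ON_ERROR);
    return hash('sha256', $tenantId . '|' . $type . '|' . $json);
}

// Same effective inputs in a different key order yield the same hash.
$a = runIdentityHash(1, 'inventory.sync', ['scope' => 'policy', 'policy_id' => 1]);
$b = runIdentityHash(1, 'inventory.sync', ['policy_id' => 1, 'scope' => 'policy']);
```

The 64-character hex digest fits the `VARCHAR(64)` column proposed in `research.md`.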
--- a/specs/054-unify-runs-suitewide/data-model.md
+++ b/specs/054-unify-runs-suitewide/data-model.md
@@ -0,0 +1,52 @@
+# Data Model: Unified Operations Runs
+
+## Entities
+
+### `OperationRun`
+Canonical record for all long-running tenant operations.
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `id` | BigInt | Yes | Primary Key |
+| `tenant_id` | BigInt | Yes | FK to Tenants |
+| `user_id` | BigInt | No | FK to Users (Initiator). Null for system/scheduler. |
+| `initiator_name` | String | Yes | Snapshot of user name or "System". |
+| `type` | String | Yes | Stable taxonomy, e.g. `inventory.sync`. |
+| `status` | String | Yes | Lifecycle state: `queued`, `running`, `completed`. |
+| `outcome` | String | Yes | Result bucket: `pending`, `succeeded`, `partially_succeeded`, `failed`, `cancelled`. |
+| `run_identity_hash` | String | Yes | Deterministic hash for idempotency. |
+| `summary_counts` | JSONB | No | `{ "total": 10, "success": 8, "failed": 2, "skipped": 0 }` |
+| `failure_summary` | JSONB | No | List of sanitized errors: `[{ "code": "GraphError", "message": "Throttled", "count": 1 }]` |
+| `context` | JSONB | No | Run-specific metadata. e.g., `{ "restore_run_id": 123, "selection": [...] }` |
+| `started_at` | Timestamp | No | When execution began. |
+| `completed_at` | Timestamp | No | When execution finished. |
+| `created_at` | Timestamp | Yes | |
+| `updated_at` | Timestamp | Yes | |
+
+**Indexes**:
+- `(tenant_id, run_identity_hash)` UNIQUE WHERE status IN ('queued', 'running')
+- `(tenant_id, type, created_at)` for filtering/sorting
+- `(tenant_id, created_at)` for default sort
+
+### `RestoreRun` (Existing)
+Remains the domain source of truth for Restore.
+- Linked via `OperationRun.context['restore_run_id']`.
+- `OperationRun` mirrors `RestoreRun` status/outcome.
+
+## Enums
+
+### `OperationRunStatus`
+- `queued`
+- `running`
+- `completed`
+
+### `OperationRunOutcome`
+- `pending` (default when running/queued)
+- `succeeded`
+- `partially_succeeded`
+- `failed`
+- `cancelled`
+
+## Relationships
+- `OperationRun` belongs to `Tenant`.
+- `OperationRun` belongs to `User` (optional).
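The `failure_summary` shape above (`code`, `message`, `count`) could be produced along these lines. This is a rough sketch, not the project's implementation: `summarizeFailures`, the 120-byte limit, and the `Bearer` masking pattern are all assumptions chosen for illustration, and a real sanitizer would need a reviewed list of patterns.

```php
<?php
declare(strict_types=1);

/**
 * Collapse raw failures into entries of the shape { code, message, count }.
 * Messages are masked and truncated before grouping (illustrative rules only).
 *
 * @param list<array{code: string, message: string}> $failures
 * @return list<array{code: string, message: string, count: int}>
 */
function summarizeFailures(array $failures, int $maxLength = 120): array
{
    $summary = [];
    foreach ($failures as $failure) {
        // Mask anything that looks like a bearer token (hypothetical pattern).
        $message = preg_replace('/Bearer\s+\S+/i', 'Bearer [redacted]', $failure['message']) ?? '';
        // Byte-based truncation keeps the sketch free of extensions.
        $message = substr($message, 0, $maxLength);

        $key = $failure['code'] . '|' . $message;
        if (!isset($summary[$key])) {
            $summary[$key] = ['code' => $failure['code'], 'message' => $message, 'count' => 0];
        }
        $summary[$key]['count']++;
    }
    return array_values($summary);
}

$result = summarizeFailures([
    ['code' => 'GraphError', 'message' => 'Throttled'],
    ['code' => 'GraphError', 'message' => 'Throttled'],
    ['code' => 'AuthError', 'message' => 'Bearer abc.def rejected'],
]);
```

Grouping after sanitization means two failures that differ only in a masked token collapse into one entry, which keeps the JSONB payload small.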
--- a/specs/054-unify-runs-suitewide/plan.md
+++ b/specs/054-unify-runs-suitewide/plan.md
@@ -0,0 +1,76 @@
+# Implementation Plan: Unified Operations Runs Suitewide
+
+**Branch**: `feat/054-unify-operations-runs-suitewide` | **Date**: 2026-01-16 | **Spec**: [Spec Link](spec.md)
+**Input**: Feature specification from `specs/054-unify-runs-suitewide/spec.md`
+
+## Summary
+
+This feature unifies long-running tenant operations (e.g., Inventory Sync, Drift Generation) into a single canonical `operation_runs` table. This enables a consistent "Monitoring -> Operations" view for all tenant activities. Legacy run tables will be maintained in parallel for now (Parallel Write Transition). `RestoreRun` remains a domain-specific record but will be mirrored into `operation_runs` via an adapter pattern.
+
+## Technical Context
+
+**Language/Version**: PHP 8.4
+**Primary Dependencies**: Filament v4, Laravel v12, Livewire v3
+**Storage**: PostgreSQL (`operation_runs` table + JSONB)
+**Testing**: Pest v4 (Feature tests for Service, Livewire tests for UI)
+**Target Platform**: Linux server (Docker/Dokploy)
+**Project Type**: Web Application (Laravel Monolith)
+**Performance Goals**: Start operation < 2s. List runs < 200ms.
+**Constraints**: Tenant isolation is paramount. No cross-tenant data leakage.
+**Scale/Scope**: ~50-100 runs/day per tenant. Retention 90 days.
+
+## Constitution Check
+
+*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
+
+- [x] Inventory-first: N/A (this is about tracking operations, not inventory state itself)
+- [x] Read/write separation: Monitoring is read-only. Starts are explicit writes.
+- [x] Graph contract path: N/A (this feature tracks runs, doesn't call Graph directly)
+- [x] Deterministic capabilities: N/A
+- [x] Tenant isolation: `operation_runs` has `tenant_id`. Policies ensure scope.
+- [x] Automation: Idempotency enforced via DB index.
+- [x] Data minimization: No secrets in `failure_summary`.
+
+## Project Structure
+
+### Documentation (this feature)
+
+```text
+specs/054-unify-runs-suitewide/
+├── plan.md              # This file
+├── research.md          # Research findings
+├── data-model.md        # Database schema
+├── quickstart.md        # Dev guide
+├── contracts/           # Service interfaces & routes
+└── tasks.md             # Task breakdown
+```
+
+### Source Code (repository root)
+
+```text
+app/
+├── Models/
+│   └── OperationRun.php
+├── Services/
+│   └── OperationRunService.php
+├── Livewire/
+│   └── Monitoring/
+│       ├── OperationsList.php
+│       └── OperationsDetail.php
+├── Jobs/
+│   └── Middleware/
+│       └── TrackOperationRun.php
+└── Listeners/
+    └── SyncRestoreRunToOperation.php
+
+database/migrations/
+└── YYYY_MM_DD_create_operation_runs_table.php
+```
+
+**Structure Decision**: Standard Laravel Service/Model/Livewire pattern.
+
+## Complexity Tracking
+
+| Violation | Why Needed | Simpler Alternative Rejected Because |
+|-----------|------------|-------------------------------------|
+| None | | |
--- a/specs/054-unify-runs-suitewide/quickstart.md
+++ b/specs/054-unify-runs-suitewide/quickstart.md
@@ -0,0 +1,49 @@
+# Quickstart: Adding a New Operation
+
+## 1. Register Run Type
+Add your new type constant to `App\Enums\OperationRunType` (if using Enums) or just use the string convention `resource.action`.
+
+## 2. Implement Idempotency Inputs
+Define what makes a run "unique" for your feature.
+- Example: `['scope' => 'full']` vs `['scope' => 'policy', 'policy_id' => 1]`.
+
+## 3. Use `OperationRunService`
+In your Start Action (Controller/Livewire):
+
+```php
+// 1. Ensure Run
+$run = $service->ensureRun($tenant, 'my_resource.action', $inputs, auth()->user());
+
+// 2. Dispatch Job (if new)
+if ($run->wasRecentlyCreated) {
+    MyJob::dispatch($run, $inputs);
+}
+
+// 3. Return View Link
+return redirect()->route('tenant.monitoring.operations.show', [$tenant, $run]);
+```
+
+## 4. Instrument Job
+In your Job:
+
+```php
+public function handle(OperationRunService $service): void
+{
+    // Update to Running ($service is resolved by the container)
+    $service->updateRun($this->run, status: 'running');
+
+    try {
+        // ... do work ...
+        // Success
+        $service->updateRun(
+            $this->run,
+            status: 'completed',
+            outcome: 'succeeded',
+            summaryCounts: ['total' => 100, 'success' => 100]
+        );
+    } catch (\Throwable $e) {
+        // Failure
+        $service->failRun($this->run, $e);
+    }
+}
+```
--- a/specs/054-unify-runs-suitewide/research.md
+++ b/specs/054-unify-runs-suitewide/research.md
@@ -0,0 +1,65 @@
+# Research: Unified Operations Runs Suitewide
+
+## 1. Technical Context & Unknowns
+
+**Unknowns Resolved**:
+- **Transition Strategy**: Parallel write. We will maintain existing legacy tables (e.g., `inventory_sync_runs`, `restore_runs`) for now but strictly use `operation_runs` for the Monitoring UI.
+- **Restore Adapter**: `RestoreRun` remains the domain source of truth. An `OperationRun` record will be created as a "shadow" or "adapter" record. This requires hooking into `RestoreRun` lifecycle events or the service layer to keep them in sync.
+- **Run Logic Location**: Existing jobs like `RunInventorySyncJob` will be updated to manage the `OperationRun` state.
+- **Concurrency**: Enforced by partial unique index on `(tenant_id, run_identity_hash)` where status is active (`queued`, `running`).
+
+## 2. Technology Choices
+
+| Area | Decision | Rationale | Alternatives |
+|------|----------|-----------|--------------|
+| **Schema** | `operation_runs` table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (Complex, harder to paginate/sort efficiently). |
+| **Restore Integration** | Physical Adapter Row | Decouples Monitoring from Restore domain specifics. Allows uniform "list all runs" queries. The `context` JSON column will store `{ "restore_run_id": ... }`. | Polymorphic relation (Overhead for a single exception). |
+| **Idempotency** | DB Partial Unique Index | Hard guarantee against race conditions. Simpler than distributed locks (Redis) which can expire or fail. | Redis Lock (Soft guarantee), Application check (Race prone). |
+| **Initiator** | Nullable FK + Name | Handles both Users (FK) and System/Scheduler (Name "System") uniformly. | Polymorphic relation (Overkill for simple auditing). |
+
+## 3. Implementation Patterns
+
+### Canonical Run Lifecycle
+1.  **Start Request**:
+    -   Compute `run_identity_hash` from inputs.
+    -   Attempt `INSERT` into `operation_runs` (ignore conflict if active).
+    -   If active run exists, return it (Idempotency).
+    -   If new, dispatch Job.
+2.  **Job Execution**:
+    -   Update status to `running`.
+    -   Perform work.
+    -   Update status to `completed` with outcome `succeeded`/`failed`.
+3.  **Restore Adapter**:
+    -   When `RestoreRun` is created, create `OperationRun` (queued/running).
+    -   When `RestoreRun` updates (status change), update `OperationRun`.
+
+### Data Model
+```sql
+CREATE TABLE operation_runs (
+    id BIGSERIAL PRIMARY KEY,
+    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
+    user_id BIGINT NULL REFERENCES users(id), -- Initiator
+    initiator_name VARCHAR(255) NOT NULL, -- "John Doe" or "System"
+    type VARCHAR(255) NOT NULL, -- "inventory.sync"
+    status VARCHAR(50) NOT NULL, -- queued, running, completed
+    outcome VARCHAR(50) NOT NULL, -- pending, succeeded, partially_succeeded, failed, cancelled
+    run_identity_hash VARCHAR(64) NOT NULL, -- SHA256(tenant_id + type + inputs)
+    summary_counts JSONB DEFAULT '{}', -- { success: 10, failed: 2 }
+    failure_summary JSONB DEFAULT '[]', -- [{ code: "ERR_TIMEOUT", message: "..." }]
+    context JSONB DEFAULT '{}', -- { selection: [...], restore_run_id: 123 }
+    started_at TIMESTAMP NULL,
+    completed_at TIMESTAMP NULL,
+    created_at TIMESTAMP,
+    updated_at TIMESTAMP
+);
+
+CREATE UNIQUE INDEX operation_runs_active_unique 
+ON operation_runs (tenant_id, run_identity_hash) 
+WHERE status IN ('queued', 'running');
+```
+
+## 4. Risks & Mitigations
+-   **Risk**: Desync between `RestoreRun` and `OperationRun`.
+    -   **Mitigation**: Use model observers or service-layer wrapping to ensure atomic-like updates, or accept slight eventual consistency (Monitoring might lag a few milliseconds behind the Restore UI).
+-   **Risk**: Legacy runs not appearing.
+    -   **Mitigation**: We are NOT backfilling legacy runs. Only new runs after deployment will appear in the new Monitoring UI. This is acceptable for "Phase 1".
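The "active runs only" semantics of the partial unique index can be modelled in plain PHP. The class below is a hypothetical in-memory stand-in for the table, written only to show the intended behaviour: a second start reuses a `queued`/`running` run, while a start after completion creates a fresh one. It does not demonstrate race safety; that guarantee comes from the database index itself.

```php
<?php
declare(strict_types=1);

/** Hypothetical in-memory model of operation_runs, for illustration only. */
final class InMemoryRunStore
{
    /** @var array<int, array{id: int, tenant_id: int, hash: string, status: string}> */
    private array $runs = [];
    private int $nextId = 1;

    /**
     * Return the active run for (tenant, hash) if one exists, else create one.
     *
     * @return array{0: array{id: int, tenant_id: int, hash: string, status: string}, 1: bool}
     */
    public function ensureRun(int $tenantId, string $hash): array
    {
        foreach ($this->runs as $run) {
            $active = in_array($run['status'], ['queued', 'running'], true);
            if ($active && $run['tenant_id'] === $tenantId && $run['hash'] === $hash) {
                return [$run, false]; // reuse: mirrors the unique-index conflict path
            }
        }
        $run = ['id' => $this->nextId++, 'tenant_id' => $tenantId, 'hash' => $hash, 'status' => 'queued'];
        $this->runs[$run['id']] = $run;
        return [$run, true];
    }

    /** Mark a run as completed so it no longer blocks new runs with the same identity. */
    public function complete(int $id): void
    {
        $this->runs[$id]['status'] = 'completed';
    }
}

$store = new InMemoryRunStore();
[$first, $created1] = $store->ensureRun(1, 'abc');
[$second, $created2] = $store->ensureRun(1, 'abc'); // duplicate start while active
$store->complete($first['id']);
[$third, $created3] = $store->ensureRun(1, 'abc');  // same identity after completion
```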
--- a/specs/054-unify-runs-suitewide/spec.md
+++ b/specs/054-unify-runs-suitewide/spec.md
@@ -0,0 +1,148 @@
+# Feature Specification: Unified Operations Runs Suitewide (Except Restore Domain Model) (054)
+
+**Feature Branch**: `feat/054-unify-operations-runs-suitewide`  
+**Created**: 2026-01-16  
+**Status**: Draft  
+**Input**: User description: "Eliminate run sprawl by adopting one canonical tenant-scoped operation run record for long-running actions across the product, surfaced consistently in Monitoring → Operations, while keeping restore as a separate domain workflow that is still visible via an adapter entry."
+
+## Clarifications
+
+### Session 2026-01-16
+
+- Q: Which default retention should 054 set for canonical Operation Runs? → A: 90 days
+- Q: Transition strategy in 054: do we write canonical runs in parallel with the legacy run tables, or replace them immediately? → A: Parallel write (canonical + legacy)
+- Q: For `restore.execute`, the spec mentions it acts as an "adapter entry" linking to the restore domain record. How should this be implemented? → A: Physical Row (Create a physical row in `operation_runs` that points to the restore record).
+- Q: How should concurrency and deduplication (FR-009) be enforced at the database level? → A: Partial Unique Index (unique constraint on `tenant_id, run_identity_hash` where status is `queued` or `running`).
+- Q: How should the `initiator` be modeled to support both users and system processes (FR-001)? → A: Nullable FK + Name Snapshot (`user_id` nullable FK + required `initiator_name` string).
+
+## User Scenarios & Testing *(mandatory)*
+
+### User Story 1 - See Every Supported Operation in Monitoring (Priority: P1)
+
+As an operator, I want Monitoring → Operations to show all supported long-running operations for my tenant in one consistent list and detail view, so I can quickly answer what ran, who started it, whether it succeeded/partially succeeded/failed, and where to look next.
+
+**Why this priority**: This is the core value: a single, tenant-scoped source of truth for operational visibility.
+
+**Independent Test**: Trigger at least one run of each Phase 1 run producer, then verify each appears in Monitoring with consistent status/outcome semantics, safe failure summaries, and context links.
+
+**Acceptance Scenarios**:
+
+1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see only tenant A runs and can filter by run type, outcome bucket, time range, and initiator.
+2. **Given** multiple run types exist, **When** I filter to `inventory.sync`, **Then** only inventory sync runs are shown.
+3. **Given** a run exists, **When** I open its detail view, **Then** I can see initiator, run type, outcome bucket, timestamps, summary counts (if applicable), sanitized failures (if any), and links to relevant feature context/results.
+4. **Given** restore execution exists, **When** I open Monitoring → Operations, **Then** I can see a `restore.execute` entry that links to the existing restore record (restore history remains owned by the restore domain record).
+5. **Given** I am a `Readonly` user in tenant A, **When** I view Monitoring → Operations, **Then** I can view runs and details but I do not see any start/rerun/cancel/delete controls.
+6. **Given** I attempt to access a run from another tenant (direct link or list), **When** I request it, **Then** access is denied and no run details are disclosed.
+
+---
+
+### User Story 2 - Start Operations Without Blocking (Priority: P2)
+
+As an operator, when I start a supported operation, I want immediate confirmation and a “View run” link so I can continue working while the operation runs in the background.
+
+**Why this priority**: Removes long-running requests/timeouts and standardizes how operations are started and observed.
+
+**Independent Test**: Start each Phase 1 operation from its owning UI and confirm the start returns quickly, includes “View run”, and the run progresses through queued/running into a terminal outcome.
+
+**Acceptance Scenarios**:
+
+1. **Given** I have permission to start a Phase 1 operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run is visible as queued or running.
+2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start any Phase 1 operation, **Then** the system denies the request and does not create a new run.
+3. **Given** the run reaches a terminal outcome, **When** that occurs, **Then** the initiating user receives an in-app notification including a short summary and a “View run” link.
+4. **Given** background processing is unavailable, **When** I attempt to start an operation, **Then** I receive a clear message and the system MUST NOT claim it was queued.
+
+---
+
+### User Story 3 - Duplicate Starts Reuse the Same Active Run (Priority: P3)
+
+As an operator, I want accidental double-starts (double clicks, two admins, retries) to reuse the same active run so duplicate background work is avoided and results remain auditable.
+
+**Why this priority**: Reduces load, prevents confusing duplicate outcomes, and makes operations safer under concurrency.
+
+**Independent Test**: Start the same operation twice with identical effective inputs while the first is queued/running and verify the system reuses the active run.
+
+**Acceptance Scenarios**:
+
+1. **Given** an identical run is queued/running for a tenant, **When** another start request is made with the same effective inputs, **Then** the system reuses the existing run and does not start a second one.
+2. **Given** two starts happen at nearly the same time, **When** the system resolves the race, **Then** at most one active run exists for that identity and both users are directed to it.
+
+### Edge Cases
+
+- Background execution unavailable: start fails fast with a clear message; the system MUST NOT create misleading “queued” runs.
+- Partial processing: at least one success and at least one failure yields “partially succeeded”, with per-item failures when applicable.
+- Large run history: Monitoring remains usable with filters and defaults (recent runs, last 30 days).
+- Permissions revoked mid-run: the run continues; visibility is evaluated at time of access.
+
+## Requirements *(mandatory)*
+
+**Constitution alignment (required):** If this feature introduces any external tenant API calls or any write/change behavior,
+the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.
+
+### Scope & Assumptions
+
+**Phase 1 adoption set (must be implemented):**
+
+- `inventory.sync` (Inventory “Sync now”)
+- `policy.sync` (Policies “Sync now”)
+- `directory_groups.sync` (Directory → Groups “Sync groups”)
+- `drift.generate` (Drift “Generate drift now” / auto-on-open when eligible)
+- `backup_set.add_policies` (Backup Sets “Add selected” / “Add policies”)
+
+**Restore visibility (adapter only):**
+
+- `restore.execute` appears as a canonical run entry that links to an existing restore domain record.
+- Restore execution history remains owned by the restore domain record (not replaced in Phase 1).
+
+**Out of scope for 054 (explicit):**
+
+- Cross-tenant compare/promotion
+- UI redesign/styling polish (separate UI polish work)
+- Cancel/rerun/delete controls inside Monitoring hub (hub stays view-only)
+- Replacing restore domain records with canonical runs
+- A full settings UI for retention/notifications/etc.
+
+**Assumptions (defaults to remove ambiguity in Phase 1):**
+
+- Canonical run history retention defaults to 90 days, with no user-facing retention configuration in 054.
+- System-initiated runs (if any) do not notify users by default in Phase 1.
+- Transition strategy: write canonical runs in parallel with any existing legacy per-module run tables (where they exist); Monitoring uses canonical runs as the source of truth immediately.
+
+### Functional Requirements
+
+- **FR-001 Canonical Operation Run**: System MUST represent each supported operation execution as a canonical, tenant-scoped operation run record that captures initiator (nullable `user_id` FK + `initiator_name` string), run type, lifecycle status/timestamps, outcome bucket, summary counts (when applicable), safe failure summaries, an idempotency identity for dedupe, and a safe context payload referencing “what this run was about”.
+- **FR-002 Run taxonomy**: Run type MUST be stable and follow `"<resource>.<action>"`.
+- **FR-003 Phase 1 run types**: Phase 1 run types MUST include `inventory.sync`, `policy.sync`, `directory_groups.sync`, `drift.generate`, `backup_set.add_policies`, plus `restore.execute` implemented as a physical `operation_runs` record (adapter) pointing to the domain entity.
+- **FR-004 Monitoring lists all canonical runs**: Monitoring → Operations MUST list canonical runs for the active tenant with filters for run type, outcome bucket, time range, and initiator; default sort is most recent first; default time window is last 30 days.
+- **FR-005 Run detail**: Run detail MUST show initiator, run type, outcome bucket, timestamps (created/started/finished), summary counts (when applicable), sanitized failures (including per-item failures when applicable), and contextual links to owning feature surfaces/results.
+- **FR-006 View-only hub**: Monitoring hub MUST be view-only (no start/rerun/cancel/delete controls) and MUST link back to owning feature surfaces.
+- **FR-007 Start surfaces always enqueue**: Every Phase 1 start surface MUST authorize start, create/reuse a canonical run (dedupe), dispatch background execution, and return immediately with confirmation + “View run”.
+- **FR-008 No remote work in interactive request**: Start surfaces MUST NOT perform remote work inline; long-running work happens in background execution.
+- **FR-009 Deterministic idempotency**: For each run type, the system MUST define a deterministic identity for “identical run” based on tenant + effective inputs; initiator MUST NOT be part of identity. **Enforcement**: Uniqueness MUST be enforced via a partial unique index on `(tenant_id, run_identity_hash)` where status is `queued` or `running`.
+- **FR-010 Phase 1 identity rules**: Identity rules MUST be defined at least as follows:
+  - `inventory.sync`: tenant + selection scope
+  - `policy.sync`: tenant + effective policy scope
+  - `directory_groups.sync`: tenant + selection (Phase 1 default: “all groups”)
+  - `backup_set.add_policies`: tenant + backup set + selected policies + option flags (if exposed)
+  - `drift.generate`: tenant + scope key + baseline/current comparison inputs
+- **FR-011 Outcome buckets**: Monitoring MUST present consistent outcome buckets: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`.
+- **FR-012 Partial vs failed**: “Partially succeeded” means at least one success and at least one failure; “Failed” means zero successes or cannot proceed.
+- **FR-013 Failure details are safe + useful**: Failures MUST be persisted and displayed as stable reason codes and short sanitized messages; failures MUST NOT include secrets/tokens/credentials/PII or full external payload dumps.
+- **FR-014 Related links**: Run detail MUST include contextual links where applicable (e.g., drift findings, backup set, inventory results, directory groups, restore detail for `restore.execute`).
+- **FR-015 Notifications**: System MUST emit in-app notifications for “queued” (after start) and terminal outcomes for Phase 1 runs; notifications MUST include a short summary and a “View run” link; recipients are the initiating user only.
+- **FR-016 Tenant isolation**: All run list/detail access MUST be tenant-scoped; cross-tenant access MUST be denied without disclosing run details.
+- **FR-017 No render-time remote calls**: Monitoring pages MUST be render-safe and MUST NOT depend on external service calls during render.
+- **FR-018 Roles & permissions**: Roles `Owner`, `Manager`, `Operator`, and `Readonly` MUST be able to view runs; only `Owner`, `Manager`, `Operator` may start operations; `Readonly` is strictly view-only.
+
+### Key Entities *(include if feature involves data)*
+
+- **Canonical Operation Run**: A tenant-scoped record representing the lifecycle of a long-running operation, including run type, initiator (nullable `user_id` FK + `initiator_name` string), lifecycle state/timestamps, outcome bucket, summary counts, safe failure summaries, idempotency identity (uniqueness enforced by DB index on active runs), and safe context references.
+- **Restore domain record (exception)**: Restore remains a domain workflow record with richer semantics and history. Monitoring shows restore activity through a physical `operation_runs` row (adapter) that links back to the restore record, without replacing it.
+
+## Success Criteria *(mandatory)*
+
+### Measurable Outcomes
+
+- **SC-001**: Operators can answer “what ran, when, and did it succeed?” for any Phase 1 run in under 1 minute using Monitoring → Operations.
+- **SC-002**: Starting a Phase 1 operation returns confirmation + “View run” link within 2 seconds under normal conditions.
+- **SC-003**: Duplicate starts reuse the same active run in at least 99% of attempts under normal conditions.
+- **SC-004**: No secrets/tokens/credentials/PII appear in persisted failures or notifications (verified by tests).
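FR-012's distinction between "partially succeeded" and "failed" reduces to a small decision function. The sketch below is one possible reading; `deriveOutcome` is a hypothetical name, and treating a run with zero successes and zero failures as `failed` follows the literal "zero successes" wording, which the spec may want to revisit for empty runs.

```php
<?php
declare(strict_types=1);

/**
 * Map per-item counts to a terminal outcome per FR-012:
 * - at least one success and no failures      -> succeeded
 * - at least one success and one failure      -> partially_succeeded
 * - zero successes (including an empty run)   -> failed (literal reading)
 */
function deriveOutcome(int $succeeded, int $failed): string
{
    if ($succeeded > 0 && $failed === 0) {
        return 'succeeded';
    }
    if ($succeeded > 0 && $failed > 0) {
        return 'partially_succeeded';
    }
    return 'failed';
}

$outcome = deriveOutcome(8, 2);
```

A "cannot proceed" failure (for example, an exception before any item is processed) would bypass the counts and be recorded as `failed` directly.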
--- a/specs/054-unify-runs-suitewide/tasks.md
+++ b/specs/054-unify-runs-suitewide/tasks.md
@@ -0,0 +1,64 @@
+# Tasks: Unified Operations Runs Suitewide
+
+**Feature**: `054-unify-runs-suitewide`
+**Spec**: `specs/054-unify-runs-suitewide/spec.md`
+
+## Phase 1: Foundation (DB & Service)
+
+- [ ] **Migration**: Create `operation_runs` table with partial unique index on `(tenant_id, run_identity_hash)` where status in `queued, running`.
+- [ ] **Model**: Create `OperationRun` model with casts (JSONB for summaries/context), relationship to `Tenant` and `User`.
+- [ ] **Service**: Implement `OperationRunService::ensureRun()` (idempotent creation) and `updateRun()` methods.
+- [ ] **Test**: Feature test for `ensureRun` verifying idempotency (same hash = same run) and concurrency safety (simulated).
+- [ ] **Test**: Feature test for `updateRun` verifying status transitions and history logging (if any).
+- [ ] **Job Middleware**: Create `TrackOperationRun` middleware to automatically handle job success/failure updates for jobs using this system.
+- [ ] **Retention**: Create a daily scheduled job to prune `operation_runs` older than 90 days.
+
+## Phase 2: Monitoring UI (Read-Only)
+
+- [ ] **Page**: Create Filament Page `Monitoring/Operations` (List) strictly scoped to current tenant.
+- [ ] **Table**: Implement `OperationRun` table with columns: Status (Badge), Operation Type, Initiator, Started At, Duration, Outcome.
+- [ ] **Filters**: Add table filters for `Type`, `Outcome`, `Date Range`, `Initiator`.
+- [ ] **Detail View**: Create "View Run" modal or separate page showing:
+    - Summary counts (Success/Fail/Total)
+    - Failure list (Sanitized codes/messages)
+    - Context JSON (Debug info)
+    - Timeline (Created/Started/Finished)
+- [ ] **Test**: Livewire test verifying `Readonly` users can see table but no actions.
+- [ ] **Test**: Verify cross-tenant access is blocked.
+
+## Phase 3: Producer Migration (Parallel Write)
+
+### Inventory Sync (`inventory.sync`)
+- [ ] **Refactor**: Update `RunInventorySyncJob` dispatch logic to call `OperationRunService::ensureRun()` first.
+- [ ] **Refactor**: Update Job to use `TrackOperationRun` middleware (or manual updates) to sync status to `operation_runs`.
+- [ ] **Verify**: Ensure legacy `inventory_sync_runs` is still written to (if legacy UI depends on it) OR confirm legacy UI is replaced. *Decision: Parallel write as per spec.*
+
+### Policy Sync (`policy.sync`)
+- [ ] **Refactor**: Update Policy Sync start logic to use `OperationRunService`.
+- [ ] **Refactor**: Instrument Policy Sync job to update `operation_runs`.
+
+### Directory Groups Sync (`directory_groups.sync`)
+- [ ] **Refactor**: Update Group Sync start logic to use `OperationRunService`.
+- [ ] **Refactor**: Instrument Group Sync job to update `operation_runs`.
+
+### Drift Generation (`drift.generate`)
+- [ ] **Refactor**: Update Drift Generation start logic to use `OperationRunService`.
+- [ ] **Refactor**: Instrument Drift job to update `operation_runs`.
+
+### Backup Set (`backup_set.add_policies`)
+- [ ] **Refactor**: Update "Add Policies" action to use `OperationRunService`.
+
+## Phase 4: Restore Adapter
+
+- [ ] **Listener**: Create `SyncRestoreRunToOperation` listener observing `RestoreRun` events (`created`, `updated`).
+- [ ] **Logic**: Map `RestoreRun` status/outcomes to `OperationRun` schema.
+    - `RestoreRun` created -> `OperationRun` created (queued/running).
+    - `RestoreRun` updated -> `OperationRun` updated.
+- [ ] **Context**: Store `{"restore_run_id": <id>}` in `OperationRun.context`.
+- [ ] **Test**: Verify creating a `RestoreRun` automatically spawns a shadow `OperationRun`.
+
+## Phase 5: Notifications & Polish
+
+- [ ] **Notifications**: Implement Database Notifications for "Run Started" (with link) and "Run Completed" (with outcome).
+- [ ] **Frontend**: Ensure "View Run" link in Toast notifications correctly opens the Monitoring Detail view.
+- [ ] **Final Verify**: Run through the `requirements.md` checklist manually.