From 48b558db93abddfcb91a5b9ead91a5a2870eec99 Mon Sep 17 00:00:00 2001 From: Ahmed Darrazi Date: Fri, 16 Jan 2026 19:06:30 +0100 Subject: [PATCH] docs: unified operations runs specs and plan (054) --- GEMINI.md | 8 + .../checklists/requirements.md | 30 ++++ .../contracts/admin-pages.openapi.yaml | 55 +++++++ .../contracts/routes.md | 18 +++ .../contracts/service_interface.md | 48 ++++++ specs/054-unify-runs-suitewide/data-model.md | 52 ++++++ specs/054-unify-runs-suitewide/plan.md | 76 +++++++++ specs/054-unify-runs-suitewide/quickstart.md | 49 ++++++ specs/054-unify-runs-suitewide/research.md | 65 ++++++++ specs/054-unify-runs-suitewide/spec.md | 148 ++++++++++++++++++ specs/054-unify-runs-suitewide/tasks.md | 64 ++++++++ 11 files changed, 613 insertions(+) create mode 100644 specs/054-unify-runs-suitewide/checklists/requirements.md create mode 100644 specs/054-unify-runs-suitewide/contracts/admin-pages.openapi.yaml create mode 100644 specs/054-unify-runs-suitewide/contracts/routes.md create mode 100644 specs/054-unify-runs-suitewide/contracts/service_interface.md create mode 100644 specs/054-unify-runs-suitewide/data-model.md create mode 100644 specs/054-unify-runs-suitewide/plan.md create mode 100644 specs/054-unify-runs-suitewide/quickstart.md create mode 100644 specs/054-unify-runs-suitewide/research.md create mode 100644 specs/054-unify-runs-suitewide/spec.md create mode 100644 specs/054-unify-runs-suitewide/tasks.md diff --git a/GEMINI.md b/GEMINI.md index d9e766e..220d86e 100644 --- a/GEMINI.md +++ b/GEMINI.md @@ -669,3 +669,11 @@ ### Replaced Utilities | decoration-slice | box-decoration-slice | | decoration-clone | box-decoration-clone | + +## Recent Changes +- 054-unify-runs-suitewide: Added PHP 8.4 + Filament v4, Laravel v12, Livewire v3 +- 054-unify-runs-suitewide: Added [if applicable, e.g., PostgreSQL, CoreData, files or N/A] +- 054-unify-runs-suitewide: Added PHP 8.4 + Filament v4, Laravel v12, Livewire v3 + +## Active Technologies +- PostgreSQL 
(`operation_runs` table + JSONB) (054-unify-runs-suitewide) diff --git a/specs/054-unify-runs-suitewide/checklists/requirements.md b/specs/054-unify-runs-suitewide/checklists/requirements.md new file mode 100644 index 0000000..2886bac --- /dev/null +++ b/specs/054-unify-runs-suitewide/checklists/requirements.md @@ -0,0 +1,30 @@ +# Requirements Checklist: Unified Operations Runs + +## Phase 1 Adoption Set +- [x] `inventory.sync` (Inventory “Sync now”) covered in spec +- [x] `policy.sync` (Policies “Sync now”) covered in spec +- [x] `directory_groups.sync` (Directory → Groups “Sync groups”) covered in spec +- [x] `drift.generate` (Drift “Generate drift now”) covered in spec +- [x] `backup_set.add_policies` (Backup Sets “Add selected”) covered in spec +- [x] `restore.execute` (adapter mode) covered in spec + +## Critical Clarifications (Pinned) +- [x] Retention policy defined (90 days default) +- [x] Transition strategy defined (Parallel write: Canonical + Legacy) +- [x] Concurrency enforcement defined (Partial unique index on active runs) +- [x] Initiator model defined (Nullable FK + Name Snapshot) +- [x] Restore integration defined (Physical adapter row pointing to Restore Domain record) + +## Functional Requirements (Spec Coverage) +- [x] FR-001 Canonical Operation Run schema defined (see `data-model.md`) +- [x] FR-004 Monitoring List UI specified (filters/sort defined in Spec FR-004) +- [x] FR-005 Monitoring Detail UI specified (content defined in Spec FR-005) +- [x] FR-007 Start surfaces behavior specified (Spec FR-007) +- [x] FR-009 Idempotency (Partial Unique Index) strategy defined (Spec FR-009, Plan) +- [x] FR-015 Notifications for queued/terminal states specified (Spec FR-015) +- [x] FR-016 Tenant isolation rules specified (Spec FR-016) + +## Non-Functional (Spec Coverage) +- [x] SC-002 Start confirmation < 2s target defined (Spec SC-002) +- [x] SC-003 Deduplication rate > 99% strategy defined (Spec SC-003) +- [x] SC-004 No secrets in failure logs rule 
defined (Spec SC-004) diff --git a/specs/054-unify-runs-suitewide/contracts/admin-pages.openapi.yaml b/specs/054-unify-runs-suitewide/contracts/admin-pages.openapi.yaml new file mode 100644 index 0000000..c3d5238 --- /dev/null +++ b/specs/054-unify-runs-suitewide/contracts/admin-pages.openapi.yaml @@ -0,0 +1,55 @@ +openapi: 3.0.3 +info: + title: TenantPilot Admin Operations Contracts (Feature 054) + version: 0.1.0 + description: | + Minimal page-render contracts for the Monitoring/Operations hub. + + These pages must render from the database only (no external tenant calls) + and display only sanitized failure detail (no secrets/tokens/raw payload dumps). + +servers: + - url: / + +paths: + /admin/t/{tenantExternalId}/bulk-operation-runs: + get: + operationId: monitoringOperationsIndex + summary: Monitoring → Operations (tenant-scoped) + parameters: + - name: tenantExternalId + in: path + required: true + schema: + type: string + responses: + '200': + description: Page renders successfully. + '302': + description: Redirect to login when unauthenticated. + + /admin/t/{tenantExternalId}/bulk-operation-runs/{bulkOperationRunId}: + get: + operationId: monitoringOperationsView + summary: Operation run detail (tenant-scoped) + parameters: + - name: tenantExternalId + in: path + required: true + schema: + type: string + - name: bulkOperationRunId + in: path + required: true + schema: + type: integer + responses: + '200': + description: Page renders successfully. + '302': + description: Redirect to login when unauthenticated. + '403': + description: Forbidden when attempting cross-tenant access. 
+ +components: {} + diff --git a/specs/054-unify-runs-suitewide/contracts/routes.md b/specs/054-unify-runs-suitewide/contracts/routes.md new file mode 100644 index 0000000..f139181 --- /dev/null +++ b/specs/054-unify-runs-suitewide/contracts/routes.md @@ -0,0 +1,18 @@ +# Routes & URLs + +## Monitoring UI + +### List Operations +- **Route**: `tenant.monitoring.operations.index` +- **URL**: `/tenants/{tenant}/monitoring/operations` +- **Controller**: Livewire Component (`App\Livewire\Monitoring\OperationsList`) + +### View Operation +- **Route**: `tenant.monitoring.operations.show` +- **URL**: `/tenants/{tenant}/monitoring/operations/{run}` +- **Controller**: Livewire Component (`App\Livewire\Monitoring\OperationsDetail`) + +## Deep Links +- **Drift**: `/tenants/{tenant}/drift/history/{id}` +- **Inventory**: `/tenants/{tenant}/inventory` (General, or specific timestamp if supported) +- **Restore**: `/tenants/{tenant}/restore/{id}` \ No newline at end of file diff --git a/specs/054-unify-runs-suitewide/contracts/service_interface.md b/specs/054-unify-runs-suitewide/contracts/service_interface.md new file mode 100644 index 0000000..b3db982 --- /dev/null +++ b/specs/054-unify-runs-suitewide/contracts/service_interface.md @@ -0,0 +1,48 @@ +# Service Interface: Operation Runs + +## `App\Services\OperationRunService` + +### `ensureRun` +Idempotently creates or retrieves an active run. + +```php +public function ensureRun( + Tenant $tenant, + string $type, + array $inputs, + ?User $initiator = null +): OperationRun +``` + +- **Logic**: + 1. Compute `hash = sha256(tenant_id + type + sorted_json(inputs))`. + 2. Try finding active run (`queued` or `running`) with this hash. + 3. If found, return it. + 4. If not found, create new `queued` run. + 5. Return run. + +### `updateRun` +Updates the status/outcome of a run. 
+ +```php +public function updateRun( + OperationRun $run, + string $status, + ?string $outcome = null, + array $summaryCounts = [], + array $failures = [] +): OperationRun +``` + +### `failRun` +Helper to fail a run immediately. + +```php +public function failRun(OperationRun $run, Throwable $e): OperationRun +``` + +## `App\Jobs\Middleware\TrackOperationRun` +Middleware for Jobs to automatically handle `running` -> `completed`/`failed` transitions if bound to a run. + +## `App\Listeners\SyncRestoreRunToOperation` +Listener for `RestoreRun` events to update the shadow `OperationRun`. \ No newline at end of file diff --git a/specs/054-unify-runs-suitewide/data-model.md b/specs/054-unify-runs-suitewide/data-model.md new file mode 100644 index 0000000..589e6ac --- /dev/null +++ b/specs/054-unify-runs-suitewide/data-model.md @@ -0,0 +1,52 @@ +# Data Model: Unified Operations Runs + +## Entities + +### `OperationRun` +Canonical record for all long-running tenant operations. + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `id` | BigInt | Yes | Primary Key | +| `tenant_id` | BigInt | Yes | FK to Tenants | +| `user_id` | BigInt | No | FK to Users (Initiator). Null for system/scheduler. | +| `initiator_name` | String | Yes | Snapshot of user name or "System". | +| `type` | String | Yes | stable taxonomy e.g., `inventory.sync`. | +| `status` | String | Yes | Lifecycle state: `queued`, `running`, `completed`. | +| `outcome` | String | Yes | Result bucket: `pending`, `succeeded`, `partially_succeeded`, `failed`, `cancelled`. | +| `run_identity_hash` | String | Yes | Deterministic hash for idempotency. | +| `summary_counts` | JSONB | No | `{ "total": 10, "success": 8, "failed": 2, "skipped": 0 }` | +| `failure_summary` | JSONB | No | List of sanitized errors: `[{ "code": "GraphError", "message": "Throttled", "count": 1 }]` | +| `context` | JSONB | No | Run-specific metadata. e.g., `{ "restore_run_id": 123, "selection": [...] 
}` | +| `started_at` | Timestamp | No | When execution began. | +| `completed_at` | Timestamp | No | When execution finished. | +| `created_at` | Timestamp | Yes | | +| `updated_at` | Timestamp | Yes | | + +**Indexes**: +- `(tenant_id, run_identity_hash)` UNIQUE WHERE status IN ('queued', 'running') +- `(tenant_id, type, created_at)` for filtering/sorting +- `(tenant_id, created_at)` for default sort + +### `RestoreRun` (Existing) +Remains the domain source of truth for Restore. +- Linked via `OperationRun.context['restore_run_id']`. +- `OperationRun` mirrors `RestoreRun` status/outcome. + +## Enums + +### `OperationRunStatus` +- `queued` +- `running` +- `completed` + +### `OperationRunOutcome` +- `pending` (default when running/queued) +- `succeeded` +- `partially_succeeded` +- `failed` +- `cancelled` + +## Relationships +- `OperationRun` belongs to `Tenant`. +- `OperationRun` belongs to `User` (optional). diff --git a/specs/054-unify-runs-suitewide/plan.md b/specs/054-unify-runs-suitewide/plan.md new file mode 100644 index 0000000..0a09e67 --- /dev/null +++ b/specs/054-unify-runs-suitewide/plan.md @@ -0,0 +1,76 @@ +# Implementation Plan: Unified Operations Runs Suitewide + +**Branch**: `feat/054-unify-operations-runs-suitewide` | **Date**: 2026-01-16 | **Spec**: [Spec Link](spec.md) +**Input**: Feature specification from `specs/054-unify-runs-suitewide/spec.md` + +## Summary + +This feature unifies long-running tenant operations (e.g., Inventory Sync, Drift Generation) into a single canonical `operation_runs` table. This enables a consistent "Monitoring -> Operations" view for all tenant activities. Legacy run tables will be maintained in parallel for now (Parallel Write Transition). `RestoreRun` remains a domain-specific record but will be mirrored into `operation_runs` via an adapter pattern. 
+ +## Technical Context + +**Language/Version**: PHP 8.4 +**Primary Dependencies**: Filament v4, Laravel v12, Livewire v3 +**Storage**: PostgreSQL (`operation_runs` table + JSONB) +**Testing**: Pest v4 (Feature tests for Service, Livewire tests for UI) +**Target Platform**: Linux server (Docker/Dokploy) +**Project Type**: Web Application (Laravel Monolith) +**Performance Goals**: Start operation < 2s. List runs < 200ms. +**Constraints**: Tenant isolation is paramount. No cross-tenant data leakage. +**Scale/Scope**: ~50-100 runs/day per tenant. Retention 90 days. + +## Constitution Check + +*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.* + +- [x] Inventory-first: N/A (this is about tracking operations, not inventory state itself) +- [x] Read/write separation: Monitoring is read-only. Starts are explicit writes. +- [x] Graph contract path: N/A (this feature tracks runs, doesn't call Graph directly) +- [x] Deterministic capabilities: N/A +- [x] Tenant isolation: `operation_runs` has `tenant_id`. Policies ensure scope. +- [x] Automation: Idempotency enforced via DB index. +- [x] Data minimization: No secrets in `failure_summary`. + +## Project Structure + +### Documentation (this feature) + +```text +specs/054-unify-runs-suitewide/ +├── plan.md # This file +├── research.md # Research findings +├── data-model.md # Database schema +├── quickstart.md # Dev guide +├── contracts/ # Service interfaces & routes +└── tasks.md # Task breakdown +``` + +### Source Code (repository root) + +```text +app/ +├── Models/ +│ └── OperationRun.php +├── Services/ +│ └── OperationRunService.php +├── Livewire/ +│ └── Monitoring/ +│ ├── OperationsList.php +│ └── OperationsDetail.php +├── Jobs/ +│ └── Middleware/ +│ └── TrackOperationRun.php +└── Listeners/ + └── SyncRestoreRunToOperation.php + +database/migrations/ +└── YYYY_MM_DD_create_operation_runs_table.php +``` + +**Structure Decision**: Standard Laravel Service/Model/Livewire pattern. 
+ +## Complexity Tracking + +| Violation | Why Needed | Simpler Alternative Rejected Because | +|-----------|------------|-------------------------------------| +| None | | | \ No newline at end of file diff --git a/specs/054-unify-runs-suitewide/quickstart.md b/specs/054-unify-runs-suitewide/quickstart.md new file mode 100644 index 0000000..9a99384 --- /dev/null +++ b/specs/054-unify-runs-suitewide/quickstart.md @@ -0,0 +1,49 @@ +# Quickstart: Adding a New Operation + +## 1. Register Run Type +Add your new type constant to `App\Enums\OperationRunType` (if using Enums) or just use the string convention `resource.action`. + +## 2. Implement Idempotency Inputs +Define what makes a run "unique" for your feature. +- Example: `['scope' => 'full']` vs `['scope' => 'policy', 'policy_id' => 1]`. + +## 3. Use `OperationRunService` +In your Start Action (Controller/Livewire): + +```php +// 1. Ensure Run +$run = $service->ensureRun($tenant, 'my_resource.action', $inputs, auth()->user()); + +// 2. Dispatch Job (if new) +if ($run->wasRecentlyCreated) { + MyJob::dispatch($run, $inputs); +} + +// 3. Return View Link +return redirect()->route('tenant.monitoring.operations.show', [$tenant, $run]); +``` + +## 4. Instrument Job +In your Job: + +```php +public function handle() +{ + // Update to Running + $this->run->updateStatus(status: 'running'); + + try { + // ... do work ... + + // Success + $this->run->updateStatus( + status: 'completed', + outcome: 'succeeded', + summary: ['processed' => 100] + ); + } catch (\Throwable $e) { + // Failure + $this->run->fail($e); + } +} +``` diff --git a/specs/054-unify-runs-suitewide/research.md b/specs/054-unify-runs-suitewide/research.md new file mode 100644 index 0000000..679092c --- /dev/null +++ b/specs/054-unify-runs-suitewide/research.md @@ -0,0 +1,65 @@ +# Research: Unified Operations Runs Suitewide + +## 1. Technical Context & Unknowns + +**Unknowns Resolved**: +- **Transition Strategy**: Parallel write. 
We will maintain existing legacy tables (e.g., `inventory_sync_runs`, `restore_runs`) for now but strictly use `operation_runs` for the Monitoring UI. +- **Restore Adapter**: `RestoreRun` remains the domain source of truth. An `OperationRun` record will be created as a "shadow" or "adapter" record. This requires hooking into `RestoreRun` lifecycle events or the service layer to keep them in sync. +- **Run Logic Location**: Existing jobs like `RunInventorySyncJob` will be updated to manage the `OperationRun` state. +- **Concurrency**: Enforced by partial unique index on `(tenant_id, run_identity_hash)` where status is active (`queued`, `running`). + +## 2. Technology Choices + +| Area | Decision | Rationale | Alternatives | +|------|----------|-----------|--------------| +| **Schema** | `operation_runs` table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (Complex, harder to paginate/sort efficiently). | +| **Restore Integration** | Physical Adapter Row | Decouples Monitoring from Restore domain specifics. Allows uniform "list all runs" queries. The `context` JSON column will store `{ "restore_run_id": ... }`. | Polymorphic relation (Overhead for a single exception). | +| **Idempotency** | DB Partial Unique Index | Hard guarantee against race conditions. Simpler than distributed locks (Redis) which can expire or fail. | Redis Lock (Soft guarantee), Application check (Race prone). | +| **Initiator** | Nullable FK + Name | Handles both Users (FK) and System/Scheduler (Name "System") uniformly. | Polymorphic relation (Overkill for simple auditing). | + +## 3. Implementation Patterns + +### Canonical Run Lifecycle +1. **Start Request**: + - Compute `run_identity_hash` from inputs. + - Attempt `INSERT` into `operation_runs` (ignore conflict if active). + - If active run exists, return it (Idempotency). + - If new, dispatch Job. +2. 
**Job Execution**: + - Update status to `running`. + - Perform work. + - Update status to `completed` and set outcome to `succeeded`/`partially_succeeded`/`failed`. +3. **Restore Adapter**: + - When `RestoreRun` is created, create `OperationRun` (queued/running). + - When `RestoreRun` updates (status change), update `OperationRun`. + +### Data Model +```sql +CREATE TABLE operation_runs ( + id BIGSERIAL PRIMARY KEY, + tenant_id BIGINT NOT NULL REFERENCES tenants(id), + user_id BIGINT NULL REFERENCES users(id), -- Initiator + initiator_name VARCHAR(255) NOT NULL, -- "John Doe" or "System" + type VARCHAR(255) NOT NULL, -- "inventory.sync" + status VARCHAR(50) NOT NULL, -- queued, running, completed + outcome VARCHAR(50) NOT NULL, -- pending, succeeded, partially_succeeded, failed, cancelled + run_identity_hash VARCHAR(64) NOT NULL, -- SHA256(tenant_id + type + sorted inputs) + summary_counts JSONB DEFAULT '{}', -- { success: 10, failed: 2 } + failure_summary JSONB DEFAULT '[]', -- [{ code: "ERR_TIMEOUT", message: "..." }] + context JSONB DEFAULT '{}', -- { selection: [...], restore_run_id: 123 } + started_at TIMESTAMP NULL, + completed_at TIMESTAMP NULL, + created_at TIMESTAMP, + updated_at TIMESTAMP +); + +CREATE UNIQUE INDEX operation_runs_active_unique +ON operation_runs (tenant_id, run_identity_hash) +WHERE status IN ('queued', 'running'); +``` + +## 4. Risks & Mitigations +- **Risk**: Desync between `RestoreRun` and `OperationRun`. + - **Mitigation**: Use model observers or service-layer wrapping to ensure atomic-like updates, or accept slight eventual consistency (Monitoring might lag milliseconds behind the Restore UI). +- **Risk**: Legacy runs not appearing. + - **Mitigation**: We are NOT backfilling legacy runs. Only new runs after deployment will appear in the new Monitoring UI. This is acceptable for "Phase 1". 
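The `run_identity_hash` that the partial unique index relies on could be computed roughly as follows. This is a minimal sketch, not the actual `OperationRunService` implementation: the helper names `canonicalize` and `runIdentityHash` are hypothetical, and it assumes that associative keys are order-insensitive while list order (for example a `selection` array) is significant.

```php
<?php
declare(strict_types=1);

/**
 * Recursively sort associative keys so that logically equal inputs
 * serialize to the same JSON. Assumption: list order is meaningful,
 * so plain lists are left in their original order.
 */
function canonicalize(array $value): array
{
    foreach ($value as $key => $item) {
        if (is_array($item)) {
            $value[$key] = canonicalize($item);
        }
    }
    if (!array_is_list($value)) {
        ksort($value);
    }
    return $value;
}

/**
 * Hypothetical helper: SHA-256 over tenant id, run type, and canonical inputs.
 * The initiator is deliberately not part of the identity.
 */
function runIdentityHash(int $tenantId, string $type, array $inputs): string
{
    $payload = json_encode(
        ['tenant_id' => $tenantId, 'type' => $type, 'inputs' => canonicalize($inputs)],
        JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES
    );
    return hash('sha256', $payload);
}
```

With this shape, `['scope' => 'policy', 'policy_id' => 1]` and `['policy_id' => 1, 'scope' => 'policy']` yield the same 64-character hash, which fits the `VARCHAR(64)` column above. Callers that treat a selection as an unordered set would need to sort it before hashing.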
diff --git a/specs/054-unify-runs-suitewide/spec.md b/specs/054-unify-runs-suitewide/spec.md new file mode 100644 index 0000000..b54f7a3 --- /dev/null +++ b/specs/054-unify-runs-suitewide/spec.md @@ -0,0 +1,148 @@ +# Feature Specification: Unified Operations Runs Suitewide (Except Restore Domain Model) (054) + +**Feature Branch**: `feat/054-unify-operations-runs-suitewide` +**Created**: 2026-01-16 +**Status**: Draft +**Input**: User description: "Eliminate run sprawl by adopting one canonical tenant-scoped operation run record for long-running actions across the product, surfaced consistently in Monitoring → Operations, while keeping restore as a separate domain workflow that is still visible via an adapter entry." + +## Clarifications + +### Session 2026-01-16 + +- Q: Which default retention should 054 set for canonical Operation Runs? → A: 90 days +- Q: Transition strategy in 054: do we write canonical runs in parallel with the legacy run tables, or replace them immediately? → A: Parallel write (canonical + legacy) +- Q: For `restore.execute`, the spec mentions it acts as an "adapter entry" linking to the restore domain record. How should this be implemented? → A: Physical Row (Create a physical row in `operation_runs` that points to the restore record). +- Q: How should concurrency and deduplication (FR-009) be enforced at the database level? → A: Partial Unique Index (unique constraint on `tenant_id, run_identity_hash` where status is `queued` or `running`). +- Q: How should the `initiator` be modeled to support both users and system processes (FR-001)? → A: Nullable FK + Name Snapshot (`user_id` nullable FK + required `initiator_name` string). 
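The dedupe semantics agreed in the clarifications above (at most one active run per tenant and identity hash) can be illustrated with a small in-memory model. This is only a sketch of the intended behaviour: the class `InMemoryRunStore` and its methods are hypothetical, and unlike the database index it offers no protection against concurrent writers.

```php
<?php
declare(strict_types=1);

/** Hypothetical in-memory stand-in for the `operation_runs` table. */
final class InMemoryRunStore
{
    /** @var array<int, array{id:int, tenant_id:int, hash:string, status:string}> */
    private array $runs = [];
    private int $nextId = 1;

    /**
     * Returns [run, created]. An existing run is reused only while it is
     * still active (queued or running), mirroring the partial unique index.
     */
    public function ensureRun(int $tenantId, string $hash): array
    {
        foreach ($this->runs as $run) {
            if ($run['tenant_id'] === $tenantId
                && $run['hash'] === $hash
                && in_array($run['status'], ['queued', 'running'], true)) {
                return [$run, false];
            }
        }
        $run = ['id' => $this->nextId++, 'tenant_id' => $tenantId, 'hash' => $hash, 'status' => 'queued'];
        $this->runs[$run['id']] = $run;
        return [$run, true];
    }

    /** Marks a run as completed so its identity becomes available again. */
    public function complete(int $id): void
    {
        $this->runs[$id]['status'] = 'completed';
    }
}
```

A second start with the same tenant and hash returns the first run with `created = false`; once that run is completed, the next start creates a fresh run. The same hash under a different tenant is always independent.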
+ +## User Scenarios & Testing *(mandatory)* + +### User Story 1 - See Every Supported Operation in Monitoring (Priority: P1) + +As an operator, I want Monitoring → Operations to show all supported long-running operations for my tenant in one consistent list and detail view, so I can quickly answer what ran, who started it, whether it succeeded/partially succeeded/failed, and where to look next. + +**Why this priority**: This is the core value: a single, tenant-scoped source of truth for operational visibility. + +**Independent Test**: Trigger at least one run of each Phase 1 run producer, then verify each appears in Monitoring with consistent status/outcome semantics, safe failure summaries, and context links. + +**Acceptance Scenarios**: + +1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see only tenant A runs and can filter by run type, outcome bucket, time range, and initiator. +2. **Given** multiple run types exist, **When** I filter to `inventory.sync`, **Then** only inventory sync runs are shown. +3. **Given** a run exists, **When** I open its detail view, **Then** I can see initiator, run type, outcome bucket, timestamps, summary counts (if applicable), sanitized failures (if any), and links to relevant feature context/results. +4. **Given** restore execution exists, **When** I open Monitoring → Operations, **Then** I can see a `restore.execute` entry that links to the existing restore record (restore history remains owned by the restore domain record). +5. **Given** I am a `Readonly` user in tenant A, **When** I view Monitoring → Operations, **Then** I can view runs and details but I do not see any start/rerun/cancel/delete controls. +6. **Given** I attempt to access a run from another tenant (direct link or list), **When** I request it, **Then** access is denied and no run details are disclosed. 
+ +--- + +### User Story 2 - Start Operations Without Blocking (Priority: P2) + +As an operator, when I start a supported operation, I want immediate confirmation and a “View run” link so I can continue working while the operation runs in the background. + +**Why this priority**: Removes long-running requests/timeouts and standardizes how operations are started and observed. + +**Independent Test**: Start each Phase 1 operation from its owning UI and confirm the start returns quickly, includes “View run”, and the run progresses through queued/running into a terminal outcome. + +**Acceptance Scenarios**: + +1. **Given** I have permission to start a Phase 1 operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run is visible as queued or running. +2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start any Phase 1 operation, **Then** the system denies the request and does not create a new run. +3. **Given** the run reaches a terminal outcome, **When** that occurs, **Then** the initiating user receives an in-app notification including a short summary and a “View run” link. +4. **Given** background processing is unavailable, **When** I attempt to start an operation, **Then** I receive a clear message and the system MUST NOT claim it was queued. + +--- + +### User Story 3 - Duplicate Starts Reuse the Same Active Run (Priority: P3) + +As an operator, I want accidental double-starts (double clicks, two admins, retries) to reuse the same active run so duplicate background work is avoided and results remain auditable. + +**Why this priority**: Reduces load, prevents confusing duplicate outcomes, and makes operations safer under concurrency. + +**Independent Test**: Start the same operation twice with identical effective inputs while the first is queued/running and verify the system reuses the active run. + +**Acceptance Scenarios**: + +1. 
**Given** an identical run is queued/running for a tenant, **When** another start request is made with the same effective inputs, **Then** the system reuses the existing run and does not start a second one. +2. **Given** two starts happen at nearly the same time, **When** the system resolves the race, **Then** at most one active run exists for that identity and both users are directed to it. + +### Edge Cases + +- Background execution unavailable: start fails fast with a clear message; the system MUST NOT create misleading “queued” runs. +- Partial processing: at least one success and at least one failure yields “partially succeeded”, with per-item failures when applicable. +- Large run history: Monitoring remains usable with filters and defaults (recent runs, last 30 days). +- Permissions revoked mid-run: the run continues; visibility is evaluated at time of access. + +## Requirements *(mandatory)* + +**Constitution alignment (required):** If this feature introduces any external tenant API calls or any write/change behavior, +the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests. + +### Scope & Assumptions + +**Phase 1 adoption set (must be implemented):** + +- `inventory.sync` (Inventory “Sync now”) +- `policy.sync` (Policies “Sync now”) +- `directory_groups.sync` (Directory → Groups “Sync groups”) +- `drift.generate` (Drift “Generate drift now” / auto-on-open when eligible) +- `backup_set.add_policies` (Backup Sets “Add selected” / “Add policies”) + +**Restore visibility (adapter only):** + +- `restore.execute` appears as a canonical run entry that links to an existing restore domain record. +- Restore execution history remains owned by the restore domain record (not replaced in Phase 1). 
+ +**Out of scope for 054 (explicit):** + +- Cross-tenant compare/promotion +- UI redesign/styling polish (separate UI polish work) +- Cancel/rerun/delete controls inside Monitoring hub (hub stays view-only) +- Replacing restore domain records with canonical runs +- A full settings UI for retention/notifications/etc. + +**Assumptions (defaults to remove ambiguity in Phase 1):** + +- Canonical run history retention defaults to 90 days, with no user-facing retention configuration in 054. +- System-initiated runs (if any) do not notify users by default in Phase 1. +- Transition strategy: write canonical runs in parallel with any existing legacy per-module run tables (where they exist); Monitoring uses canonical runs as the source of truth immediately. + +### Functional Requirements + +- **FR-001 Canonical Operation Run**: System MUST represent each supported operation execution as a canonical, tenant-scoped operation run record that captures initiator (nullable `user_id` FK + `initiator_name` string), run type, lifecycle status/timestamps, outcome bucket, summary counts (when applicable), safe failure summaries, an idempotency identity for dedupe, and a safe context payload referencing “what this run was about”. +- **FR-002 Run taxonomy**: Run type MUST be stable and follow the `resource.action` convention (e.g., `inventory.sync`). +- **FR-003 Phase 1 run types**: Phase 1 run types MUST include `inventory.sync`, `policy.sync`, `directory_groups.sync`, `drift.generate`, `backup_set.add_policies`, plus `restore.execute` implemented as a physical `operation_runs` record (adapter) pointing to the domain entity. +- **FR-004 Monitoring lists all canonical runs**: Monitoring → Operations MUST list canonical runs for the active tenant with filters for run type, outcome bucket, time range, and initiator; default sort is most recent first; default time window is last 30 days. 
+- **FR-005 Run detail**: Run detail MUST show initiator, run type, outcome bucket, timestamps (created/started/finished), summary counts (when applicable), sanitized failures (including per-item failures when applicable), and contextual links to owning feature surfaces/results. +- **FR-006 View-only hub**: Monitoring hub MUST be view-only (no start/rerun/cancel/delete controls) and MUST link back to owning feature surfaces. +- **FR-007 Start surfaces always enqueue**: Every Phase 1 start surface MUST authorize start, create/reuse a canonical run (dedupe), dispatch background execution, and return immediately with confirmation + “View run”. +- **FR-008 No remote work in interactive request**: Start surfaces MUST NOT perform remote work inline; long-running work happens in background execution. +- **FR-009 Deterministic idempotency**: For each run type, the system MUST define a deterministic identity for “identical run” based on tenant + effective inputs; initiator MUST NOT be part of identity. **Enforcement**: Uniqueness MUST be enforced via a partial unique index on `(tenant_id, run_identity_hash)` where status is `queued` or `running`. +- **FR-010 Phase 1 identity rules**: Identity rules MUST be defined at least as follows: + - `inventory.sync`: tenant + selection scope + - `policy.sync`: tenant + effective policy scope + - `directory_groups.sync`: tenant + selection (Phase 1 default: “all groups”) + - `backup_set.add_policies`: tenant + backup set + selected policies + option flags (if exposed) + - `drift.generate`: tenant + scope key + baseline/current comparison inputs +- **FR-011 Outcome buckets**: Monitoring MUST present consistent outcome buckets: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`. +- **FR-012 Partial vs failed**: “Partially succeeded” means at least one success and at least one failure; “Failed” means zero successes or cannot proceed. 
+- **FR-013 Failure details are safe + useful**: Failures MUST be persisted and displayed as stable reason codes and short sanitized messages; failures MUST NOT include secrets/tokens/credentials/PII or full external payload dumps. +- **FR-014 Related links**: Run detail MUST include contextual links where applicable (e.g., drift findings, backup set, inventory results, directory groups, restore detail for `restore.execute`). +- **FR-015 Notifications**: System MUST emit in-app notifications for “queued” (after start) and terminal outcomes for Phase 1 runs; notifications MUST include a short summary and a “View run” link; recipients are the initiating user only. +- **FR-016 Tenant isolation**: All run list/detail access MUST be tenant-scoped; cross-tenant access MUST be denied without disclosing run details. +- **FR-017 No render-time remote calls**: Monitoring pages MUST be render-safe and MUST NOT depend on external service calls during render. +- **FR-018 Roles & permissions**: Roles `Owner`, `Manager`, `Operator`, and `Readonly` MUST be able to view runs; only `Owner`, `Manager`, `Operator` may start operations; `Readonly` is strictly view-only. + +### Key Entities *(include if feature involves data)* + +- **Canonical Operation Run**: A tenant-scoped record representing the lifecycle of a long-running operation, including run type, initiator (nullable `user_id` FK + `initiator_name` string), lifecycle state/timestamps, outcome bucket, summary counts, safe failure summaries, idempotency identity (uniqueness enforced by DB index on active runs), and safe context references. +- **Restore domain record (exception)**: Restore remains a domain workflow record with richer semantics and history. Monitoring shows restore activity through a physical `operation_runs` row (adapter) that links back to the restore record, without replacing it. 
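FR-012's distinction between "partially succeeded" and "failed" can be written as a small pure function over the summary counts. The function name `outcomeFromCounts` is hypothetical, and the sketch reads the rule literally: a run with zero successes is `failed` even when nothing failed either, which a real implementation may want to treat differently for empty workloads.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper deriving the terminal outcome from item counts.
 * Return values match the `OperationRunOutcome` strings in data-model.md.
 */
function outcomeFromCounts(int $succeeded, int $failed): string
{
    if ($succeeded > 0 && $failed > 0) {
        // At least one success and at least one failure.
        return 'partially_succeeded';
    }
    if ($succeeded > 0) {
        return 'succeeded';
    }
    // Zero successes: treated as failed per FR-012.
    return 'failed';
}
```

For the example counts in the data model (`success: 8`, `failed: 2`), this yields `partially_succeeded`.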
+ +## Success Criteria *(mandatory)* + +### Measurable Outcomes + +- **SC-001**: Operators can answer “what ran, when, and did it succeed?” for any Phase 1 run in under 1 minute using Monitoring → Operations. +- **SC-002**: Starting a Phase 1 operation returns confirmation + “View run” link within 2 seconds under normal conditions. +- **SC-003**: Duplicate starts reuse the same active run in at least 99% of attempts under normal conditions. +- **SC-004**: No secrets/tokens/credentials/PII appear in persisted failures or notifications (verified by tests). diff --git a/specs/054-unify-runs-suitewide/tasks.md b/specs/054-unify-runs-suitewide/tasks.md new file mode 100644 index 0000000..fd8b909 --- /dev/null +++ b/specs/054-unify-runs-suitewide/tasks.md @@ -0,0 +1,64 @@ +# Tasks: Unified Operations Runs Suitewide + +**Feature**: `054-unify-runs-suitewide` +**Spec**: `specs/054-unify-runs-suitewide/spec.md` + +## Phase 1: Foundation (DB & Service) + +- [ ] **Migration**: Create `operation_runs` table with partial unique index on `(tenant_id, run_identity_hash)` where status in `queued, running`. +- [ ] **Model**: Create `OperationRun` model with casts (JSONB for summaries/context), relationship to `Tenant` and `User`. +- [ ] **Service**: Implement `OperationRunService::ensureRun()` (idempotent creation) and `updateRun()` methods. +- [ ] **Test**: Feature test for `ensureRun` verifying idempotency (same hash = same run) and concurrency safety (simulated). +- [ ] **Test**: Feature test for `updateRun` verifying status transitions and history logging (if any). +- [ ] **Job Middleware**: Create `TrackOperationRun` middleware to automatically handle job success/failure updates for jobs using this system. +- [ ] **Retention**: Create a daily scheduled job to prune `operation_runs` older than 90 days. + +## Phase 2: Monitoring UI (Read-Only) + +- [ ] **Page**: Create Filament Page `Monitoring/Operations` (List) strictly scoped to current tenant. 
+- [ ] **Table**: Implement `OperationRun` table with columns: Status (Badge), Operation Type, Initiator, Started At, Duration, Outcome.
+- [ ] **Filters**: Add table filters for `Type`, `Outcome`, `Date Range`, `Initiator`.
+- [ ] **Detail View**: Create "View Run" modal or separate page showing:
+  - Summary counts (Success/Fail/Total)
+  - Failure list (Sanitized codes/messages)
+  - Context JSON (Debug info)
+  - Timeline (Created/Started/Finished)
+- [ ] **Test**: Livewire test verifying `Readonly` users can see table but no actions.
+- [ ] **Test**: Verify cross-tenant access is blocked.
+
+## Phase 3: Producer Migration (Parallel Write)
+
+### Inventory Sync (`inventory.sync`)
+- [ ] **Refactor**: Update `RunInventorySyncJob` dispatch logic to call `OperationRunService::ensureRun()` first.
+- [ ] **Refactor**: Update Job to use `TrackOperationRun` middleware (or manual updates) to sync status to `operation_runs`.
+- [ ] **Verify**: Ensure legacy `inventory_sync_runs` is still written to (if legacy UI depends on it) OR confirm legacy UI is replaced. *Decision: Parallel write as per spec.*
+
+### Policy Sync (`policy.sync`)
+- [ ] **Refactor**: Update Policy Sync start logic to use `OperationRunService`.
+- [ ] **Refactor**: Instrument Policy Sync job to update `operation_runs`.
+
+### Directory Groups Sync (`directory_groups.sync`)
+- [ ] **Refactor**: Update Group Sync start logic to use `OperationRunService`.
+- [ ] **Refactor**: Instrument Group Sync job to update `operation_runs`.
+
+### Drift Generation (`drift.generate`)
+- [ ] **Refactor**: Update Drift Generation start logic to use `OperationRunService`.
+- [ ] **Refactor**: Instrument Drift job to update `operation_runs`.
+
+### Backup Set (`backup_set.add_policies`)
+- [ ] **Refactor**: Update "Add Policies" action to use `OperationRunService`.
+
+## Phase 4: Restore Adapter
+
+- [ ] **Listener**: Create `SyncRestoreRunToOperation` listener observing `RestoreRun` events (`created`, `updated`).
+- [ ] **Logic**: Map `RestoreRun` status/outcomes to `OperationRun` schema.
+  - `RestoreRun` created -> `OperationRun` created (queued/running).
+  - `RestoreRun` updated -> `OperationRun` updated.
+- [ ] **Context**: Store `{"restore_run_id": }` in `OperationRun.context`.
+- [ ] **Test**: Verify creating a `RestoreRun` automatically spawns a shadow `OperationRun`.
+
+## Phase 5: Notifications & Polish
+
+- [ ] **Notifications**: Implement Database Notifications for "Run Started" (with link) and "Run Completed" (with outcome).
+- [ ] **Frontend**: Ensure "View Run" link in Toast notifications correctly opens the Monitoring Detail view.
+- [ ] **Final Verify**: Run through the `requirements.md` checklist manually.
\ No newline at end of file
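The Phase 1 tasks above pair a partial unique index on `(tenant_id, run_identity_hash)` with an idempotent `ensureRun()`. The sketch below shows how the two could fit together. It is a hedged illustration in Python with in-memory SQLite standing in for PostgreSQL, not the Laravel implementation. The reduced column set, the index name, the `succeeded` status, and the way `run_identity_hash` is derived are assumptions; the patch does not specify them.

```python
import hashlib
import json
import sqlite3


def run_identity_hash(run_type: str, inputs: dict) -> str:
    """Hypothetical derivation: SHA-256 over the run type and canonicalised inputs."""
    payload = json.dumps([run_type, inputs], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE operation_runs (
            id INTEGER PRIMARY KEY,
            tenant_id INTEGER NOT NULL,
            run_identity_hash TEXT NOT NULL,
            status TEXT NOT NULL
        )
        """
    )
    # Partial unique index: uniqueness applies only to active (queued/running) rows,
    # so finished runs with the same identity may accumulate as history.
    conn.execute(
        """
        CREATE UNIQUE INDEX operation_runs_active_identity
        ON operation_runs (tenant_id, run_identity_hash)
        WHERE status IN ('queued', 'running')
        """
    )
    return conn


def ensure_run(conn: sqlite3.Connection, tenant_id: int, identity_hash: str) -> int:
    """Return the id of the active run for this identity, creating it if absent."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO operation_runs (tenant_id, run_identity_hash, status) "
                "VALUES (?, ?, 'queued')",
                (tenant_id, identity_hash),
            )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # The index rejected a duplicate active run: reuse the existing one.
        row = conn.execute(
            "SELECT id FROM operation_runs "
            "WHERE tenant_id = ? AND run_identity_hash = ? "
            "AND status IN ('queued', 'running')",
            (tenant_id, identity_hash),
        ).fetchone()
        if row is None:
            # Under real concurrency the active run may have finished in between;
            # a caller would retry the insert. Not reachable in this single-connection demo.
            raise
        return row[0]
```

Inserting first and falling back to a lookup on a uniqueness violation lets the database arbitrate concurrent starts, which is the property the "concurrency safety (simulated)" test task is meant to check. A select-then-insert order would leave a race window between the two statements.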