docs: unified operations runs specs and plan (054)

Ahmed Darrazi 2026-01-16 19:06:30 +01:00
parent 30ad57baab
commit 48b558db93
11 changed files with 613 additions and 0 deletions


@ -669,3 +669,11 @@ ### Replaced Utilities
| decoration-slice | box-decoration-slice |
| decoration-clone | box-decoration-clone |
</laravel-boost-guidelines>
## Recent Changes
- 054-unify-runs-suitewide: Added PHP 8.4 + Filament v4, Laravel v12, Livewire v3
- 054-unify-runs-suitewide: Added PostgreSQL (`operation_runs` table + JSONB)
## Active Technologies
- PostgreSQL (`operation_runs` table + JSONB) (054-unify-runs-suitewide)


@ -0,0 +1,30 @@
# Requirements Checklist: Unified Operations Runs
## Phase 1 Adoption Set
- [x] `inventory.sync` (Inventory “Sync now”) covered in spec
- [x] `policy.sync` (Policies “Sync now”) covered in spec
- [x] `directory_groups.sync` (Directory → Groups “Sync groups”) covered in spec
- [x] `drift.generate` (Drift “Generate drift now”) covered in spec
- [x] `backup_set.add_policies` (Backup Sets “Add selected”) covered in spec
- [x] `restore.execute` (adapter mode) covered in spec
## Critical Clarifications (Pinned)
- [x] Retention policy defined (90 days default)
- [x] Transition strategy defined (Parallel write: Canonical + Legacy)
- [x] Concurrency enforcement defined (Partial unique index on active runs)
- [x] Initiator model defined (Nullable FK + Name Snapshot)
- [x] Restore integration defined (Physical adapter row pointing to Restore Domain record)
## Functional Requirements (Spec Coverage)
- [x] FR-001 Canonical Operation Run schema defined (see `data-model.md`)
- [x] FR-004 Monitoring List UI specified (filters/sort defined in Spec FR-004)
- [x] FR-005 Monitoring Detail UI specified (content defined in Spec FR-005)
- [x] FR-007 Start surfaces behavior specified (Spec FR-007)
- [x] FR-009 Idempotency (Partial Unique Index) strategy defined (Spec FR-009, Plan)
- [x] FR-015 Notifications for queued/terminal states specified (Spec FR-015)
- [x] FR-016 Tenant isolation rules specified (Spec FR-016)
## Non-Functional (Spec Coverage)
- [x] SC-002 Start confirmation < 2s target defined (Spec SC-002)
- [x] SC-003 Deduplication rate > 99% strategy defined (Spec SC-003)
- [x] SC-004 No secrets in failure logs rule defined (Spec SC-004)


@ -0,0 +1,55 @@
openapi: 3.0.3
info:
  title: TenantPilot Admin Operations Contracts (Feature 054)
  version: 0.1.0
  description: |
    Minimal page-render contracts for the Monitoring/Operations hub.
    These pages must render from the database only (no external tenant calls)
    and display only sanitized failure detail (no secrets/tokens/raw payload dumps).
servers:
  - url: /
paths:
  /admin/t/{tenantExternalId}/bulk-operation-runs:
    get:
      operationId: monitoringOperationsIndex
      summary: Monitoring → Operations (tenant-scoped)
      parameters:
        - name: tenantExternalId
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Page renders successfully.
        '302':
          description: Redirect to login when unauthenticated.
  /admin/t/{tenantExternalId}/bulk-operation-runs/{bulkOperationRunId}:
    get:
      operationId: monitoringOperationsView
      summary: Operation run detail (tenant-scoped)
      parameters:
        - name: tenantExternalId
          in: path
          required: true
          schema:
            type: string
        - name: bulkOperationRunId
          in: path
          required: true
          schema:
            type: integer
      responses:
        '200':
          description: Page renders successfully.
        '302':
          description: Redirect to login when unauthenticated.
        '403':
          description: Forbidden when attempting cross-tenant access.
components: {}


@ -0,0 +1,18 @@
# Routes & URLs
## Monitoring UI
### List Operations
- **Route**: `tenant.monitoring.operations.index`
- **URL**: `/tenants/{tenant}/monitoring/operations`
- **Controller**: Livewire Component (`App\Livewire\Monitoring\OperationsList`)
### View Operation
- **Route**: `tenant.monitoring.operations.show`
- **URL**: `/tenants/{tenant}/monitoring/operations/{run}`
- **Controller**: Livewire Component (`App\Livewire\Monitoring\OperationsDetail`)
## Deep Links
- **Drift**: `/tenants/{tenant}/drift/history/{id}`
- **Inventory**: `/tenants/{tenant}/inventory` (general view; link to a specific timestamp if supported)
- **Restore**: `/tenants/{tenant}/restore/{id}`


@ -0,0 +1,48 @@
# Service Interface: Operation Runs
## `App\Services\OperationRunService`
### `ensureRun`
Idempotently creates or retrieves an active run.
```php
public function ensureRun(
    Tenant $tenant,
    string $type,
    array $inputs,
    ?User $initiator = null
): OperationRun
```
- **Logic**:
  1. Compute `hash = sha256(tenant_id + type + sorted_json(inputs))`.
  2. Try finding active run (`queued` or `running`) with this hash.
  3. If found, return it.
  4. If not found, create new `queued` run.
  5. Return run.
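
Step 1 can be sketched in plain PHP as follows. This is a minimal sketch, not the service's actual implementation: the helper names `canonicalizeInputs` and `runIdentityHash`, the `|` separator, and the JSON flags are all assumptions made for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Recursively sort associative arrays by key so that logically equal
 * inputs serialize to the same JSON string. List arrays keep their order.
 */
function canonicalizeInputs(array $inputs): array
{
    foreach ($inputs as $key => $value) {
        if (is_array($value)) {
            $inputs[$key] = canonicalizeInputs($value);
        }
    }

    if (! array_is_list($inputs)) {
        ksort($inputs);
    }

    return $inputs;
}

/**
 * Hypothetical helper: sha256 over tenant id, run type and canonical JSON.
 * The separator and JSON flags are assumptions, not part of the spec.
 */
function runIdentityHash(int $tenantId, string $type, array $inputs): string
{
    $payload = json_encode(
        canonicalizeInputs($inputs),
        JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES
    );

    return hash('sha256', $tenantId . '|' . $type . '|' . $payload);
}

// Same effective inputs in a different key order produce the same hash.
$a = runIdentityHash(1, 'policy.sync', ['scope' => 'policy', 'policy_id' => 1]);
$b = runIdentityHash(1, 'policy.sync', ['policy_id' => 1, 'scope' => 'policy']);
```

Because list arrays keep their order in this sketch, a selection such as `[1, 2]` hashes differently from `[2, 1]`. Callers that treat selections as sets would likely need to sort ID lists before hashing.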
### `updateRun`
Updates the status/outcome of a run.
```php
public function updateRun(
    OperationRun $run,
    string $status,
    ?string $outcome = null,
    array $summaryCounts = [],
    array $failures = []
): OperationRun
```
### `failRun`
Helper to fail a run immediately.
```php
public function failRun(OperationRun $run, Throwable $e): OperationRun
```
## `App\Jobs\Middleware\TrackOperationRun`
Job middleware that automatically handles the `running` -> `completed` status transition (with outcome `succeeded` or `failed`) for jobs bound to a run.
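
The middleware behaviour described above might look roughly like this. It is a sketch under stated assumptions: the `RunTracker` interface, the `ArrayRunTracker` test double, and the `operationRunId` job property are hypothetical stand-ins for `OperationRunService` and the real job binding. Only the `handle($job, $next)` shape follows Laravel's job middleware convention.

```php
<?php

declare(strict_types=1);

/** Hypothetical seam standing in for OperationRunService (assumption). */
interface RunTracker
{
    public function markRunning(int $runId): void;

    public function markCompleted(int $runId, string $outcome): void;
}

final class TrackOperationRun
{
    public function __construct(private RunTracker $tracker) {}

    /** Follows the job middleware shape: handle($job, $next). */
    public function handle(object $job, Closure $next): void
    {
        // Jobs not bound to a run pass through untouched.
        $runId = $job->operationRunId ?? null;

        if ($runId === null) {
            $next($job);

            return;
        }

        $this->tracker->markRunning($runId);

        try {
            $next($job);
            $this->tracker->markCompleted($runId, 'succeeded');
        } catch (Throwable $e) {
            // Record the failure, then rethrow so the queue still sees it.
            $this->tracker->markCompleted($runId, 'failed');

            throw $e;
        }
    }
}

/** In-memory tracker used only to demonstrate the call order. */
final class ArrayRunTracker implements RunTracker
{
    /** @var list<string> */
    public array $log = [];

    public function markRunning(int $runId): void
    {
        $this->log[] = "{$runId}:running";
    }

    public function markCompleted(int $runId, string $outcome): void
    {
        $this->log[] = "{$runId}:completed:{$outcome}";
    }
}

$tracker = new ArrayRunTracker();
$job = new class {
    public int $operationRunId = 42;
};

(new TrackOperationRun($tracker))->handle($job, function (object $job): void {
    // ... the job's own work would run here ...
});
```

A job that needs to report `partially_succeeded` would probably have to set its outcome itself, since this sketch only distinguishes "returned normally" from "threw".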
## `App\Listeners\SyncRestoreRunToOperation`
Listener for `RestoreRun` events to update the shadow `OperationRun`.


@ -0,0 +1,52 @@
# Data Model: Unified Operations Runs
## Entities
### `OperationRun`
Canonical record for all long-running tenant operations.
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | BigInt | Yes | Primary Key |
| `tenant_id` | BigInt | Yes | FK to Tenants |
| `user_id` | BigInt | No | FK to Users (Initiator). Null for system/scheduler. |
| `initiator_name` | String | Yes | Snapshot of user name or "System". |
| `type` | String | Yes | Stable taxonomy, e.g., `inventory.sync`. |
| `status` | String | Yes | Lifecycle state: `queued`, `running`, `completed`. |
| `outcome` | String | Yes | Result bucket: `pending`, `succeeded`, `partially_succeeded`, `failed`, `cancelled`. |
| `run_identity_hash` | String | Yes | Deterministic hash for idempotency. |
| `summary_counts` | JSONB | No | `{ "total": 10, "success": 8, "failed": 2, "skipped": 0 }` |
| `failure_summary` | JSONB | No | List of sanitized errors: `[{ "code": "GraphError", "message": "Throttled", "count": 1 }]` |
| `context` | JSONB | No | Run-specific metadata. e.g., `{ "restore_run_id": 123, "selection": [...] }` |
| `started_at` | Timestamp | No | When execution began. |
| `completed_at` | Timestamp | No | When execution finished. |
| `created_at` | Timestamp | Yes | |
| `updated_at` | Timestamp | Yes | |
**Indexes**:
- `(tenant_id, run_identity_hash)` UNIQUE WHERE status IN ('queued', 'running')
- `(tenant_id, type, created_at)` for filtering/sorting
- `(tenant_id, created_at)` for default sort
### `RestoreRun` (Existing)
Remains the domain source of truth for Restore.
- Linked via `OperationRun.context['restore_run_id']`.
- `OperationRun` mirrors `RestoreRun` status/outcome.
## Enums
### `OperationRunStatus`
- `queued`
- `running`
- `completed`
### `OperationRunOutcome`
- `pending` (default when running/queued)
- `succeeded`
- `partially_succeeded`
- `failed`
- `cancelled`
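
The two enums above could be expressed as PHP backed enums along these lines. The case values come from the lists above; the helper methods `isActive()` and `isTerminal()` are assumptions added for illustration.

```php
<?php

declare(strict_types=1);

enum OperationRunStatus: string
{
    case Queued = 'queued';
    case Running = 'running';
    case Completed = 'completed';

    /** Active runs are the ones covered by the partial unique index. */
    public function isActive(): bool
    {
        return $this !== self::Completed;
    }
}

enum OperationRunOutcome: string
{
    case Pending = 'pending';
    case Succeeded = 'succeeded';
    case PartiallySucceeded = 'partially_succeeded';
    case Failed = 'failed';
    case Cancelled = 'cancelled';

    /** `pending` is the only non-terminal outcome. */
    public function isTerminal(): bool
    {
        return $this !== self::Pending;
    }
}
```

Keeping lifecycle status and outcome as separate types makes it harder to write an outcome value such as `failed` into the `status` column by accident.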
## Relationships
- `OperationRun` belongs to `Tenant`.
- `OperationRun` belongs to `User` (optional).


@ -0,0 +1,76 @@
# Implementation Plan: Unified Operations Runs Suitewide
**Branch**: `feat/054-unify-operations-runs-suitewide` | **Date**: 2026-01-16 | **Spec**: [Spec Link](spec.md)
**Input**: Feature specification from `specs/054-unify-runs-suitewide/spec.md`
## Summary
This feature unifies long-running tenant operations (e.g., Inventory Sync, Drift Generation) into a single canonical `operation_runs` table. This enables a consistent "Monitoring -> Operations" view for all tenant activities. Legacy run tables will be maintained in parallel for now (Parallel Write Transition). `RestoreRun` remains a domain-specific record but will be mirrored into `operation_runs` via an adapter pattern.
## Technical Context
**Language/Version**: PHP 8.4
**Primary Dependencies**: Filament v4, Laravel v12, Livewire v3
**Storage**: PostgreSQL (`operation_runs` table + JSONB)
**Testing**: Pest v4 (Feature tests for Service, Livewire tests for UI)
**Target Platform**: Linux server (Docker/Dokploy)
**Project Type**: Web Application (Laravel Monolith)
**Performance Goals**: Start operation < 2s. List runs < 200ms.
**Constraints**: Tenant isolation is paramount. No cross-tenant data leakage.
**Scale/Scope**: ~50-100 runs/day per tenant. Retention 90 days.
## Constitution Check
*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
- [x] Inventory-first: N/A (this is about tracking operations, not inventory state itself)
- [x] Read/write separation: Monitoring is read-only. Starts are explicit writes.
- [x] Graph contract path: N/A (this feature tracks runs, doesn't call Graph directly)
- [x] Deterministic capabilities: N/A
- [x] Tenant isolation: `operation_runs` has `tenant_id`. Policies ensure scope.
- [x] Automation: Idempotency enforced via DB index.
- [x] Data minimization: No secrets in `failure_summary`.
## Project Structure
### Documentation (this feature)
```text
specs/054-unify-runs-suitewide/
├── plan.md # This file
├── research.md # Research findings
├── data-model.md # Database schema
├── quickstart.md # Dev guide
├── contracts/ # Service interfaces & routes
└── tasks.md # Task breakdown
```
### Source Code (repository root)
```text
app/
├── Models/
│   └── OperationRun.php
├── Services/
│   └── OperationRunService.php
├── Livewire/
│   └── Monitoring/
│       ├── OperationsList.php
│       └── OperationsDetail.php
├── Jobs/
│   └── Middleware/
│       └── TrackOperationRun.php
└── Listeners/
    └── SyncRestoreRunToOperation.php
database/migrations/
└── YYYY_MM_DD_create_operation_runs_table.php
```
**Structure Decision**: Standard Laravel Service/Model/Livewire pattern.
## Complexity Tracking
| Violation | Why Needed | Simpler Alternative Rejected Because |
|-----------|------------|-------------------------------------|
| None | | |


@ -0,0 +1,49 @@
# Quickstart: Adding a New Operation
## 1. Register Run Type
Add your new type constant to `App\Enums\OperationRunType` (if using Enums) or just use the string convention `resource.action`.
## 2. Implement Idempotency Inputs
Define what makes a run "unique" for your feature.
- Example: `['scope' => 'full']` vs `['scope' => 'policy', 'policy_id' => 1]`.
## 3. Use `OperationRunService`
In your Start Action (Controller/Livewire):
```php
// 1. Ensure Run
$run = $service->ensureRun($tenant, 'my_resource.action', $inputs, auth()->user());
// 2. Dispatch Job (if new)
if ($run->wasRecentlyCreated) {
    MyJob::dispatch($run, $inputs);
}
// 3. Return View Link
return redirect()->route('tenant.monitoring.operations.show', [$tenant, $run]);
```
## 4. Instrument Job
In your Job:
```php
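public function handle(OperationRunService $service): void
{
    // Mark the run as running.
    $service->updateRun($this->run, status: 'running');

    try {
        // ... do work ...

        // Success: lifecycle status is `completed`; the result goes in `outcome`.
        $service->updateRun(
            $this->run,
            status: 'completed',
            outcome: 'succeeded',
            summaryCounts: ['total' => 100, 'success' => 100],
        );
    } catch (\Throwable $e) {
        // Failure: delegate to the service helper.
        $service->failRun($this->run, $e);
    }
}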
```


@ -0,0 +1,65 @@
# Research: Unified Operations Runs Suitewide
## 1. Technical Context & Unknowns
**Unknowns Resolved**:
- **Transition Strategy**: Parallel write. We will maintain existing legacy tables (e.g., `inventory_sync_runs`, `restore_runs`) for now but strictly use `operation_runs` for the Monitoring UI.
- **Restore Adapter**: `RestoreRun` remains the domain source of truth. An `OperationRun` record will be created as a "shadow" or "adapter" record. This requires hooking into `RestoreRun` lifecycle events or the service layer to keep them in sync.
- **Run Logic Location**: Existing jobs like `RunInventorySyncJob` will be updated to manage the `OperationRun` state.
- **Concurrency**: Enforced by partial unique index on `(tenant_id, run_identity_hash)` where status is active (`queued`, `running`).
## 2. Technology Choices
| Area | Decision | Rationale | Alternatives |
|------|----------|-----------|--------------|
| **Schema** | `operation_runs` table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (Complex, harder to paginate/sort efficiently). |
| **Restore Integration** | Physical Adapter Row | Decouples Monitoring from Restore domain specifics. Allows uniform "list all runs" queries. The `context` JSON column will store `{ "restore_run_id": ... }`. | Polymorphic relation (Overhead for a single exception). |
| **Idempotency** | DB Partial Unique Index | Hard guarantee against race conditions. Simpler than distributed locks (Redis) which can expire or fail. | Redis Lock (Soft guarantee), Application check (Race prone). |
| **Initiator** | Nullable FK + Name | Handles both Users (FK) and System/Scheduler (Name "System") uniformly. | Polymorphic relation (Overkill for simple auditing). |
## 3. Implementation Patterns
### Canonical Run Lifecycle
1. **Start Request**:
   - Compute `run_identity_hash` from inputs.
   - Attempt `INSERT` into `operation_runs` (ignore conflict if active).
   - If active run exists, return it (Idempotency).
   - If new, dispatch Job.
2. **Job Execution**:
   - Update status to `running`.
   - Perform work.
   - Update status to `completed` with outcome `succeeded`/`partially_succeeded`/`failed`.
3. **Restore Adapter**:
   - When `RestoreRun` is created, create `OperationRun` (queued/running).
   - When `RestoreRun` updates (status change), update `OperationRun`.
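
The "insert first, fall back to the existing active run" flow from the Start Request step can be illustrated without a database. In this sketch `InMemoryRunTable` stands in for `operation_runs` and its partial unique index, and `DuplicateActiveRun` stands in for the unique-violation error the database would raise. Both names, and this simplified `ensureRun` function, are hypothetical.

```php
<?php

declare(strict_types=1);

final class DuplicateActiveRun extends RuntimeException {}

/** In-memory stand-in for `operation_runs` plus its partial unique index. */
final class InMemoryRunTable
{
    /** @var array<int, array{id: int, tenant_id: int, hash: string, status: string}> */
    private array $rows = [];

    public function insertQueued(int $tenantId, string $hash): int
    {
        if ($this->findActive($tenantId, $hash) !== null) {
            // What the unique index would raise on a conflicting INSERT.
            throw new DuplicateActiveRun();
        }

        $id = count($this->rows) + 1;
        $this->rows[$id] = [
            'id' => $id,
            'tenant_id' => $tenantId,
            'hash' => $hash,
            'status' => 'queued',
        ];

        return $id;
    }

    public function findActive(int $tenantId, string $hash): ?int
    {
        foreach ($this->rows as $row) {
            if ($row['tenant_id'] === $tenantId
                && $row['hash'] === $hash
                && in_array($row['status'], ['queued', 'running'], true)) {
                return $row['id'];
            }
        }

        return null;
    }

    public function setStatus(int $id, string $status): void
    {
        $this->rows[$id]['status'] = $status;
    }
}

/** @return array{0: ?int, 1: bool} run id, and whether this call created it */
function ensureRun(InMemoryRunTable $table, int $tenantId, string $hash): array
{
    try {
        return [$table->insertQueued($tenantId, $hash), true];
    } catch (DuplicateActiveRun) {
        // Double click or lost race: reuse the active run instead.
        return [$table->findActive($tenantId, $hash), false];
    }
}

$table = new InMemoryRunTable();
[$first, $createdFirst] = ensureRun($table, 1, 'abc');
[$second, $createdSecond] = ensureRun($table, 1, 'abc');

// Once the first run completes, the same identity may start a new run.
$table->setStatus($first, 'completed');
[$third, $createdThird] = ensureRun($table, 1, 'abc');
```

Only the caller whose insert succeeded should dispatch the job. Against a real database there is a small window in which the active run completes between the conflict and the lookup, so production code would probably need to handle a `null` lookup, for example by retrying the insert.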
### Data Model
```sql
CREATE TABLE operation_runs (
    id BIGSERIAL PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    user_id BIGINT NULL REFERENCES users(id), -- Initiator
    initiator_name VARCHAR(255) NOT NULL, -- "John Doe" or "System"
    type VARCHAR(255) NOT NULL, -- "inventory.sync"
    status VARCHAR(50) NOT NULL, -- queued, running, completed
    outcome VARCHAR(50) NOT NULL, -- pending, succeeded, partially_succeeded, failed, cancelled
    run_identity_hash VARCHAR(64) NOT NULL, -- SHA256(tenant_id + type + sorted_json(inputs))
    summary_counts JSONB DEFAULT '{}', -- { success: 10, failed: 2 }
    failure_summary JSONB DEFAULT '[]', -- [{ code: "ERR_TIMEOUT", message: "..." }]
    context JSONB DEFAULT '{}', -- { selection: [...], restore_run_id: 123 }
    started_at TIMESTAMP NULL,
    completed_at TIMESTAMP NULL,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);
CREATE UNIQUE INDEX operation_runs_active_unique
    ON operation_runs (tenant_id, run_identity_hash)
    WHERE status IN ('queued', 'running');
```
## 4. Risks & Mitigations
- **Risk**: Desync between `RestoreRun` and `OperationRun`.
  - **Mitigation**: Use model observers or service-layer wrapping to keep updates as close to atomic as possible, or accept slight eventual consistency (Monitoring may lag milliseconds behind the Restore UI).
- **Risk**: Legacy runs not appearing.
  - **Mitigation**: We are NOT backfilling legacy runs. Only runs created after deployment will appear in the new Monitoring UI. This is acceptable for "Phase 1".


@ -0,0 +1,148 @@
# Feature Specification: Unified Operations Runs Suitewide (Except Restore Domain Model) (054)
**Feature Branch**: `feat/054-unify-operations-runs-suitewide`
**Created**: 2026-01-16
**Status**: Draft
**Input**: User description: "Eliminate run sprawl by adopting one canonical tenant-scoped operation run record for long-running actions across the product, surfaced consistently in Monitoring → Operations, while keeping restore as a separate domain workflow that is still visible via an adapter entry."
## Clarifications
### Session 2026-01-16
- Q: Which default retention should 054 set for canonical Operation Runs? → A: 90 days
- Q: Transition strategy in 054: do we write canonical runs in parallel with the legacy run tables, or replace them immediately? → A: Parallel write (canonical + legacy)
- Q: For `restore.execute`, the spec mentions it acts as an "adapter entry" linking to the restore domain record. How should this be implemented? → A: Physical Row (Create a physical row in `operation_runs` that points to the restore record).
- Q: How should concurrency and deduplication (FR-009) be enforced at the database level? → A: Partial Unique Index (unique constraint on `tenant_id, run_identity_hash` where status is `queued` or `running`).
- Q: How should the `initiator` be modeled to support both users and system processes (FR-001)? → A: Nullable FK + Name Snapshot (`user_id` nullable FK + required `initiator_name` string).
## User Scenarios & Testing *(mandatory)*
### User Story 1 - See Every Supported Operation in Monitoring (Priority: P1)
As an operator, I want Monitoring → Operations to show all supported long-running operations for my tenant in one consistent list and detail view, so I can quickly answer what ran, who started it, whether it succeeded/partially succeeded/failed, and where to look next.
**Why this priority**: This is the core value: a single, tenant-scoped source of truth for operational visibility.
**Independent Test**: Trigger at least one run of each Phase 1 run producer, then verify each appears in Monitoring with consistent status/outcome semantics, safe failure summaries, and context links.
**Acceptance Scenarios**:
1. **Given** I am signed into tenant A, **When** I open Monitoring → Operations, **Then** I see only tenant A runs and can filter by run type, outcome bucket, time range, and initiator.
2. **Given** multiple run types exist, **When** I filter to `inventory.sync`, **Then** only inventory sync runs are shown.
3. **Given** a run exists, **When** I open its detail view, **Then** I can see initiator, run type, outcome bucket, timestamps, summary counts (if applicable), sanitized failures (if any), and links to relevant feature context/results.
4. **Given** restore execution exists, **When** I open Monitoring → Operations, **Then** I can see a `restore.execute` entry that links to the existing restore record (restore history remains owned by the restore domain record).
5. **Given** I am a `Readonly` user in tenant A, **When** I view Monitoring → Operations, **Then** I can view runs and details but I do not see any start/rerun/cancel/delete controls.
6. **Given** I attempt to access a run from another tenant (direct link or list), **When** I request it, **Then** access is denied and no run details are disclosed.
---
### User Story 2 - Start Operations Without Blocking (Priority: P2)
As an operator, when I start a supported operation, I want immediate confirmation and a “View run” link so I can continue working while the operation runs in the background.
**Why this priority**: Removes long-running requests/timeouts and standardizes how operations are started and observed.
**Independent Test**: Start each Phase 1 operation from its owning UI and confirm the start returns quickly, includes “View run”, and the run progresses through queued/running into a terminal outcome.
**Acceptance Scenarios**:
1. **Given** I have permission to start a Phase 1 operation in tenant A, **When** I start it, **Then** I receive immediate confirmation with a “View run” link and the run is visible as queued or running.
2. **Given** I am a `Readonly` user in tenant A, **When** I attempt to start any Phase 1 operation, **Then** the system denies the request and does not create a new run.
3. **Given** the run reaches a terminal outcome, **When** that occurs, **Then** the initiating user receives an in-app notification including a short summary and a “View run” link.
4. **Given** background processing is unavailable, **When** I attempt to start an operation, **Then** I receive a clear message and the system MUST NOT claim it was queued.
---
### User Story 3 - Duplicate Starts Reuse the Same Active Run (Priority: P3)
As an operator, I want accidental double-starts (double clicks, two admins, retries) to reuse the same active run so duplicate background work is avoided and results remain auditable.
**Why this priority**: Reduces load, prevents confusing duplicate outcomes, and makes operations safer under concurrency.
**Independent Test**: Start the same operation twice with identical effective inputs while the first is queued/running and verify the system reuses the active run.
**Acceptance Scenarios**:
1. **Given** an identical run is queued/running for a tenant, **When** another start request is made with the same effective inputs, **Then** the system reuses the existing run and does not start a second one.
2. **Given** two starts happen at nearly the same time, **When** the system resolves the race, **Then** at most one active run exists for that identity and both users are directed to it.
### Edge Cases
- Background execution unavailable: start fails fast with a clear message; the system MUST NOT create misleading “queued” runs.
- Partial processing: at least one success and at least one failure yields “partially succeeded”, with per-item failures when applicable.
- Large run history: Monitoring remains usable with filters and defaults (recent runs, last 30 days).
- Permissions revoked mid-run: the run continues; visibility is evaluated at time of access.
## Requirements *(mandatory)*
**Constitution alignment (required):** If this feature introduces any external tenant API calls or any write/change behavior,
the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.
### Scope & Assumptions
**Phase 1 adoption set (must be implemented):**
- `inventory.sync` (Inventory “Sync now”)
- `policy.sync` (Policies “Sync now”)
- `directory_groups.sync` (Directory → Groups “Sync groups”)
- `drift.generate` (Drift “Generate drift now” / auto-on-open when eligible)
- `backup_set.add_policies` (Backup Sets “Add selected” / “Add policies”)
**Restore visibility (adapter only):**
- `restore.execute` appears as a canonical run entry that links to an existing restore domain record.
- Restore execution history remains owned by the restore domain record (not replaced in Phase 1).
**Out of scope for 054 (explicit):**
- Cross-tenant compare/promotion
- UI redesign/styling polish (separate UI polish work)
- Cancel/rerun/delete controls inside Monitoring hub (hub stays view-only)
- Replacing restore domain records with canonical runs
- A full settings UI for retention/notifications/etc.
**Assumptions (defaults to remove ambiguity in Phase 1):**
- Canonical run history retention defaults to 90 days, with no user-facing retention configuration in 054.
- System-initiated runs (if any) do not notify users by default in Phase 1.
- Transition strategy: write canonical runs in parallel with any existing legacy per-module run tables (where they exist); Monitoring uses canonical runs as the source of truth immediately.
### Functional Requirements
- **FR-001 Canonical Operation Run**: System MUST represent each supported operation execution as a canonical, tenant-scoped operation run record that captures initiator (nullable `user_id` FK + `initiator_name` string), run type, lifecycle status/timestamps, outcome bucket, summary counts (when applicable), safe failure summaries, an idempotency identity for dedupe, and a safe context payload referencing “what this run was about”.
- **FR-002 Run taxonomy**: Run type MUST be stable and follow `"<resource>.<action>"`.
- **FR-003 Phase 1 run types**: Phase 1 run types MUST include `inventory.sync`, `policy.sync`, `directory_groups.sync`, `drift.generate`, `backup_set.add_policies`, plus `restore.execute` implemented as a physical `operation_runs` record (adapter) pointing to the domain entity.
- **FR-004 Monitoring lists all canonical runs**: Monitoring → Operations MUST list canonical runs for the active tenant with filters for run type, outcome bucket, time range, and initiator; default sort is most recent first; default time window is last 30 days.
- **FR-005 Run detail**: Run detail MUST show initiator, run type, outcome bucket, timestamps (created/started/finished), summary counts (when applicable), sanitized failures (including per-item failures when applicable), and contextual links to owning feature surfaces/results.
- **FR-006 View-only hub**: Monitoring hub MUST be view-only (no start/rerun/cancel/delete controls) and MUST link back to owning feature surfaces.
- **FR-007 Start surfaces always enqueue**: Every Phase 1 start surface MUST authorize start, create/reuse a canonical run (dedupe), dispatch background execution, and return immediately with confirmation + “View run”.
- **FR-008 No remote work in interactive request**: Start surfaces MUST NOT perform remote work inline; long-running work happens in background execution.
- **FR-009 Deterministic idempotency**: For each run type, the system MUST define a deterministic identity for “identical run” based on tenant + effective inputs; initiator MUST NOT be part of identity. **Enforcement**: Uniqueness MUST be enforced via a partial unique index on `(tenant_id, run_identity_hash)` where status is `queued` or `running`.
- **FR-010 Phase 1 identity rules**: Identity rules MUST be defined at least as follows:
  - `inventory.sync`: tenant + selection scope
  - `policy.sync`: tenant + effective policy scope
  - `directory_groups.sync`: tenant + selection (Phase 1 default: “all groups”)
  - `backup_set.add_policies`: tenant + backup set + selected policies + option flags (if exposed)
  - `drift.generate`: tenant + scope key + baseline/current comparison inputs
- **FR-011 Outcome buckets**: Monitoring MUST present consistent outcome buckets: `queued`, `running`, `succeeded`, `partially succeeded`, `failed`.
- **FR-012 Partial vs failed**: “Partially succeeded” means at least one success and at least one failure; “Failed” means zero successes or cannot proceed.
- **FR-013 Failure details are safe + useful**: Failures MUST be persisted and displayed as stable reason codes and short sanitized messages; failures MUST NOT include secrets/tokens/credentials/PII or full external payload dumps.
- **FR-014 Related links**: Run detail MUST include contextual links where applicable (e.g., drift findings, backup set, inventory results, directory groups, restore detail for `restore.execute`).
- **FR-015 Notifications**: System MUST emit in-app notifications for “queued” (after start) and terminal outcomes for Phase 1 runs; notifications MUST include a short summary and a “View run” link; recipients are the initiating user only.
- **FR-016 Tenant isolation**: All run list/detail access MUST be tenant-scoped; cross-tenant access MUST be denied without disclosing run details.
- **FR-017 No render-time remote calls**: Monitoring pages MUST be render-safe and MUST NOT depend on external service calls during render.
- **FR-018 Roles & permissions**: Roles `Owner`, `Manager`, `Operator`, and `Readonly` MUST be able to view runs; only `Owner`, `Manager`, `Operator` may start operations; `Readonly` is strictly view-only.
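
The FR-012 rule for telling “partially succeeded” from “failed” can be sketched as a small pure function. The function name `outcomeFromCounts` is hypothetical; the returned strings are the outcome values used in the data model.

```php
<?php

declare(strict_types=1);

/**
 * Derive the terminal outcome from per-item counts, following FR-012:
 * partial = at least one success and at least one failure,
 * failed  = zero successes.
 */
function outcomeFromCounts(int $success, int $failed): string
{
    if ($success > 0 && $failed === 0) {
        return 'succeeded';
    }

    if ($success > 0 && $failed > 0) {
        return 'partially_succeeded';
    }

    // Zero successes, including the "nothing was processed" case.
    return 'failed';
}
```

Read literally, FR-012 also classifies a run with zero items as failed. Whether an empty run should instead count as succeeded is a product decision this sketch does not settle.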
### Key Entities *(include if feature involves data)*
- **Canonical Operation Run**: A tenant-scoped record representing the lifecycle of a long-running operation, including run type, initiator (nullable `user_id` FK + `initiator_name` string), lifecycle state/timestamps, outcome bucket, summary counts, safe failure summaries, idempotency identity (uniqueness enforced by DB index on active runs), and safe context references.
- **Restore domain record (exception)**: Restore remains a domain workflow record with richer semantics and history. Monitoring shows restore activity through a physical `operation_runs` row (adapter) that links back to the restore record, without replacing it.
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: Operators can answer “what ran, when, and did it succeed?” for any Phase 1 run in under 1 minute using Monitoring → Operations.
- **SC-002**: Starting a Phase 1 operation returns confirmation + “View run” link within 2 seconds under normal conditions.
- **SC-003**: Duplicate starts reuse the same active run in at least 99% of attempts under normal conditions.
- **SC-004**: No secrets/tokens/credentials/PII appear in persisted failures or notifications (verified by tests).
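
One way to approach FR-013 and SC-004 is to redact and aggregate failures before they are persisted. The sketch below is illustrative only: `buildFailureSummary`, the redaction patterns, and the 120-byte message limit are assumptions, and a real implementation would need patterns matched to the secrets the product actually handles.

```php
<?php

declare(strict_types=1);

/**
 * Redact obvious secrets, shorten messages, and group identical failures.
 *
 * @param list<array{code: string, message: string}> $failures
 * @return list<array{code: string, message: string, count: int}>
 */
function buildFailureSummary(array $failures, int $maxMessageLength = 120): array
{
    $summary = [];

    foreach ($failures as $failure) {
        // Assumed patterns: bearer tokens and key=value style credentials.
        $message = (string) preg_replace(
            '/Bearer\s+[A-Za-z0-9._~+\/=-]+/i',
            'Bearer [redacted]',
            $failure['message']
        );
        $message = (string) preg_replace(
            '/\b(password|secret|token|client_secret)\s*[=:]\s*\S+/i',
            '$1=[redacted]',
            $message
        );

        // Byte-based truncation; assumes mostly ASCII messages.
        $message = substr($message, 0, $maxMessageLength);

        $key = $failure['code'] . '|' . $message;
        $summary[$key] ??= ['code' => $failure['code'], 'message' => $message, 'count' => 0];
        $summary[$key]['count']++;
    }

    return array_values($summary);
}

$summary = buildFailureSummary([
    ['code' => 'GraphError', 'message' => 'Throttled'],
    ['code' => 'GraphError', 'message' => 'Throttled'],
    ['code' => 'AuthError', 'message' => 'Rejected Bearer abc.def-123'],
]);
```

The result has the same shape as the `failure_summary` column example in the data model (`code`, `message`, `count`). Pattern-based redaction is a safety net rather than a guarantee, so tests for SC-004 would still need to assert on real failure paths.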


@ -0,0 +1,64 @@
# Tasks: Unified Operations Runs Suitewide
**Feature**: `054-unify-runs-suitewide`
**Spec**: `specs/054-unify-runs-suitewide/spec.md`
## Phase 1: Foundation (DB & Service)
- [ ] **Migration**: Create `operation_runs` table with partial unique index on `(tenant_id, run_identity_hash)` where `status` is `queued` or `running`.
- [ ] **Model**: Create `OperationRun` model with casts (JSONB for summaries/context), relationship to `Tenant` and `User`.
- [ ] **Service**: Implement `OperationRunService::ensureRun()` (idempotent creation) and `updateRun()` methods.
- [ ] **Test**: Feature test for `ensureRun` verifying idempotency (same hash = same run) and concurrency safety (simulated).
- [ ] **Test**: Feature test for `updateRun` verifying status transitions and history logging (if any).
- [ ] **Job Middleware**: Create `TrackOperationRun` middleware to automatically handle job success/failure updates for jobs using this system.
- [ ] **Retention**: Create a daily scheduled job to prune `operation_runs` older than 90 days.
## Phase 2: Monitoring UI (Read-Only)
- [ ] **Page**: Create Filament Page `Monitoring/Operations` (List) strictly scoped to current tenant.
- [ ] **Table**: Implement `OperationRun` table with columns: Status (Badge), Operation Type, Initiator, Started At, Duration, Outcome.
- [ ] **Filters**: Add table filters for `Type`, `Outcome`, `Date Range`, `Initiator`.
- [ ] **Detail View**: Create "View Run" modal or separate page showing:
  - Summary counts (Success/Fail/Total)
  - Failure list (Sanitized codes/messages)
  - Context JSON (Debug info)
  - Timeline (Created/Started/Finished)
- [ ] **Test**: Livewire test verifying `Readonly` users can see table but no actions.
- [ ] **Test**: Verify cross-tenant access is blocked.
## Phase 3: Producer Migration (Parallel Write)
### Inventory Sync (`inventory.sync`)
- [ ] **Refactor**: Update `RunInventorySyncJob` dispatch logic to call `OperationRunService::ensureRun()` first.
- [ ] **Refactor**: Update Job to use `TrackOperationRun` middleware (or manual updates) to sync status to `operation_runs`.
- [ ] **Verify**: Ensure legacy `inventory_sync_runs` is still written to (if legacy UI depends on it) OR confirm legacy UI is replaced. *Decision: Parallel write as per spec.*
### Policy Sync (`policy.sync`)
- [ ] **Refactor**: Update Policy Sync start logic to use `OperationRunService`.
- [ ] **Refactor**: Instrument Policy Sync job to update `operation_runs`.
### Directory Groups Sync (`directory_groups.sync`)
- [ ] **Refactor**: Update Group Sync start logic to use `OperationRunService`.
- [ ] **Refactor**: Instrument Group Sync job to update `operation_runs`.
### Drift Generation (`drift.generate`)
- [ ] **Refactor**: Update Drift Generation start logic to use `OperationRunService`.
- [ ] **Refactor**: Instrument Drift job to update `operation_runs`.
### Backup Set (`backup_set.add_policies`)
- [ ] **Refactor**: Update "Add Policies" action to use `OperationRunService`.
## Phase 4: Restore Adapter
- [ ] **Listener**: Create `SyncRestoreRunToOperation` listener observing `RestoreRun` events (`created`, `updated`).
- [ ] **Logic**: Map `RestoreRun` status/outcomes to `OperationRun` schema.
  - `RestoreRun` created -> `OperationRun` created (queued/running).
  - `RestoreRun` updated -> `OperationRun` updated.
- [ ] **Context**: Store `{"restore_run_id": <id>}` in `OperationRun.context`.
- [ ] **Test**: Verify creating a `RestoreRun` automatically spawns a shadow `OperationRun`.
## Phase 5: Notifications & Polish
- [ ] **Notifications**: Implement Database Notifications for "Run Started" (with link) and "Run Completed" (with outcome).
- [ ] **Frontend**: Ensure "View Run" link in Toast notifications correctly opens the Monitoring Detail view.
- [ ] **Final Verify**: Run through the `requirements.md` checklist manually.