# Research: Unified Operations Runs Suitewide

## 1. Technical Context & Unknowns

**Unknowns Resolved**:

- **Transition Strategy**: Parallel write. We will maintain existing legacy tables (e.g., `inventory_sync_runs`, `restore_runs`) for now but strictly use `operation_runs` for the Monitoring UI.
- **Restore Adapter**: `RestoreRun` remains the domain source of truth. An `OperationRun` adapter row will be created once a restore run reaches `previewed` (and later statuses), and will be kept in sync via `RestoreRun` lifecycle events or service-layer wrapping.
- **Run Logic Location**: Existing jobs like `RunInventorySyncJob` will be updated to manage the `OperationRun` state.
- **Concurrency**: Enforced by partial unique index on `(tenant_id, run_identity_hash)` where status is active (`queued`, `running`).
- **Dispatch Failure Semantics**: If queue dispatch fails, the system will immediately complete the run as `failed` (e.g., `queue.dispatch_failed`) and show a clear UI message (never leaving misleading queued runs).
- **Notifications on Dedupe**: Only the original initiator (`operation_runs.user_id`) receives queued/terminal notifications; reusers of an active run do not get additional notifications.

## 2. Technology Choices

| Area | Decision | Rationale | Alternatives |
|------|----------|-----------|--------------|
| **Schema** | `operation_runs` table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (Complex, harder to paginate/sort efficiently). |
| **Restore Integration** | Physical Adapter Row | Decouples Monitoring from Restore domain specifics. Allows uniform "list all runs" queries. The `context` JSON column will store `{ "restore_run_id": ... }`. | Polymorphic relation (Overhead for a single exception). |
| **Idempotency** | DB Partial Unique Index | Hard guarantee against race conditions. Simpler than distributed locks (Redis) which can expire or fail. | Redis Lock (Soft guarantee), Application check (Race prone). |
| **Initiator** | Nullable FK + Name | Handles both Users (FK) and System/Scheduler (Name "System") uniformly. | Polymorphic relation (Overkill for simple auditing). |

## 3. Implementation Patterns

### Canonical Run Lifecycle

1. **Start Request**:
   - Compute `run_identity_hash` from inputs.
   - Attempt `INSERT` into `operation_runs` (idempotent; enforced by partial unique index for active runs).
   - If an active run exists, return it (Idempotency).
   - If new, dispatch the background Job.
   - If dispatch fails, immediately mark the run `status=completed`, `outcome=failed` with a safe failure code such as `queue.dispatch_failed`.
2. **Job Execution**:
   - Update status to `running`.
   - Perform work.
   - Update status to `completed` and set terminal outcome (`succeeded` / `partially_succeeded` / `failed` / `cancelled`).
3. **Restore Adapter**:
   - Create the adapter row only once `RestoreRunStatus=previewed` (or later) is reached.
   - Map `RestoreRunStatus=previewed` to `OperationRun.status=queued` and `OperationRun.outcome=pending`.
   - Keep the adapter updated as the restore progresses:
     - `queued` → `status=queued`, `outcome=pending`
     - `running` → `status=running`, `outcome=pending`
     - `completed` → `status=completed`, `outcome=succeeded`
     - `partial` → `status=completed`, `outcome=partially_succeeded`
     - `failed` → `status=completed`, `outcome=failed`
     - `cancelled` → `status=completed`, `outcome=cancelled`

### Data Model

```sql
CREATE TABLE operation_runs (
    id BIGSERIAL PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    user_id BIGINT NULL REFERENCES users(id), -- Initiator
    initiator_name VARCHAR(255) NOT NULL, -- "John Doe" or "System"
    type VARCHAR(255) NOT NULL, -- "inventory.sync"
    status VARCHAR(50) NOT NULL, -- queued, running, completed
    outcome VARCHAR(50) NOT NULL, -- pending, succeeded, partially_succeeded, failed, cancelled
    run_identity_hash VARCHAR(64) NOT NULL, -- SHA256(tenant_id + inputs)
    summary_counts JSONB DEFAULT '{}', -- { success: 10, failed: 2 }
    failure_summary JSONB DEFAULT '[]', -- [{ code: "ERR_TIMEOUT", message: "..." }]
    context JSONB DEFAULT '{}', -- { selection: [...], restore_run_id: 123 }
    started_at TIMESTAMP NULL,
    completed_at TIMESTAMP NULL,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE UNIQUE INDEX operation_runs_active_unique
ON operation_runs (tenant_id, run_identity_hash)
WHERE status IN ('queued', 'running');
```

## 4. Risks & Mitigations

- **Risk**: Desync between `RestoreRun` and `OperationRun`.
  - **Mitigation**: Use model observers or service-layer wrapping to ensure atomic-like updates, or accept slight eventual consistency (Monitoring might lag ms behind Restore UI).
- **Risk**: Legacy runs not appearing.
  - **Mitigation**: We are NOT backfilling legacy runs. Only new runs after deployment will appear in the new Monitoring UI. This is acceptable for "Phase 1".
- **Risk**: Confusion about `queued` for restore `previewed`.
  - **Mitigation**: Document that `restore.execute` appears from `previewed` onward and uses `queued/pending` until execution begins; Monitoring remains view-only and links to the restore domain detail.
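
The Start Request step of the canonical run lifecycle and the restore status mapping can be illustrated with a small, self-contained sketch. It is not the production implementation: it uses Python with an in-memory SQLite database as a stand-in for PostgreSQL (SQLite also supports partial unique indexes), a trimmed-down `operation_runs` table, and hypothetical helper names (`make_db`, `run_identity_hash`, `start_run`, `map_restore_status`). Including `type` in the hash and the JSON canonicalization are assumptions made for illustration only.

```python
import hashlib
import json
import sqlite3

# Restore status -> (OperationRun.status, OperationRun.outcome), as listed in
# the lifecycle section. Statuses before `previewed` have no adapter row.
RESTORE_STATUS_MAP = {
    "previewed": ("queued", "pending"),
    "queued": ("queued", "pending"),
    "running": ("running", "pending"),
    "completed": ("completed", "succeeded"),
    "partial": ("completed", "partially_succeeded"),
    "failed": ("completed", "failed"),
    "cancelled": ("completed", "cancelled"),
}


def map_restore_status(restore_status):
    """Return (status, outcome), or None if no adapter row should exist yet."""
    return RESTORE_STATUS_MAP.get(restore_status)


def make_db():
    """Create an in-memory database with a reduced operation_runs table."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript("""
        CREATE TABLE operation_runs (
            id INTEGER PRIMARY KEY,
            tenant_id INTEGER NOT NULL,
            type TEXT NOT NULL,
            status TEXT NOT NULL,
            outcome TEXT NOT NULL,
            run_identity_hash TEXT NOT NULL,
            failure_summary TEXT NOT NULL DEFAULT '[]'
        );
        CREATE UNIQUE INDEX operation_runs_active_unique
            ON operation_runs (tenant_id, run_identity_hash)
            WHERE status IN ('queued', 'running');
    """)
    return conn


def run_identity_hash(tenant_id, run_type, inputs):
    """SHA-256 over a canonical JSON payload (assumed canonicalization)."""
    payload = json.dumps(
        {"tenant_id": tenant_id, "type": run_type, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def start_run(conn, tenant_id, run_type, inputs, dispatch):
    """Insert a queued run or reuse the active one. Returns (row, created)."""
    identity = run_identity_hash(tenant_id, run_type, inputs)
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO operation_runs "
                "(tenant_id, type, status, outcome, run_identity_hash) "
                "VALUES (?, ?, 'queued', 'pending', ?)",
                (tenant_id, run_type, identity),
            )
            run_id = cur.lastrowid
    except sqlite3.IntegrityError:
        # The partial unique index rejected the insert: an active run exists.
        # (It may be None if that run finished in between; callers could retry.)
        row = conn.execute(
            "SELECT * FROM operation_runs WHERE tenant_id = ? "
            "AND run_identity_hash = ? AND status IN ('queued', 'running')",
            (tenant_id, identity),
        ).fetchone()
        return row, False

    try:
        dispatch(run_id)  # stand-in for queueing the background job
    except Exception:
        # Never leave a misleading queued run behind.
        failure = [{"code": "queue.dispatch_failed",
                    "message": "The job could not be queued."}]
        with conn:
            conn.execute(
                "UPDATE operation_runs SET status = 'completed', "
                "outcome = 'failed', failure_summary = ? WHERE id = ?",
                (json.dumps(failure), run_id),
            )
    row = conn.execute(
        "SELECT * FROM operation_runs WHERE id = ?", (run_id,)
    ).fetchone()
    return row, True
```

Because a failed dispatch moves the row to `completed`, it drops out of the partial index, so a later request with the same identity hash can create a fresh run instead of being deduplicated against the failed one.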