TenantAtlas/specs/054-unify-runs-suitewide/research.md
2026-01-16 19:06:30 +01:00

# Research: Unified Operations Runs Suitewide
## 1. Technical Context & Unknowns
**Unknowns Resolved**:
- **Transition Strategy**: Parallel write. Existing legacy tables (e.g., `inventory_sync_runs`, `restore_runs`) keep receiving writes for now, but the Monitoring UI reads exclusively from `operation_runs`.
- **Restore Adapter**: `RestoreRun` remains the domain source of truth. An `OperationRun` record will be created as a "shadow" or "adapter" record. This requires hooking into `RestoreRun` lifecycle events or the service layer to keep them in sync.
- **Run Logic Location**: Existing jobs like `RunInventorySyncJob` will be updated to manage the `OperationRun` state.
- **Concurrency**: Enforced by partial unique index on `(tenant_id, run_identity_hash)` where status is active (`queued`, `running`).
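A minimal sketch of how `run_identity_hash` might be computed; the helper name and input shape are illustrative, not fixed by this spec. The key point is canonicalization: logically identical requests must always produce the same hash, or the partial unique index cannot deduplicate them.

```python
import hashlib
import json

def run_identity_hash(tenant_id: int, operation_type: str, inputs: dict) -> str:
    """Hypothetical helper: derive a stable SHA-256 identity for a run.

    Inputs are canonicalized (sorted keys, compact separators) so that
    logically identical requests always hash to the same value.
    """
    canonical = json.dumps(
        {"tenant_id": tenant_id, "type": operation_type, "inputs": inputs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Note that `sort_keys=True` makes the hash independent of the order in which callers assemble the inputs.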
## 2. Technology Choices
| Area | Decision | Rationale | Alternatives |
|------|----------|-----------|--------------|
| **Schema** | `operation_runs` table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (Complex, harder to paginate/sort efficiently). |
| **Restore Integration** | Physical Adapter Row | Decouples Monitoring from Restore domain specifics. Allows uniform "list all runs" queries. The `context` JSON column will store `{ "restore_run_id": ... }`. | Polymorphic relation (Overhead for a single exception). |
| **Idempotency** | DB Partial Unique Index | Hard guarantee against race conditions. Simpler than distributed locks (Redis) which can expire or fail. | Redis Lock (Soft guarantee), Application check (Race prone). |
| **Initiator** | Nullable FK + Name | Handles both Users (FK) and System/Scheduler (Name "System") uniformly. | Polymorphic relation (Overkill for simple auditing). |
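To illustrate the "Physical Adapter Row" decision, here is the rough shape of an `OperationRun` shadowing a restore. Field names follow the schema in section 3; the `type` string and all values are invented for the example.

```python
# Illustrative shape of an adapter OperationRun row shadowing a RestoreRun.
# The "restore.run" type string is an assumption, not fixed by this spec.
adapter_run = {
    "type": "restore.run",
    "status": "running",
    "outcome": "pending",
    "initiator_name": "System",   # system-initiated, so no user FK
    "user_id": None,
    "context": {"restore_run_id": 123},  # back-pointer to the domain record
}
```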
## 3. Implementation Patterns
### Canonical Run Lifecycle
1. **Start Request**:
- Compute `run_identity_hash` from the normalized inputs.
- Attempt an `INSERT` into `operation_runs`; the partial unique index rejects it if an active run already exists.
- On conflict, return the existing active run (idempotency).
- On success, dispatch the job.
2. **Job Execution**:
- Set status to `running` and record `started_at`.
- Perform the work.
- Set status to `completed` and outcome to `succeeded`/`partially_succeeded`/`failed`, recording `completed_at`.
3. **Restore Adapter**:
- When `RestoreRun` is created, create `OperationRun` (queued/running).
- When `RestoreRun` updates (status change), update `OperationRun`.
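The start-request steps above can be sketched as follows. The in-memory registry stands in for the database, and the conflict check mimics the partial unique index on `(tenant_id, run_identity_hash)`; all names are illustrative, and a real implementation would rely on the index itself rather than an application-level scan.

```python
import hashlib

ACTIVE_STATUSES = {"queued", "running"}

class RunRegistry:
    """Toy stand-in for the operation_runs table plus its partial unique index."""

    def __init__(self):
        self.runs = []  # each run: {"tenant_id", "hash", "status"}

    def start(self, tenant_id: int, raw_inputs: str) -> dict:
        identity = hashlib.sha256(f"{tenant_id}:{raw_inputs}".encode()).hexdigest()
        # Mimic the partial unique index: at most one ACTIVE run per
        # (tenant_id, run_identity_hash).
        for run in self.runs:
            if (run["tenant_id"], run["hash"]) == (tenant_id, identity) \
                    and run["status"] in ACTIVE_STATUSES:
                return run  # idempotent: return the existing active run
        run = {"tenant_id": tenant_id, "hash": identity, "status": "queued"}
        self.runs.append(run)
        # ... a real implementation would dispatch the job here ...
        return run
```

Calling `start` twice with the same inputs returns the same record; once that run leaves an active status, a new run for the same identity may start.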
### Data Model
```sql
CREATE TABLE operation_runs (
    id BIGSERIAL PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    user_id BIGINT NULL REFERENCES users(id), -- Initiator
    initiator_name VARCHAR(255) NOT NULL,     -- "John Doe" or "System"
    type VARCHAR(255) NOT NULL,               -- "inventory.sync"
    status VARCHAR(50) NOT NULL,              -- queued, running, completed
    outcome VARCHAR(50) NOT NULL,             -- pending, succeeded, partially_succeeded, failed, cancelled
    run_identity_hash VARCHAR(64) NOT NULL,   -- SHA256(tenant_id + inputs)
    summary_counts JSONB DEFAULT '{}',        -- { success: 10, failed: 2 }
    failure_summary JSONB DEFAULT '[]',       -- [{ code: "ERR_TIMEOUT", message: "..." }]
    context JSONB DEFAULT '{}',               -- { selection: [...], restore_run_id: 123 }
    started_at TIMESTAMP NULL,
    completed_at TIMESTAMP NULL,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);
CREATE UNIQUE INDEX operation_runs_active_unique
ON operation_runs (tenant_id, run_identity_hash)
WHERE status IN ('queued', 'running');
```
## 4. Risks & Mitigations
- **Risk**: Desync between `RestoreRun` and `OperationRun`.
  - **Mitigation**: Use model observers or service-layer wrapping to keep the two updates atomic-like, or accept slight eventual consistency (Monitoring may lag milliseconds behind the Restore UI).
- **Risk**: Legacy runs not appearing.
  - **Mitigation**: Legacy runs are NOT backfilled; only runs started after deployment appear in the new Monitoring UI. This is acceptable for Phase 1.
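One way to keep the adapter row in step with `RestoreRun` is a pure status-mapping function invoked from the observer or service layer. The mapping below is a hedged guess at sensible defaults: the `RestoreRun` status names are assumptions, while the `(status, outcome)` pairs follow the schema in section 3.

```python
# Hypothetical mapping from RestoreRun statuses (assumed names) to the
# OperationRun (status, outcome) pair defined in the schema above.
RESTORE_TO_OPERATION = {
    "pending":   ("queued", "pending"),
    "running":   ("running", "pending"),
    "succeeded": ("completed", "succeeded"),
    "partial":   ("completed", "partially_succeeded"),
    "failed":    ("completed", "failed"),
    "cancelled": ("completed", "cancelled"),
}

def sync_operation_run(restore_status: str) -> tuple[str, str]:
    """Return the (status, outcome) the adapter OperationRun should take."""
    return RESTORE_TO_OPERATION[restore_status]
```

Keeping the mapping in one place means the observer and any service-layer callers cannot drift apart on how a `RestoreRun` state is projected.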