TenantAtlas/specs/054-unify-runs-suitewide/research.md
2026-01-16 19:06:30 +01:00


Research: Unified Operations Runs Suitewide

1. Technical Context & Unknowns

Unknowns Resolved:

  • Transition Strategy: Parallel write. We will maintain existing legacy tables (e.g., inventory_sync_runs, restore_runs) for now but strictly use operation_runs for the Monitoring UI.
  • Restore Adapter: RestoreRun remains the domain source of truth. An OperationRun record will be created as a "shadow" or "adapter" record. This requires hooking into RestoreRun lifecycle events or the service layer to keep them in sync.
  • Run Logic Location: Existing jobs like RunInventorySyncJob will be updated to manage the OperationRun state.
  • Concurrency: Enforced by partial unique index on (tenant_id, run_identity_hash) where status is active (queued, running).
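
The concurrency guard depends on every caller producing a stable run_identity_hash for the same logical request. A minimal sketch of how it could be computed (Python for illustration; the helper name and canonicalization scheme are assumptions, not the actual implementation):

```python
import hashlib
import json

def run_identity_hash(tenant_id: int, inputs: dict) -> str:
    """Hypothetical helper: SHA-256 over tenant_id plus canonicalized inputs.

    Inputs are serialized with sorted keys and compact separators so that
    logically identical requests hash identically regardless of key order.
    """
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{tenant_id}:{canonical}".encode()).hexdigest()
```

Key-order independence matters here because the hash feeds the partial unique index: two racing requests with the same inputs must collide on the same value.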

2. Technology Choices

| Area | Decision | Rationale | Alternatives |
| --- | --- | --- | --- |
| Schema | `operation_runs` table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (complex; harder to paginate/sort efficiently). |
| Restore Integration | Physical adapter row | Decouples Monitoring from Restore domain specifics; allows uniform "list all runs" queries. The `context` JSON column will store `{ "restore_run_id": ... }`. | Polymorphic relation (overhead for a single exception). |
| Idempotency | DB partial unique index | Hard guarantee against race conditions; simpler than distributed locks (e.g., Redis), which can expire or fail. | Redis lock (soft guarantee); application-level check (race-prone). |
| Initiator | Nullable FK + name | Handles both users (FK) and system/scheduler (name "System") uniformly. | Polymorphic relation (overkill for simple auditing). |
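
The initiator decision (nullable FK plus display name) and the adapter's `context` column can be sketched together. This payload builder is hypothetical; field and function names are assumptions chosen to match the schema below:

```python
from typing import Optional

def build_operation_run_payload(
    tenant_id: int,
    run_type: str,
    user_name: Optional[str] = None,
    restore_run_id: Optional[int] = None,
) -> dict:
    """Hypothetical builder for an operation_runs row.

    A human initiator supplies user_name; scheduler/system runs fall back
    to the literal "System". For Restore adapter rows, the originating
    RestoreRun id is tucked into the context JSON column.
    """
    context = {}
    if restore_run_id is not None:
        context["restore_run_id"] = restore_run_id
    return {
        "tenant_id": tenant_id,
        "type": run_type,
        "initiator_name": user_name or "System",
        "status": "queued",
        "outcome": "pending",
        "context": context,
    }
```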

3. Implementation Patterns

Canonical Run Lifecycle

  1. Start Request:
    • Compute run_identity_hash from inputs.
    • Attempt INSERT into operation_runs (ignore conflict if active).
    • If active run exists, return it (Idempotency).
    • If new, dispatch Job.
  2. Job Execution:
    • Update status to running.
    • Perform work.
    • On finish, set status to completed and record the outcome (succeeded / partially_succeeded / failed), along with completed_at and summary counts.
  3. Restore Adapter:
    • When RestoreRun is created, create OperationRun (queued/running).
    • When RestoreRun updates (status change), update OperationRun.
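
The start-request steps above can be sketched end to end. This is a toy, in-memory stand-in (Python for illustration; the store and function names are hypothetical). In production, the partial unique index, not application code, is what actually guards against races; the lookup here only illustrates the control flow:

```python
import hashlib
import json

class InMemoryRunStore:
    """Toy stand-in for the operation_runs table."""

    ACTIVE = {"queued", "running"}

    def __init__(self):
        self.runs = []

    def find_active(self, tenant_id, identity_hash):
        for run in self.runs:
            if (run["tenant_id"] == tenant_id
                    and run["run_identity_hash"] == identity_hash
                    and run["status"] in self.ACTIVE):
                return run
        return None

    def insert(self, run):
        self.runs.append(run)
        return run

def start_run(store, tenant_id, run_type, inputs, dispatch):
    # 1. Compute run_identity_hash from the inputs.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    identity = hashlib.sha256(f"{tenant_id}:{canonical}".encode()).hexdigest()
    # 2-3. If an active run already exists, return it (idempotency).
    existing = store.find_active(tenant_id, identity)
    if existing is not None:
        return existing, False
    # 4. Otherwise create the run and dispatch the job.
    run = store.insert({
        "tenant_id": tenant_id,
        "type": run_type,
        "status": "queued",
        "run_identity_hash": identity,
    })
    dispatch(run)
    return run, True
```

A repeated start request with identical inputs returns the existing active run and dispatches nothing.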

Data Model

CREATE TABLE operation_runs (
    id BIGSERIAL PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    user_id BIGINT NULL REFERENCES users(id), -- Initiator
    initiator_name VARCHAR(255) NOT NULL, -- "John Doe" or "System"
    type VARCHAR(255) NOT NULL, -- "inventory.sync"
    status VARCHAR(50) NOT NULL, -- queued, running, completed
    outcome VARCHAR(50) NOT NULL, -- pending, succeeded, partially_succeeded, failed, cancelled
    run_identity_hash VARCHAR(64) NOT NULL, -- SHA256(tenant_id + inputs)
    summary_counts JSONB DEFAULT '{}', -- { success: 10, failed: 2 }
    failure_summary JSONB DEFAULT '[]', -- [{ code: "ERR_TIMEOUT", message: "..." }]
    context JSONB DEFAULT '{}', -- { selection: [...], restore_run_id: 123 }
    started_at TIMESTAMP NULL,
    completed_at TIMESTAMP NULL,
    created_at TIMESTAMP NOT NULL DEFAULT now(),
    updated_at TIMESTAMP NOT NULL DEFAULT now()
);

CREATE UNIQUE INDEX operation_runs_active_unique 
ON operation_runs (tenant_id, run_identity_hash) 
WHERE status IN ('queued', 'running');
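
The idempotency guarantee of the partial index above can be exercised directly. The sketch below uses SQLite as a stand-in, since it also supports partial unique indexes and Postgres-style `ON CONFLICT DO NOTHING` (SQLite 3.24+); the schema is trimmed to the relevant columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE operation_runs (
    id INTEGER PRIMARY KEY,
    tenant_id INTEGER NOT NULL,
    status TEXT NOT NULL,
    run_identity_hash TEXT NOT NULL
);
CREATE UNIQUE INDEX operation_runs_active_unique
ON operation_runs (tenant_id, run_identity_hash)
WHERE status IN ('queued', 'running');
""")

def try_start(tenant_id: int, identity: str) -> bool:
    """Returns True if a new run row was created, False if an active
    duplicate already held the slot (the insert was a no-op)."""
    cur = conn.execute(
        "INSERT INTO operation_runs (tenant_id, status, run_identity_hash) "
        "VALUES (?, 'queued', ?) ON CONFLICT DO NOTHING",
        (tenant_id, identity),
    )
    return cur.rowcount == 1
```

Once a run leaves the active statuses, the partial index no longer covers its row, so the same identity hash can be started again.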

4. Risks & Mitigations

  • Risk: Desync between RestoreRun and OperationRun.
    • Mitigation: Use model observers or service-layer wrapping to keep the two records updated together, or accept slight eventual consistency (Monitoring may lag milliseconds behind the Restore UI).
  • Risk: Legacy runs not appearing.
    • Mitigation: We are NOT backfilling legacy runs. Only new runs after deployment will appear in the new Monitoring UI. This is acceptable for "Phase 1".
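
The desync risk is easier to bound if the RestoreRun-to-OperationRun translation lives in a single pure function called from the observer or service-layer hook. A hypothetical mapping (the RestoreRun status names are assumptions; the right-hand pairs follow the status/outcome split in the schema above):

```python
from typing import Tuple

# Hypothetical RestoreRun statuses; the real domain model may differ.
_STATUS_MAP = {
    "pending":   ("queued",    "pending"),
    "running":   ("running",   "pending"),
    "succeeded": ("completed", "succeeded"),
    "partial":   ("completed", "partially_succeeded"),
    "failed":    ("completed", "failed"),
    "cancelled": ("completed", "cancelled"),
}

def mirror_restore_status(restore_status: str) -> Tuple[str, str]:
    """Map a RestoreRun status onto the (status, outcome) pair of its
    shadow OperationRun row. Keeping this pure makes the sync hook
    trivial to unit-test and failing loudly on unmapped statuses
    surfaces desync bugs early."""
    try:
        return _STATUS_MAP[restore_status]
    except KeyError:
        raise ValueError(f"unmapped RestoreRun status: {restore_status}")
```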