TenantAtlas/specs/054-unify-runs-suitewide/research.md
2026-01-16 19:06:30 +01:00


Research: Unified Operations Runs Suitewide

1. Technical Context & Unknowns

Unknowns Resolved:

  • Transition Strategy: Parallel write. We will maintain existing legacy tables (e.g., inventory_sync_runs, restore_runs) for now but strictly use operation_runs for the Monitoring UI.
  • Restore Adapter: RestoreRun remains the domain source of truth. An OperationRun record will be created as a "shadow" or "adapter" record. This requires hooking into RestoreRun lifecycle events or the service layer to keep them in sync.
  • Run Logic Location: Existing jobs like RunInventorySyncJob will be updated to manage the OperationRun state.
  • Concurrency: Enforced by partial unique index on (tenant_id, run_identity_hash) where status is active (queued, running).
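
The concurrency guard depends on every caller producing a stable run_identity_hash for the same logical request. A minimal sketch of how it could be computed (Python for illustration; the helper name and canonicalization scheme are assumptions, not the actual implementation):

```python
import hashlib
import json

def run_identity_hash(tenant_id: int, inputs: dict) -> str:
    """Hypothetical helper: SHA-256 over tenant_id plus canonicalized inputs.

    Inputs are serialized with sorted keys and compact separators so that
    logically identical requests hash identically regardless of key order.
    """
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{tenant_id}:{canonical}".encode()).hexdigest()
```

Key-order independence matters here because the hash feeds the partial unique index: two racing requests with the same inputs must collide on the same value.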

2. Technology Choices

| Area | Decision | Rationale | Alternatives |
| --- | --- | --- | --- |
| Schema | `operation_runs` table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (complex; harder to paginate/sort efficiently). |
| Restore Integration | Physical adapter row | Decouples Monitoring from Restore domain specifics; allows uniform "list all runs" queries. The `context` JSON column will store `{ "restore_run_id": ... }`. | Polymorphic relation (overhead for a single exception). |
| Idempotency | DB partial unique index | Hard guarantee against race conditions; simpler than distributed locks (e.g., Redis), which can expire or fail. | Redis lock (soft guarantee); application-level check (race-prone). |
| Initiator | Nullable FK + name | Handles both users (FK) and system/scheduler (name "System") uniformly. | Polymorphic relation (overkill for simple auditing). |
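
The initiator decision (nullable FK plus display name) and the adapter's `context` column can be sketched together. This payload builder is hypothetical; field and function names are assumptions chosen to match the schema below:

```python
from typing import Optional

def build_operation_run_payload(
    tenant_id: int,
    run_type: str,
    user_name: Optional[str] = None,
    restore_run_id: Optional[int] = None,
) -> dict:
    """Hypothetical builder for an operation_runs row.

    A human initiator supplies user_name; scheduler/system runs fall back
    to the literal "System". For Restore adapter rows, the originating
    RestoreRun id is tucked into the context JSON column.
    """
    context = {}
    if restore_run_id is not None:
        context["restore_run_id"] = restore_run_id
    return {
        "tenant_id": tenant_id,
        "type": run_type,
        "initiator_name": user_name or "System",
        "status": "queued",
        "outcome": "pending",
        "context": context,
    }
```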

3. Implementation Patterns

Canonical Run Lifecycle

  1. Start Request:
    • Compute run_identity_hash from inputs.
    • Attempt INSERT into operation_runs (ignore conflict if active).
    • If active run exists, return it (Idempotency).
    • If new, dispatch Job.
  2. Job Execution:
    • Update status to running.
    • Perform work.
    • On finish, set status to completed and record the outcome (succeeded / partially_succeeded / failed), along with completed_at and summary counts.
  3. Restore Adapter:
    • When RestoreRun is created, create OperationRun (queued/running).
    • When RestoreRun updates (status change), update OperationRun.
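
The start-request steps above can be sketched end to end. This is a toy, in-memory stand-in (Python for illustration; the store and function names are hypothetical). In production, the partial unique index, not application code, is what actually guards against races; the lookup here only illustrates the control flow:

```python
import hashlib
import json

class InMemoryRunStore:
    """Toy stand-in for the operation_runs table."""

    ACTIVE = {"queued", "running"}

    def __init__(self):
        self.runs = []

    def find_active(self, tenant_id, identity_hash):
        for run in self.runs:
            if (run["tenant_id"] == tenant_id
                    and run["run_identity_hash"] == identity_hash
                    and run["status"] in self.ACTIVE):
                return run
        return None

    def insert(self, run):
        self.runs.append(run)
        return run

def start_run(store, tenant_id, run_type, inputs, dispatch):
    # 1. Compute run_identity_hash from the inputs.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    identity = hashlib.sha256(f"{tenant_id}:{canonical}".encode()).hexdigest()
    # 2-3. If an active run already exists, return it (idempotency).
    existing = store.find_active(tenant_id, identity)
    if existing is not None:
        return existing, False
    # 4. Otherwise create the run and dispatch the job.
    run = store.insert({
        "tenant_id": tenant_id,
        "type": run_type,
        "status": "queued",
        "run_identity_hash": identity,
    })
    dispatch(run)
    return run, True
```

A repeated start request with identical inputs returns the existing active run and dispatches nothing.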

Data Model

CREATE TABLE operation_runs (
    id BIGSERIAL PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    user_id BIGINT NULL REFERENCES users(id), -- Initiator
    initiator_name VARCHAR(255) NOT NULL, -- "John Doe" or "System"
    type VARCHAR(255) NOT NULL, -- "inventory.sync"
    status VARCHAR(50) NOT NULL, -- queued, running, completed
    outcome VARCHAR(50) NOT NULL, -- pending, succeeded, partially_succeeded, failed, cancelled
    run_identity_hash VARCHAR(64) NOT NULL, -- SHA256(tenant_id + inputs)
    summary_counts JSONB DEFAULT '{}', -- { success: 10, failed: 2 }
    failure_summary JSONB DEFAULT '[]', -- [{ code: "ERR_TIMEOUT", message: "..." }]
    context JSONB DEFAULT '{}', -- { selection: [...], restore_run_id: 123 }
    started_at TIMESTAMP NULL,
    completed_at TIMESTAMP NULL,
    created_at TIMESTAMP NOT NULL DEFAULT now(),
    updated_at TIMESTAMP NOT NULL DEFAULT now()
);

CREATE UNIQUE INDEX operation_runs_active_unique 
ON operation_runs (tenant_id, run_identity_hash) 
WHERE status IN ('queued', 'running');
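
The idempotency guarantee of the partial index above can be exercised directly. The sketch below uses SQLite as a stand-in, since it also supports partial unique indexes and Postgres-style `ON CONFLICT DO NOTHING` (SQLite 3.24+); the schema is trimmed to the relevant columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE operation_runs (
    id INTEGER PRIMARY KEY,
    tenant_id INTEGER NOT NULL,
    status TEXT NOT NULL,
    run_identity_hash TEXT NOT NULL
);
CREATE UNIQUE INDEX operation_runs_active_unique
ON operation_runs (tenant_id, run_identity_hash)
WHERE status IN ('queued', 'running');
""")

def try_start(tenant_id: int, identity: str) -> bool:
    """Returns True if a new run row was created, False if an active
    duplicate already held the slot (the insert was a no-op)."""
    cur = conn.execute(
        "INSERT INTO operation_runs (tenant_id, status, run_identity_hash) "
        "VALUES (?, 'queued', ?) ON CONFLICT DO NOTHING",
        (tenant_id, identity),
    )
    return cur.rowcount == 1
```

Once a run leaves the active statuses, the partial index no longer covers its row, so the same identity hash can be started again.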

4. Risks & Mitigations

  • Risk: Desync between RestoreRun and OperationRun.
    • Mitigation: Use model observers or service-layer wrapping to keep the two records updated together, or accept slight eventual consistency (Monitoring may lag milliseconds behind the Restore UI).
  • Risk: Legacy runs not appearing.
    • Mitigation: We are NOT backfilling legacy runs. Only new runs after deployment will appear in the new Monitoring UI. This is acceptable for "Phase 1".
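
The desync risk is easier to bound if the RestoreRun-to-OperationRun translation lives in a single pure function called from the observer or service-layer hook. A hypothetical mapping (the RestoreRun status names are assumptions; the right-hand pairs follow the status/outcome split in the schema above):

```python
from typing import Tuple

# Hypothetical RestoreRun statuses; the real domain model may differ.
_STATUS_MAP = {
    "pending":   ("queued",    "pending"),
    "running":   ("running",   "pending"),
    "succeeded": ("completed", "succeeded"),
    "partial":   ("completed", "partially_succeeded"),
    "failed":    ("completed", "failed"),
    "cancelled": ("completed", "cancelled"),
}

def mirror_restore_status(restore_status: str) -> Tuple[str, str]:
    """Map a RestoreRun status onto the (status, outcome) pair of its
    shadow OperationRun row. Keeping this pure makes the sync hook
    trivial to unit-test and failing loudly on unmapped statuses
    surfaces desync bugs early."""
    try:
        return _STATUS_MAP[restore_status]
    except KeyError:
        raise ValueError(f"unmapped RestoreRun status: {restore_status}")
```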