Research: Unified Operations Runs Suitewide

1. Technical Context & Unknowns

Unknowns Resolved:

  • Transition Strategy: Parallel write. Existing legacy tables (e.g., inventory_sync_runs, restore_runs) stay in place for now, but the Monitoring UI reads exclusively from operation_runs.
  • Restore Adapter: RestoreRun remains the domain source of truth. An OperationRun adapter row will be created once a restore run reaches previewed (and later statuses), and will be kept in sync via RestoreRun lifecycle events or service-layer wrapping.
  • Run Logic Location: Existing jobs like RunInventorySyncJob will be updated to manage the OperationRun state.
  • Concurrency: Enforced by a partial unique index on (tenant_id, run_identity_hash) that covers only active runs (status queued or running).
  • Dispatch Failure Semantics: If queue dispatch fails, the system immediately completes the run as failed (e.g., with failure code queue.dispatch_failed) and shows a clear UI message, so that no run is left misleadingly in queued.
  • Notifications on Dedupe: Only the original initiator (operation_runs.user_id) receives queued/terminal notifications; reusers of an active run do not get additional notifications.
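
The dispatch-failure and notification rules above can be sketched in a few lines. This is only an illustration in Python, not the project's implementation: the OperationRun dataclass, dispatch_or_fail, and should_notify are hypothetical names, the failure message text is a placeholder, and the rule that system-initiated runs (user_id is None) notify no user is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class OperationRun:
    """In-memory stand-in for an operation_runs row (illustrative only)."""
    type: str
    user_id: Optional[int]          # None for system/scheduler runs
    status: str = "queued"
    outcome: str = "pending"
    failure_summary: list = field(default_factory=list)


def dispatch_or_fail(run: OperationRun,
                     dispatch: Callable[[OperationRun], None]) -> OperationRun:
    """Dispatch the job; if dispatch raises, complete the run as failed at once."""
    try:
        dispatch(run)
    except Exception:
        # Never leave a misleading queued run behind.
        run.status = "completed"
        run.outcome = "failed"
        run.failure_summary.append({
            "code": "queue.dispatch_failed",
            "message": "The operation could not be queued.",  # placeholder text
        })
    return run


def should_notify(run: OperationRun, user_id: Optional[int]) -> bool:
    """Only the original initiator is notified; reusers of an active run are not."""
    return run.user_id is not None and run.user_id == user_id
```

A caller that reuses an active run would check should_notify with its own user id and skip the notification when the result is False.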

2. Technology Choices

| Area | Decision | Rationale | Alternatives |
|---|---|---|---|
| Schema | operation_runs table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (complex, harder to paginate/sort efficiently). |
| Restore Integration | Physical Adapter Row | Decouples Monitoring from Restore domain specifics. Allows uniform "list all runs" queries. The context JSON column will store { "restore_run_id": ... }. | Polymorphic relation (overhead for a single exception). |
| Idempotency | DB Partial Unique Index | Hard guarantee against race conditions. Simpler than distributed locks (Redis), which can expire or fail. | Redis lock (soft guarantee), application check (race prone). |
| Initiator | Nullable FK + Name | Handles both Users (FK) and System/Scheduler (Name "System") uniformly. | Polymorphic relation (overkill for simple auditing). |
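
The idempotency decision depends on a stable run_identity_hash. A minimal sketch of one way to derive it, assuming the inputs are JSON-serializable and that canonical JSON (sorted keys, compact separators) is an acceptable serialization; the function name and the serialization choice are assumptions, not something this document specifies:

```python
import hashlib
import json


def run_identity_hash(tenant_id: int, inputs: dict) -> str:
    """SHA-256 over tenant_id and inputs, serialized as canonical JSON."""
    payload = json.dumps(
        {"tenant_id": tenant_id, "inputs": inputs},
        sort_keys=True,              # key order must not change the hash
        separators=(",", ":"),       # no whitespace differences
    )
    # hexdigest() is 64 hex characters, which fits a VARCHAR(64) column.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

If two operation types can share identical inputs for the same tenant, the type would also have to be part of the hashed payload; whether that applies here is not stated in this document.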

3. Implementation Patterns

Canonical Run Lifecycle

  1. Start Request:
    • Compute run_identity_hash from inputs.
    • Attempt INSERT into operation_runs (idempotent; enforced by partial unique index for active runs).
    • If an active run exists, return it (Idempotency).
    • If new, dispatch the background Job.
    • If dispatch fails, immediately mark the run status=completed, outcome=failed with a safe failure code such as queue.dispatch_failed.
  2. Job Execution:
    • Update status to running.
    • Perform work.
    • Update status to completed and set terminal outcome (succeeded / partially_succeeded / failed / cancelled).
  3. Restore Adapter:
    • Create the adapter row only once RestoreRunStatus=previewed (or later) is reached.
    • Map RestoreRunStatus=previewed to OperationRun.status=queued and OperationRun.outcome=pending.
    • Keep the adapter updated as the restore progresses:
      • queued → status=queued, outcome=pending
      • running → status=running, outcome=pending
      • completed → status=completed, outcome=succeeded
      • partial → status=completed, outcome=partially_succeeded
      • failed → status=completed, outcome=failed
      • cancelled → status=completed, outcome=cancelled
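
The restore adapter mapping above fits in a small lookup table. The sketch below is illustrative Python, not the project's code; RESTORE_TO_OPERATION and map_restore_status are hypothetical names, and returning None for statuses earlier than previewed is an assumption about how "no adapter row yet" would be signalled.

```python
from typing import Optional, Tuple

# RestoreRun status -> (OperationRun.status, OperationRun.outcome)
RESTORE_TO_OPERATION = {
    "previewed": ("queued", "pending"),
    "queued": ("queued", "pending"),
    "running": ("running", "pending"),
    "completed": ("completed", "succeeded"),
    "partial": ("completed", "partially_succeeded"),
    "failed": ("completed", "failed"),
    "cancelled": ("completed", "cancelled"),
}


def map_restore_status(restore_status: str) -> Optional[Tuple[str, str]]:
    """Return (status, outcome) for the adapter row, or None if none should exist yet."""
    return RESTORE_TO_OPERATION.get(restore_status)
```

A lifecycle hook on RestoreRun could call map_restore_status on every status change and create or update the adapter row only when the result is not None.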

Data Model

CREATE TABLE operation_runs (
    id BIGSERIAL PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    user_id BIGINT NULL REFERENCES users(id), -- Initiator
    initiator_name VARCHAR(255) NOT NULL, -- "John Doe" or "System"
    type VARCHAR(255) NOT NULL, -- "inventory.sync"
    status VARCHAR(50) NOT NULL, -- queued, running, completed
    outcome VARCHAR(50) NOT NULL, -- pending, succeeded, partially_succeeded, failed, cancelled
    run_identity_hash VARCHAR(64) NOT NULL, -- SHA256(tenant_id + inputs)
    summary_counts JSONB DEFAULT '{}', -- { success: 10, failed: 2 }
    failure_summary JSONB DEFAULT '[]', -- [{ code: "ERR_TIMEOUT", message: "..." }]
    context JSONB DEFAULT '{}', -- { selection: [...], restore_run_id: 123 }
    started_at TIMESTAMP NULL,
    completed_at TIMESTAMP NULL,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

CREATE UNIQUE INDEX operation_runs_active_unique 
ON operation_runs (tenant_id, run_identity_hash) 
WHERE status IN ('queued', 'running');
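
To show how the partial unique index drives the idempotent start request, here is a small runnable sketch. It uses Python's sqlite3 module with an in-memory database as a stand-in for PostgreSQL (SQLite also supports partial indexes), and a reduced column set; start_run is a hypothetical helper, not part of the codebase.

```python
import sqlite3

# Reduced schema: only the columns needed to demonstrate the index.
SCHEMA = """
CREATE TABLE operation_runs (
    id INTEGER PRIMARY KEY,
    tenant_id INTEGER NOT NULL,
    type TEXT NOT NULL,
    status TEXT NOT NULL,
    outcome TEXT NOT NULL,
    run_identity_hash TEXT NOT NULL
);
CREATE UNIQUE INDEX operation_runs_active_unique
ON operation_runs (tenant_id, run_identity_hash)
WHERE status IN ('queued', 'running');
"""


def start_run(conn, tenant_id, op_type, identity_hash):
    """Insert a queued run, or reuse the active run with the same identity.

    Returns (run_id, created).
    """
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO operation_runs "
                "(tenant_id, type, status, outcome, run_identity_hash) "
                "VALUES (?, ?, 'queued', 'pending', ?)",
                (tenant_id, op_type, identity_hash),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The partial unique index rejected the insert: an active run exists.
        row = conn.execute(
            "SELECT id FROM operation_runs "
            "WHERE tenant_id = ? AND run_identity_hash = ? "
            "AND status IN ('queued', 'running')",
            (tenant_id, identity_hash),
        ).fetchone()
        if row is None:
            # The active run finished between the INSERT and the SELECT.
            raise RuntimeError("active run finished concurrently; retry")
        return row[0], False


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    print(start_run(conn, 1, "inventory.sync", "abc"))  # new run
    print(start_run(conn, 1, "inventory.sync", "abc"))  # same run reused
```

Because the index only covers queued and running rows, a completed run does not block a new run with the same identity hash.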

4. Risks & Mitigations

  • Risk: Desync between RestoreRun and OperationRun.
    • Mitigation: Use model observers or service-layer wrapping so that both records are updated together, or accept slight eventual consistency (Monitoring might lag a few milliseconds behind the Restore UI).
  • Risk: Legacy runs not appearing.
    • Mitigation: We are NOT backfilling legacy runs. Only new runs after deployment will appear in the new Monitoring UI. This is acceptable for "Phase 1".
  • Risk: Confusion when a restore that is only previewed appears as queued in Monitoring.
    • Mitigation: Document that restore.execute appears from previewed onward and uses queued/pending until execution begins; Monitoring remains view-only and links to the restore domain detail.