Research: Unified Operations Runs Suitewide
1. Technical Context & Unknowns
Unknowns Resolved:
- Transition Strategy: Parallel write. We will maintain existing legacy tables (e.g., inventory_sync_runs, restore_runs) for now but strictly use operation_runs for the Monitoring UI.
- Restore Adapter: RestoreRun remains the domain source of truth. An OperationRun adapter row will be created once a restore run reaches previewed (and later statuses), and will be kept in sync via RestoreRun lifecycle events or service-layer wrapping.
- Run Logic Location: Existing jobs like RunInventorySyncJob will be updated to manage the OperationRun state.
- Concurrency: Enforced by partial unique index on (tenant_id, run_identity_hash) where status is active (queued, running).
- Dispatch Failure Semantics: If queue dispatch fails, the system will immediately complete the run as failed (e.g., queue.dispatch_failed) and show a clear UI message (never leaving misleading queued runs).
- Notifications on Dedupe: Only the original initiator (operation_runs.user_id) receives queued/terminal notifications; reusers of an active run do not get additional notifications.
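
The Concurrency item depends on run_identity_hash being stable for logically identical requests. A minimal sketch of one way to compute it is shown below, in Python for illustration only. The function name, the inclusion of the run type in the hash, and the canonical-JSON encoding are assumptions; the source only specifies SHA256 over the tenant and inputs.

```python
import hashlib
import json


def run_identity_hash(tenant_id: int, run_type: str, inputs: dict) -> str:
    """Return a 64-character hex SHA-256 digest identifying a run request.

    Assumptions (not from the source): inputs are JSON-serializable, and the
    run type is part of the identity. Keys are sorted so that logically equal
    inputs hash identically regardless of key order.
    """
    canonical = json.dumps(
        {"tenant_id": tenant_id, "type": run_type, "inputs": inputs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys matters here: without a canonical encoding, two requests with the same selection in a different key order would bypass the dedupe index.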
2. Technology Choices
| Area | Decision | Rationale | Alternatives |
|---|---|---|---|
| Schema | operation_runs table | Centralized table allows simple, performant Monitoring queries without complex UNIONs across disparate legacy tables. | Virtual UNION view (Complex, harder to paginate/sort efficiently). |
| Restore Integration | Physical Adapter Row | Decouples Monitoring from Restore domain specifics. Allows uniform "list all runs" queries. The context JSON column will store { "restore_run_id": ... }. | Polymorphic relation (Overhead for a single exception). |
| Idempotency | DB Partial Unique Index | Hard guarantee against race conditions. Simpler than distributed locks (Redis) which can expire or fail. | Redis Lock (Soft guarantee), Application check (Race prone). |
| Initiator | Nullable FK + Name | Handles both Users (FK) and System/Scheduler (Name "System") uniformly. | Polymorphic relation (Overkill for simple auditing). |
3. Implementation Patterns
Canonical Run Lifecycle
- Start Request:
  - Compute run_identity_hash from inputs.
  - Attempt INSERT into operation_runs (idempotent; enforced by partial unique index for active runs).
  - If an active run exists, return it (Idempotency).
  - If new, dispatch the background Job.
  - If dispatch fails, immediately mark the run status=completed, outcome=failed with a safe failure code such as queue.dispatch_failed.
- Job Execution:
  - Update status to running.
  - Perform work.
  - Update status to completed and set terminal outcome (succeeded/partially_succeeded/failed/cancelled).
- Restore Adapter:
  - Create the adapter row only once RestoreRunStatus=previewed (or later) is reached.
  - Map RestoreRunStatus=previewed to OperationRun.status=queued and OperationRun.outcome=pending.
  - Keep the adapter updated as the restore progresses:
    - queued → status=queued, outcome=pending
    - running → status=running, outcome=pending
    - completed → status=completed, outcome=succeeded
    - partial → status=completed, outcome=partially_succeeded
    - failed → status=completed, outcome=failed
    - cancelled → status=completed, outcome=cancelled
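
The Restore Adapter mapping can be kept in a single lookup table so that observers and service-layer code share one definition. The sketch below transcribes the mapping listed above into Python for illustration; the names RESTORE_TO_OPERATION and map_restore_status are hypothetical, and returning None for statuses before previewed is an assumed way to signal "no adapter row yet".

```python
from typing import Optional, Tuple

# RestoreRun status -> (OperationRun.status, OperationRun.outcome),
# transcribed from the Restore Adapter mapping in this section.
RESTORE_TO_OPERATION = {
    "previewed": ("queued", "pending"),
    "queued": ("queued", "pending"),
    "running": ("running", "pending"),
    "completed": ("completed", "succeeded"),
    "partial": ("completed", "partially_succeeded"),
    "failed": ("completed", "failed"),
    "cancelled": ("completed", "cancelled"),
}


def map_restore_status(restore_status: str) -> Optional[Tuple[str, str]]:
    """Return (status, outcome) for the adapter row.

    Returns None for any status not in the table, which this sketch treats
    as "before previewed": no adapter row should exist yet.
    """
    return RESTORE_TO_OPERATION.get(restore_status)
```

A stricter variant could raise on unknown statuses instead of returning None, so that a newly added restore status cannot silently go unmapped.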
Data Model
CREATE TABLE operation_runs (
id BIGSERIAL PRIMARY KEY,
tenant_id BIGINT NOT NULL REFERENCES tenants(id),
user_id BIGINT NULL REFERENCES users(id), -- Initiator
initiator_name VARCHAR(255) NOT NULL, -- "John Doe" or "System"
type VARCHAR(255) NOT NULL, -- "inventory.sync"
status VARCHAR(50) NOT NULL, -- queued, running, completed
outcome VARCHAR(50) NOT NULL, -- pending, succeeded, partially_succeeded, failed, cancelled
run_identity_hash VARCHAR(64) NOT NULL, -- SHA256(tenant_id + inputs)
summary_counts JSONB DEFAULT '{}', -- { success: 10, failed: 2 }
failure_summary JSONB DEFAULT '[]', -- [{ code: "ERR_TIMEOUT", message: "..." }]
context JSONB DEFAULT '{}', -- { selection: [...], restore_run_id: 123 }
started_at TIMESTAMP NULL,
completed_at TIMESTAMP NULL,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
CREATE UNIQUE INDEX operation_runs_active_unique
ON operation_runs (tenant_id, run_identity_hash)
WHERE status IN ('queued', 'running');
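
To illustrate how the partial unique index supports the Start Request flow, here is a minimal sketch in Python using an in-memory SQLite database, which also supports partial indexes. The schema is a reduced stand-in for the table above (SQLite types, fewer columns), and start_run and the dispatch callback are hypothetical names; the real implementation would live in the application's own job and service layer.

```python
import json
import sqlite3

# Reduced stand-in for operation_runs (assumption: SQLite types, fewer columns).
SCHEMA = """
CREATE TABLE operation_runs (
    id INTEGER PRIMARY KEY,
    tenant_id INTEGER NOT NULL,
    type TEXT NOT NULL,
    status TEXT NOT NULL,
    outcome TEXT NOT NULL,
    run_identity_hash TEXT NOT NULL,
    failure_summary TEXT NOT NULL DEFAULT '[]'
);
CREATE UNIQUE INDEX operation_runs_active_unique
    ON operation_runs (tenant_id, run_identity_hash)
    WHERE status IN ('queued', 'running');
"""


def start_run(conn, tenant_id, run_type, identity_hash, dispatch):
    """Return (run_id, created); reuse the active run when one exists."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO operation_runs "
                "(tenant_id, type, status, outcome, run_identity_hash) "
                "VALUES (?, ?, 'queued', 'pending', ?)",
                (tenant_id, run_type, identity_hash),
            )
            run_id = cur.lastrowid
    except sqlite3.IntegrityError:
        # The partial unique index rejected the insert: an active run exists.
        row = conn.execute(
            "SELECT id FROM operation_runs "
            "WHERE tenant_id = ? AND run_identity_hash = ? "
            "AND status IN ('queued', 'running')",
            (tenant_id, identity_hash),
        ).fetchone()
        return row[0], False

    try:
        dispatch(run_id)
    except Exception:
        # Dispatch failed: complete the run as failed instead of leaving
        # a misleading queued row behind.
        failure = json.dumps([{"code": "queue.dispatch_failed"}])
        with conn:
            conn.execute(
                "UPDATE operation_runs SET status = 'completed', "
                "outcome = 'failed', failure_summary = ? WHERE id = ?",
                (failure, run_id),
            )
    return run_id, True
```

A usage example, under the same assumptions:

```python
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
dispatched = []
first, created = start_run(conn, 1, "inventory.sync", "abc", dispatched.append)
again, created_again = start_run(conn, 1, "inventory.sync", "abc", dispatched.append)
# The second call reuses the first run and dispatches nothing new.
```

This sketch does not handle the narrow race where the active run completes between the failed INSERT and the follow-up SELECT; a production version would retry the insert in that case.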
4. Risks & Mitigations
- Risk: Desync between RestoreRun and OperationRun.
  - Mitigation: Use model observers or service-layer wrapping to keep updates as close to atomic as possible, or accept slight eventual consistency (Monitoring might lag a few milliseconds behind the Restore UI).
- Risk: Legacy runs not appearing.
  - Mitigation: We are NOT backfilling legacy runs. Only new runs after deployment will appear in the new Monitoring UI. This is acceptable for "Phase 1".
- Risk: Confusion about queued for restore previewed.
  - Mitigation: Document that restore.execute appears from previewed onward and uses queued/pending until execution begins; Monitoring remains view-only and links to the restore domain detail.