# Phase 1 Data Model: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation

## Overview

This feature does not require a new database table in the first implementation slice. The primary data-model work is the formalization of existing `OperationRun` persistence plus new derived lifecycle-policy and freshness concepts that make queue truth, domain truth, and operator-visible truth converge deterministically.

## Persistent Domain Entities

### OperationRun

**Purpose**: Canonical workspace-scoped operational record for long-running, queued, scheduled, or otherwise operator-visible work.

**Key fields**:

- `id`
- `workspace_id`
- `tenant_id` nullable
- `user_id` nullable
- `type`
- `status` with canonical values `queued`, `running`, `completed`
- `outcome` with canonical values including `pending`, `succeeded`, `partially_succeeded`, `blocked`, `failed`
- `run_identity_hash`
- `summary_counts` JSONB
- `failure_summary` JSONB array
- `context` JSONB
- `started_at`
- `completed_at`
- `created_at`
- `updated_at`

**Relationships**:

- Belongs to one workspace
- Optionally belongs to one tenant
- Optionally belongs to one initiating user

**Validation rules relevant to this feature**:

- `status` and `outcome` transitions remain service-owned via `OperationRunService`.
- Non-terminal active runs remain constrained by the existing active-run unique index semantics.
- `summary_counts` keys remain from the canonical summary key catalog and values remain numeric-only.
- Reconciliation metadata must be stored in a standardized structure inside `context` and `failure_summary` without persisting secrets.


**State transitions relevant to this feature**:

- `queued/pending` → `running/pending`
- `queued/pending` → `completed/failed` when stale queued reconciliation or direct queue-failure bridging resolves an orphaned queued run
- `running/pending` → `completed/succeeded|partially_succeeded|blocked|failed`
- `running/pending` → `completed/failed` when stale running reconciliation resolves an orphaned active run
- `completed/*` is terminal and must never be mutated by reconciliation

### Failed Job Record (`failed_jobs`)

**Purpose**: Infrastructure-level evidence that a queued job exhausted attempts, timed out, or otherwise failed.

**Key fields used conceptually**:

- UUID or failed-job identifier
- connection
- queue
- payload
- exception
- failed_at

**Relationships**:

- Not directly related through a foreign key to `OperationRun`
- Linked back to `OperationRun` through job-owned identity resolution or reconciliation evidence

**Validation rules relevant to this feature**:

- A failed-job record is evidence, not operator-facing truth by itself.
- Evidence may inform reconciliation or diagnostics but must not replace the domain transition on `OperationRun`.

### Queue Job Definition

**Purpose**: The queued class that owns or advances a covered `OperationRun`.

**Key lifecycle-relevant properties**:

- `operationRun` reference or `getOperationRun()` contract
- optional `$timeout`
- optional `$failOnTimeout`
- optional `$tries` or `retryUntil()`
- `middleware()` including `TrackOperationRun` and other queue middleware
- optional `failed(Throwable $e)` callback

**Validation rules relevant to this feature**:

- Covered jobs must provide a credible path to terminal truth through direct failure bridging, fallback reconciliation, or both.
- Covered long-running jobs must have intentional timeout behavior and must be compatible with queue timing invariants.

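
The `OperationRun` state transitions listed earlier in this section can be read as a small lookup table. The sketch below is written in Python purely for illustration and is not the project's implementation; the names `ALLOWED_TRANSITIONS`, `can_transition`, and `force_fail` are hypothetical, and the outcome set is limited to the values this document names explicitly.

```python
from typing import Dict, Set, Tuple

# A run state is a (status, outcome) pair, e.g. ("queued", "pending").
State = Tuple[str, str]

# Terminal outcomes named in this document; the real catalog may contain more.
TERMINAL_OUTCOMES = ("succeeded", "partially_succeeded", "blocked", "failed")

# Hypothetical lookup table mirroring the transition list above.
ALLOWED_TRANSITIONS: Dict[State, Set[State]] = {
    ("queued", "pending"): {
        ("running", "pending"),
        # Stale queued reconciliation or direct queue-failure bridging.
        ("completed", "failed"),
    },
    ("running", "pending"): {("completed", o) for o in TERMINAL_OUTCOMES},
}


def can_transition(current: State, target: State) -> bool:
    """Return True only for transitions listed in the table.

    Any state with status "completed" has no entry, so it is terminal.
    """
    return target in ALLOWED_TRANSITIONS.get(current, set())


def force_fail(current: State) -> State:
    """Resolve an orphaned run to completed/failed, leaving terminal runs untouched."""
    target = ("completed", "failed")
    return target if can_transition(current, target) else current
```

Because `force_fail` returns the current state unchanged for any `completed/*` run, repeating it is harmless, which matches the rule that reconciliation must never mutate terminal runs.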
## New Derived Domain Objects

### OperationLifecyclePolicy

**Purpose**: Configuration-backed policy describing which operation types are in scope for V1 and how their lifecycle should be evaluated.

**Fields**:

- `operationType`
- `covered` boolean
- `queuedStaleAfterSeconds`
- `runningStaleAfterSeconds`
- `expectedMaxRuntimeSeconds`
- `requiresDirectFailedBridge` boolean
- `supportsReconciliation` boolean

**Validation rules**:

- `queuedStaleAfterSeconds` and `runningStaleAfterSeconds` must be positive integers.
- `expectedMaxRuntimeSeconds` must stay below effective queue `retry_after` with safety margin.
- Only covered operation types participate in the generic reconciler for V1.

### OperationRunFreshnessAssessment

**Purpose**: Derived classification used by reconcilers and Monitoring surfaces to determine whether a non-terminal run is still trustworthy as active.

**Fields**:

- `operationRunId`
- `status`
- `freshnessState` with canonical values `fresh`, `likely_stale`, `terminal`, `unknown`
- `evaluatedAt`
- `thresholdSeconds`
- `evidence` key-value map

**Behavior**:

- For `queued` runs, freshness is typically derived from `created_at`, absence of `started_at`, and policy threshold.
- For `running` runs, freshness is derived from `started_at`, last meaningful update evidence available in persisted state, and policy threshold.
- `completed` runs always assess as terminal and are excluded from stale reconciliation.

### LifecycleReconciliationRecord

**Purpose**: Structured reconciliation evidence stored in `OperationRun.context` and mirrored in `failure_summary`.

**Fields**:

- `reconciledAt`
- `reconciliationKind` such as `stale_queued`, `stale_running`, `queue_failure_bridge`, `adapter_sync`
- `reasonCode`
- `reasonMessage`
- `evidence` key-value map
- `source` such as `failed_callback`, `scheduled_reconciler`, `adapter_reconciler`

**Validation rules**:

- Must only be added when the feature force-resolves or directly bridges a run.
- Must be idempotent; repeat reconciliation must not append conflicting terminal truth.
- Must be operator-safe and sanitized.

### OperationQueueFailureBridge

**Purpose**: Derived mapping between a queued job failure and the owning `OperationRun`.

**Fields**:

- `operationRunId`
- `jobClass`
- `bridgeSource` such as `failed_callback` or `reconciler`
- `exceptionClass`
- `reasonCode`
- `terminalOutcome`

**Behavior**:

- Exists conceptually as a design contract, not necessarily as a standalone stored table.
- Bridges queue truth into service-owned `OperationRun` terminal transitions.

## Supporting Catalogs

### Reconciliation Reason Codes

**Purpose**: Stable reason-code catalog for lifecycle healing.

**Initial values**:

- `run.stale_queued`
- `run.stale_running`
- `run.infrastructure_timeout_or_abandonment`
- `run.queue_failure_bridge`
- `run.adapter_out_of_sync`

**Validation rules**:

- Operator-facing text must be derived from centralized presenters or reason translation helpers.
- Codes remain stable enough for regression assertions and audit review.

### Monitoring Freshness State

**Purpose**: Derived presentation state for Operations surfaces.

**Initial values**:

- `fresh_active`
- `likely_stale`
- `reconciled_failed`
- `terminal_normal`

**Behavior**:

- Not stored as a new top-level database enum in V1.
- Derived centrally so tables, detail pages, and notifications do not drift.

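
To illustrate how the freshness assessment and the Monitoring Freshness State could be derived in one central place, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's implementation: the `Run` class, the `reconciled` flag (standing in for "a reconciliation record exists in `context`"), the use of `updated_at` as the last meaningful update, and the choice to present `unknown` as `likely_stale` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Run:
    """Hypothetical, trimmed-down view of an OperationRun row."""
    status: str
    outcome: str
    created_at: datetime
    started_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None
    reconciled: bool = False  # assumption: set when a reconciliation record exists


def assess_freshness(run: Run, now: datetime,
                     queued_stale_after_seconds: int,
                     running_stale_after_seconds: int) -> str:
    """Return one of: fresh, likely_stale, terminal, unknown."""
    if run.status == "completed":
        return "terminal"
    if run.status == "queued" and run.started_at is None:
        age = (now - run.created_at).total_seconds()
        return "likely_stale" if age > queued_stale_after_seconds else "fresh"
    if run.status == "running" and run.started_at is not None:
        # Assumption: updated_at approximates the last meaningful update.
        reference = run.updated_at or run.started_at
        age = (now - reference).total_seconds()
        return "likely_stale" if age > running_stale_after_seconds else "fresh"
    return "unknown"  # inconsistent persisted state, e.g. running without started_at


def monitoring_state(run: Run, freshness: str) -> str:
    """Map a freshness assessment onto the presentation catalog."""
    if freshness == "terminal":
        if run.reconciled and run.outcome == "failed":
            return "reconciled_failed"
        return "terminal_normal"
    if freshness == "fresh":
        return "fresh_active"
    # Assumption: unknown is shown conservatively as likely_stale.
    return "likely_stale"
```

Keeping both functions pure (no database access, `now` passed in) makes the same derivation reusable by tables, detail pages, and notifications, and easy to cover with regression tests.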
## Consumer Mapping

| Consumer | Primary data it needs |
|---|---|
| Generic lifecycle reconciler | Covered operation policy, active non-terminal runs, freshness assessment, standardized reconciliation transition |
| Covered queued jobs | Owning `OperationRun`, timeout behavior, direct `failed()` bridge path |
| Operations index | Current status, outcome, freshness assessment, reconciliation evidence summary |
| Operation run detail | Full reconciliation record, translated reason, run timing, summary counts, failure details |
| Runtime invariant validation | Queue connection `retry_after`, effective job timeout, covered operation lifecycle policy |

## Migration Notes

- No schema migration is required for the first implementation slice.
- Existing `context` and `failure_summary` structures should be normalized for reconciliation evidence rather than replaced.
- If later observability needs require indexed reconciliation metrics, a follow-up slice can promote reconciliation metadata into first-class columns or projections.
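
As a closing illustration of the runtime invariant validation row in the consumer mapping, the `OperationLifecyclePolicy` validation rules can be sketched as a pure check. This is a Python sketch under assumptions, not the project's implementation: `validate_policy`, the snake_case field names, and the 30-second default safety margin are hypothetical, since the document does not specify a margin value.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OperationLifecyclePolicy:
    """Fields mirror the policy described in this document (snake_case here)."""
    operation_type: str
    covered: bool
    queued_stale_after_seconds: int
    running_stale_after_seconds: int
    expected_max_runtime_seconds: int
    requires_direct_failed_bridge: bool
    supports_reconciliation: bool


def _is_positive_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_policy(policy: OperationLifecyclePolicy,
                    retry_after_seconds: int,
                    safety_margin_seconds: int = 30) -> List[str]:
    """Return a list of violation messages; an empty list means the policy is valid.

    safety_margin_seconds is an assumed default, not a documented value.
    """
    errors: List[str] = []
    if not _is_positive_int(policy.queued_stale_after_seconds):
        errors.append("queuedStaleAfterSeconds must be a positive integer")
    if not _is_positive_int(policy.running_stale_after_seconds):
        errors.append("runningStaleAfterSeconds must be a positive integer")
    if policy.expected_max_runtime_seconds + safety_margin_seconds >= retry_after_seconds:
        errors.append("expectedMaxRuntimeSeconds must stay below retry_after minus the safety margin")
    return errors
```

For example, a policy with `expected_max_runtime_seconds=600` passes against a `retry_after` of 900 seconds, while 880 seconds would be reported because it leaves less than the assumed 30-second margin.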