TenantAtlas/specs/160-operation-lifecycle-guarantees/data-model.md
2026-03-23 22:52:37 +01:00

# Phase 1 Data Model: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation
## Overview
This feature does not require a new database table in the first implementation slice. The primary data-model work is the formalization of existing `OperationRun` persistence plus new derived lifecycle-policy and freshness concepts that make queue truth, domain truth, and operator-visible truth converge deterministically.
## Persistent Domain Entities
### OperationRun
**Purpose**: Canonical workspace-scoped operational record for long-running, queued, scheduled, or otherwise operator-visible work.
**Key fields**:
- `id`
- `workspace_id`
- `tenant_id` nullable
- `user_id` nullable
- `type`
- `status` with canonical values `queued`, `running`, `completed`
- `outcome` with canonical values including `pending`, `succeeded`, `partially_succeeded`, `blocked`, `failed`
- `run_identity_hash`
- `summary_counts` JSONB
- `failure_summary` JSONB array
- `context` JSONB
- `started_at`
- `completed_at`
- `created_at`
- `updated_at`
**Relationships**:
- Belongs to one workspace
- Optionally belongs to one tenant
- Optionally belongs to one initiating user
**Validation rules relevant to this feature**:
- `status` and `outcome` transitions remain service-owned via `OperationRunService`.
- Non-terminal active runs remain constrained by the existing active-run unique index semantics.
- `summary_counts` keys must come from the canonical summary key catalog, and values must be numeric.
- Reconciliation metadata must be stored in a standardized structure inside `context` and `failure_summary` without persisting secrets.
**State transitions relevant to this feature**:
- `queued/pending` → `running/pending`
- `queued/pending` → `completed/failed` when stale queued reconciliation or direct queue-failure bridging resolves an orphaned queued run
- `running/pending` → `completed/succeeded|partially_succeeded|blocked|failed`
- `running/pending` → `completed/failed` when stale running reconciliation resolves an orphaned active run
- `completed/*` is terminal and must never be mutated by reconciliation
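
The transition rules above can be sketched as a small guard function. This is an illustrative Python sketch, not the project's `OperationRunService`. The function name `is_allowed_transition` is hypothetical, and the set of terminal outcomes is limited to the values this document lists (the source says "including", so the real catalog may be larger).

```python
# Terminal outcomes listed in this document; the real catalog may contain more.
TERMINAL_OUTCOMES = {"succeeded", "partially_succeeded", "blocked", "failed"}


def is_allowed_transition(
    from_status: str, from_outcome: str, to_status: str, to_outcome: str
) -> bool:
    """Return True if a run may move from one (status, outcome) pair to another."""
    # completed/* is terminal and must never be mutated, including by reconciliation.
    if from_status == "completed":
        return False
    # Every listed non-terminal state carries the outcome `pending`.
    if from_outcome != "pending":
        return False
    if from_status == "queued":
        # Normal start, or a forced failure from stale-queued reconciliation
        # or direct queue-failure bridging.
        return (to_status, to_outcome) in {
            ("running", "pending"),
            ("completed", "failed"),
        }
    if from_status == "running":
        # Normal completion with any terminal outcome, or a forced failure
        # from stale-running reconciliation.
        return to_status == "completed" and to_outcome in TERMINAL_OUTCOMES
    return False
```

Keeping the guard as a pure function makes the "terminal is never mutated" rule easy to assert in regression tests.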
### Failed Job Record (`failed_jobs`)
**Purpose**: Infrastructure-level evidence that a queued job exhausted attempts, timed out, or otherwise failed.
**Key fields used conceptually**:
- UUID or failed-job identifier
- connection
- queue
- payload
- exception
- failed_at
**Relationships**:
- Not directly related through a foreign key to `OperationRun`
- Linked back to `OperationRun` through job-owned identity resolution or reconciliation evidence
**Validation rules relevant to this feature**:
- A failed-job record is evidence, not operator-facing truth by itself.
- Evidence may inform reconciliation or diagnostics but must not replace the domain transition on `OperationRun`.
### Queue Job Definition
**Purpose**: The queued class that owns or advances a covered `OperationRun`.
**Key lifecycle-relevant properties**:
- `operationRun` reference or `getOperationRun()` contract
- optional `$timeout`
- optional `$failOnTimeout`
- optional `$tries` or `retryUntil()`
- `middleware()` including `TrackOperationRun` and other queue middleware
- optional `failed(Throwable $e)` callback
**Validation rules relevant to this feature**:
- Covered jobs must provide a credible path to terminal truth through direct failure bridging, fallback reconciliation, or both.
- Covered long-running jobs must have intentional timeout behavior and must be compatible with queue timing invariants.
## New Derived Domain Objects
### OperationLifecyclePolicy
**Purpose**: Configuration-backed policy describing which operation types are in scope for V1 and how their lifecycle should be evaluated.
**Fields**:
- `operationType`
- `covered` boolean
- `queuedStaleAfterSeconds`
- `runningStaleAfterSeconds`
- `expectedMaxRuntimeSeconds`
- `requiresDirectFailedBridge` boolean
- `supportsReconciliation` boolean
**Validation rules**:
- `queuedStaleAfterSeconds` and `runningStaleAfterSeconds` must be positive integers.
- `expectedMaxRuntimeSeconds` must stay below the effective queue `retry_after` by a safety margin.
- Only covered operation types participate in the generic reconciler for V1.
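
A minimal Python sketch of the policy object and its validation rules follows. The field names mirror the list above in snake case. The helper `policy_violations` and the default 30-second safety margin are assumptions for illustration only, since this document does not fix a margin value.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OperationLifecyclePolicy:
    operation_type: str
    covered: bool
    queued_stale_after_seconds: int
    running_stale_after_seconds: int
    expected_max_runtime_seconds: int
    requires_direct_failed_bridge: bool
    supports_reconciliation: bool


def policy_violations(
    policy: OperationLifecyclePolicy,
    retry_after_seconds: int,
    safety_margin_seconds: int = 30,  # assumed value; not specified by this document
) -> List[str]:
    """Return a list of human-readable rule violations (empty when valid)."""
    violations: List[str] = []
    if policy.queued_stale_after_seconds <= 0:
        violations.append("queuedStaleAfterSeconds must be a positive integer")
    if policy.running_stale_after_seconds <= 0:
        violations.append("runningStaleAfterSeconds must be a positive integer")
    # Expected runtime plus the margin must still fit under retry_after.
    if policy.expected_max_runtime_seconds + safety_margin_seconds >= retry_after_seconds:
        violations.append(
            "expectedMaxRuntimeSeconds must stay below retry_after minus the safety margin"
        )
    return violations
```

Returning all violations at once, rather than raising on the first, suits a runtime invariant check that reports every misconfigured operation type in one pass.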
### OperationRunFreshnessAssessment
**Purpose**: Derived classification used by reconcilers and Monitoring surfaces to determine whether a non-terminal run is still trustworthy as active.
**Fields**:
- `operationRunId`
- `status`
- `freshnessState` with canonical values `fresh`, `likely_stale`, `terminal`, `unknown`
- `evaluatedAt`
- `thresholdSeconds`
- `evidence` key-value map
**Behavior**:
- For `queued` runs, freshness is typically derived from `created_at`, absence of `started_at`, and policy threshold.
- For `running` runs, freshness is derived from `started_at`, last meaningful update evidence available in persisted state, and policy threshold.
- `completed` runs always assess as terminal and are excluded from stale reconciliation.
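
The behavior above could be derived roughly as follows. This is a hedged Python sketch: `assess_freshness` is a hypothetical name, and treating `updated_at` as the "last meaningful update evidence" for running runs is an assumption, since this document does not say which persisted field carries that evidence.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass(frozen=True)
class FreshnessAssessment:
    operation_run_id: int
    status: str
    freshness_state: str  # fresh | likely_stale | terminal | unknown
    evaluated_at: datetime
    threshold_seconds: Optional[int] = None
    evidence: Dict[str, str] = field(default_factory=dict)


def assess_freshness(
    run_id: int,
    status: str,
    created_at: datetime,
    started_at: Optional[datetime],
    updated_at: Optional[datetime],
    queued_stale_after: int,
    running_stale_after: int,
    now: datetime,
) -> FreshnessAssessment:
    # Completed runs are always terminal and never considered for reconciliation.
    if status == "completed":
        return FreshnessAssessment(run_id, status, "terminal", now)
    if status == "queued" and started_at is None:
        threshold, reference, basis = queued_stale_after, created_at, "created_at"
    elif status == "running" and started_at is not None:
        threshold = running_stale_after
        # Assumption: a later updated_at counts as the last meaningful update.
        if updated_at is not None and updated_at > started_at:
            reference, basis = updated_at, "updated_at"
        else:
            reference, basis = started_at, "started_at"
    else:
        # Inconsistent combinations (e.g. queued with started_at set) stay unknown.
        return FreshnessAssessment(run_id, status, "unknown", now)
    age_seconds = int((now - reference).total_seconds())
    state = "likely_stale" if age_seconds > threshold else "fresh"
    evidence = {"basis": basis, "age_seconds": str(age_seconds)}
    return FreshnessAssessment(run_id, status, state, now, threshold, evidence)
```

Passing `now` explicitly keeps the assessment deterministic, which matters when the same classification feeds both the reconciler and Monitoring surfaces.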
### LifecycleReconciliationRecord
**Purpose**: Structured reconciliation evidence stored in `OperationRun.context` and mirrored in `failure_summary`.
**Fields**:
- `reconciledAt`
- `reconciliationKind` such as `stale_queued`, `stale_running`, `queue_failure_bridge`, `adapter_sync`
- `reasonCode`
- `reasonMessage`
- `evidence` key-value map
- `source` such as `failed_callback`, `scheduled_reconciler`, `adapter_reconciler`
**Validation rules**:
- Must only be added when the feature force-resolves or directly bridges a run.
- Must be idempotent; repeat reconciliation must not append conflicting terminal truth.
- Must be operator-safe and sanitized.
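
The idempotency rule can be illustrated with a small sketch. It models a run as a plain dictionary purely for demonstration; in the real system the transition would go through `OperationRunService`. The `context["reconciliation"]` key and the shape of the `failure_summary` entry are assumptions, not the project's confirmed structure.

```python
import copy
from typing import Any, Dict


def apply_reconciliation(run: Dict[str, Any], record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a reconciled copy of `run`, or `run` itself if already terminal."""
    # Terminal runs are never mutated, so a repeated reconciliation is a no-op.
    if run["status"] == "completed":
        return run
    updated = copy.deepcopy(run)
    updated["status"] = "completed"
    updated["outcome"] = "failed"
    # Assumed storage location for the full record inside `context`.
    updated.setdefault("context", {})["reconciliation"] = dict(record)
    # Assumed mirror entry in `failure_summary`; only sanitized fields are copied.
    updated.setdefault("failure_summary", []).append(
        {"code": record["reasonCode"], "message": record["reasonMessage"]}
    )
    return updated
```

Applying the same record twice returns the already-terminal run unchanged, so no second `failure_summary` entry is appended.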
### OperationQueueFailureBridge
**Purpose**: Derived mapping between a queued job failure and the owning `OperationRun`.
**Fields**:
- `operationRunId`
- `jobClass`
- `bridgeSource` such as `failed_callback` or `reconciler`
- `exceptionClass`
- `reasonCode`
- `terminalOutcome`
**Behavior**:
- Exists conceptually as a design contract, not necessarily as a standalone stored table.
- Bridges queue truth into service-owned `OperationRun` terminal transitions.
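
As a rough illustration of the contract, the sketch below builds a bridge value from a job failure. The function `bridge_failure` is hypothetical, and mapping timeout exceptions to `run.infrastructure_timeout_or_abandonment` while mapping everything else to `run.queue_failure_bridge` is an assumed policy, not one this document prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueFailureBridge:
    operation_run_id: int
    job_class: str
    bridge_source: str  # failed_callback | reconciler
    exception_class: str
    reason_code: str
    terminal_outcome: str


def bridge_failure(
    operation_run_id: int,
    job_class: str,
    exception: BaseException,
    bridge_source: str = "failed_callback",
) -> QueueFailureBridge:
    """Translate a queue-level failure into a domain-level terminal description."""
    # Assumed mapping: timeouts get the infrastructure reason, others the generic one.
    if isinstance(exception, TimeoutError):
        reason_code = "run.infrastructure_timeout_or_abandonment"
    else:
        reason_code = "run.queue_failure_bridge"
    return QueueFailureBridge(
        operation_run_id=operation_run_id,
        job_class=job_class,
        bridge_source=bridge_source,
        exception_class=type(exception).__name__,
        reason_code=reason_code,
        terminal_outcome="failed",
    )
```

The bridge only describes the intended terminal transition; applying it remains the responsibility of the service that owns `OperationRun` state.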
## Supporting Catalogs
### Reconciliation Reason Codes
**Purpose**: Stable reason-code catalog for lifecycle healing.
**Initial values**:
- `run.stale_queued`
- `run.stale_running`
- `run.infrastructure_timeout_or_abandonment`
- `run.queue_failure_bridge`
- `run.adapter_out_of_sync`
**Validation rules**:
- Operator-facing text must be derived from centralized presenters or reason translation helpers.
- Codes remain stable enough for regression assertions and audit review.
### Monitoring Freshness State
**Purpose**: Derived presentation state for Operations surfaces.
**Initial values**:
- `fresh_active`
- `likely_stale`
- `reconciled_failed`
- `terminal_normal`
**Behavior**:
- Not stored as a new top-level database enum in V1.
- Derived centrally so tables, detail pages, and notifications do not drift.
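
A central derivation might look like the following sketch. The function name `monitoring_state` and the `was_reconciled` flag are hypothetical. Treating an `unknown` freshness assessment as `fresh_active` is an assumption made only to keep the example total; the real presenter may handle that case differently.

```python
def monitoring_state(
    status: str, outcome: str, freshness_state: str, was_reconciled: bool
) -> str:
    """Derive the presentation state shown on Operations surfaces."""
    if status == "completed":
        # A run that reconciliation forced to failed is shown distinctly
        # from a run that ended through its normal path.
        if was_reconciled and outcome == "failed":
            return "reconciled_failed"
        return "terminal_normal"
    if freshness_state == "likely_stale":
        return "likely_stale"
    # Assumption: anything else non-terminal is presented as active.
    return "fresh_active"
```

Because tables, detail pages, and notifications would all call the same function, they cannot disagree about how a given run is presented.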
## Consumer Mapping
| Consumer | Primary data it needs |
|---|---|
| Generic lifecycle reconciler | Covered operation policy, active non-terminal runs, freshness assessment, standardized reconciliation transition |
| Covered queued jobs | Owning `OperationRun`, timeout behavior, direct `failed()` bridge path |
| Operations index | Current status, outcome, freshness assessment, reconciliation evidence summary |
| Operation run detail | Full reconciliation record, translated reason, run timing, summary counts, failure details |
| Runtime invariant validation | Queue connection `retry_after`, effective job timeout, covered operation lifecycle policy |
## Migration Notes
- No schema migration is required for the first implementation slice.
- Existing `context` and `failure_summary` structures should be normalized for reconciliation evidence rather than replaced.
- If later observability needs require indexed reconciliation metrics, a follow-up slice can promote reconciliation metadata into first-class columns or projections.