TenantAtlas/specs/160-operation-lifecycle-guarantees/data-model.md

# Phase 1 Data Model: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation

## Overview

This feature does not require a new database table in the first implementation slice. The primary data-model work is the formalization of existing `OperationRun` persistence plus new derived lifecycle-policy and freshness concepts that make queue truth, domain truth, and operator-visible truth converge deterministically.

## Persistent Domain Entities

### OperationRun

**Purpose**: Canonical workspace-scoped operational record for long-running, queued, scheduled, or otherwise operator-visible work.

**Key fields**:
- `id`
- `workspace_id`
- `tenant_id` nullable
- `user_id` nullable
- `type`
- `status` with canonical values `queued`, `running`, `completed`
- `outcome` with canonical values including `pending`, `succeeded`, `partially_succeeded`, `blocked`, `failed`
- `run_identity_hash`
- `summary_counts` JSONB
- `failure_summary` JSONB array
- `context` JSONB
- `started_at`
- `completed_at`
- `created_at`
- `updated_at`

**Relationships**:
- Belongs to one workspace
- Optionally belongs to one tenant
- Optionally belongs to one initiating user

**Validation rules relevant to this feature**:
- `status` and `outcome` transitions remain service-owned via `OperationRunService`.
- Non-terminal active runs remain constrained by the existing active-run unique index semantics.
- `summary_counts` keys must remain limited to the canonical summary key catalog, and values must remain numeric-only.
- Reconciliation metadata must be stored in a standardized structure inside `context` and `failure_summary` without persisting secrets.

**State transitions relevant to this feature**:
- `queued/pending` → `running/pending`
- `queued/pending` → `completed/failed` when stale queued reconciliation or direct queue-failure bridging resolves an orphaned queued run
- `running/pending` → `completed/succeeded|partially_succeeded|blocked|failed`
- `running/pending` → `completed/failed` when stale running reconciliation resolves an orphaned active run
- `completed/*` is terminal and must never be mutated by reconciliation
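
The transition list above can be read as a small lookup table. The following is a minimal Python sketch for illustration only; it is not the `OperationRunService` implementation, and the names `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical.

```python
from typing import Dict, Set, Tuple

State = Tuple[str, str]  # (status, outcome)

TERMINAL_OUTCOMES = ("succeeded", "partially_succeeded", "blocked", "failed")

# Allowed transitions, transcribed from the list above.
ALLOWED_TRANSITIONS: Dict[State, Set[State]] = {
    ("queued", "pending"): {
        ("running", "pending"),
        # Stale queued reconciliation or direct queue-failure bridging.
        ("completed", "failed"),
    },
    ("running", "pending"): {
        ("completed", outcome) for outcome in TERMINAL_OUTCOMES
    },
}


def can_transition(current: State, target: State) -> bool:
    """Return True only for listed transitions; completed/* has no exits."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

Because `completed/*` never appears as a key, any attempt to move a terminal run is rejected by default rather than by a special case.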

### Failed Job Record (`failed_jobs`)

**Purpose**: Infrastructure-level evidence that a queued job exhausted attempts, timed out, or otherwise failed.

**Key fields used conceptually**:
- UUID or failed-job identifier
- `connection`
- `queue`
- `payload`
- `exception`
- `failed_at`

**Relationships**:
- Not directly related through a foreign key to `OperationRun`
- Linked back to `OperationRun` through job-owned identity resolution or reconciliation evidence

**Validation rules relevant to this feature**:
- A failed-job record is evidence, not operator-facing truth by itself.
- Evidence may inform reconciliation or diagnostics but must not replace the domain transition on `OperationRun`.

### Queue Job Definition

**Purpose**: The queued class that owns or advances a covered `OperationRun`.

**Key lifecycle-relevant properties**:
- `operationRun` reference or `getOperationRun()` contract
- optional `$timeout`
- optional `$failOnTimeout`
- optional `$tries` or `retryUntil()`
- `middleware()` including `TrackOperationRun` and other queue middleware
- optional `failed(Throwable $e)` callback

**Validation rules relevant to this feature**:
- Covered jobs must provide a credible path to terminal truth through direct failure bridging, fallback reconciliation, or both.
- Covered long-running jobs must set timeout behavior intentionally, and that behavior must be compatible with queue timing invariants such as the connection's `retry_after`.

## New Derived Domain Objects

### OperationLifecyclePolicy

**Purpose**: Configuration-backed policy describing which operation types are in scope for V1 and how their lifecycle should be evaluated.

**Fields**:
- `operationType`
- `covered` boolean
- `queuedStaleAfterSeconds`
- `runningStaleAfterSeconds`
- `expectedMaxRuntimeSeconds`
- `requiresDirectFailedBridge` boolean
- `supportsReconciliation` boolean

**Validation rules**:
- `queuedStaleAfterSeconds` and `runningStaleAfterSeconds` must be positive integers.
- `expectedMaxRuntimeSeconds` must stay below the effective queue `retry_after` by a safety margin.
- Only covered operation types participate in the generic reconciler for V1.
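
As an illustration of these rules, the sketch below models the policy as an immutable value object and checks it against a queue `retry_after` value. It is a Python sketch under stated assumptions: field names are snake_case renderings of the fields above, `validate_policy` is a hypothetical helper, and the 30-second default safety margin is an arbitrary placeholder rather than a value taken from this spec.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OperationLifecyclePolicy:
    operation_type: str
    covered: bool
    queued_stale_after_seconds: int
    running_stale_after_seconds: int
    expected_max_runtime_seconds: int
    requires_direct_failed_bridge: bool
    supports_reconciliation: bool


def validate_policy(
    policy: OperationLifecyclePolicy,
    retry_after_seconds: int,
    safety_margin_seconds: int = 30,  # assumption: placeholder margin
) -> List[str]:
    """Return a list of human-readable violations (empty when valid)."""
    errors: List[str] = []
    for name in ("queued_stale_after_seconds", "running_stale_after_seconds"):
        value = getattr(policy, name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            errors.append(f"{name} must be a positive integer")
    limit = retry_after_seconds - safety_margin_seconds
    if policy.expected_max_runtime_seconds >= limit:
        errors.append(
            f"expected_max_runtime_seconds must be below {limit} "
            f"(retry_after {retry_after_seconds} minus margin {safety_margin_seconds})"
        )
    return errors
```

Returning a list of violations instead of raising on the first one lets a runtime invariant check report every misconfigured policy in a single pass.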

### OperationRunFreshnessAssessment

**Purpose**: Derived classification used by reconcilers and Monitoring surfaces to determine whether a non-terminal run is still trustworthy as active.

**Fields**:
- `operationRunId`
- `status`
- `freshnessState` with canonical values `fresh`, `likely_stale`, `terminal`, `unknown`
- `evaluatedAt`
- `thresholdSeconds`
- `evidence` key-value map

**Behavior**:
- For `queued` runs, freshness is typically derived from `created_at`, the absence of `started_at`, and the policy threshold.
- For `running` runs, freshness is derived from `started_at`, the last meaningful update evidence available in persisted state, and the policy threshold.
- `completed` runs always assess as terminal and are excluded from stale reconciliation.
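
The behavior above might be sketched as a pure function over a run's persisted fields. This Python sketch is illustrative only: the run is modeled as a plain dictionary, `assess_freshness` is a hypothetical name, and using `updated_at` as the "last meaningful update evidence" is an assumption, since the spec does not name the field.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def assess_freshness(
    run: Dict[str, Any],
    queued_stale_after_seconds: int,
    running_stale_after_seconds: int,
    now: datetime,
) -> Dict[str, Any]:
    """Classify a run as fresh, likely_stale, terminal, or unknown."""
    status = run["status"]
    started_at: Optional[datetime] = run.get("started_at")
    threshold: Optional[int] = None
    evidence: Dict[str, Any] = {}

    if status == "completed":
        state = "terminal"
    elif status == "queued" and started_at is None:
        threshold = queued_stale_after_seconds
        age = (now - run["created_at"]).total_seconds()
        evidence = {"basis": "created_at", "ageSeconds": age}
        state = "likely_stale" if age > threshold else "fresh"
    elif status == "running" and started_at is not None:
        threshold = running_stale_after_seconds
        # Assumption: updated_at stands in for the last meaningful update.
        last_activity = max(started_at, run.get("updated_at") or started_at)
        age = (now - last_activity).total_seconds()
        evidence = {"basis": "last_activity", "ageSeconds": age}
        state = "likely_stale" if age > threshold else "fresh"
    else:
        # Inconsistent timestamps for the status: do not guess.
        state = "unknown"

    return {
        "operationRunId": run["id"],
        "status": status,
        "freshnessState": state,
        "evaluatedAt": now,
        "thresholdSeconds": threshold,
        "evidence": evidence,
    }
```

Passing `now` in explicitly keeps the assessment deterministic, which makes it straightforward to assert on in regression tests.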

### LifecycleReconciliationRecord

**Purpose**: Structured reconciliation evidence stored in `OperationRun.context` and mirrored in `failure_summary`.

**Fields**:
- `reconciledAt`
- `reconciliationKind` such as `stale_queued`, `stale_running`, `queue_failure_bridge`, `adapter_sync`
- `reasonCode`
- `reasonMessage`
- `evidence` key-value map
- `source` such as `failed_callback`, `scheduled_reconciler`, `adapter_reconciler`

**Validation rules**:
- Must only be added when the feature force-resolves or directly bridges a run.
- Must be idempotent; repeat reconciliation must not append conflicting terminal truth.
- Must be operator-safe and sanitized.
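
A minimal sketch of how these rules could combine is shown below. It is written in Python against an in-memory dictionary purely for illustration; real transitions stay service-owned via `OperationRunService`. The `reconciliation` key inside `context`, the shape of the `failure_summary` entry, and the `SENSITIVE_MARKERS` denylist are all assumptions, not structures defined by this spec.

```python
from typing import Any, Dict

# Hypothetical denylist used to keep secrets out of persisted evidence.
SENSITIVE_MARKERS = ("token", "secret", "password")


def sanitize_evidence(evidence: Dict[str, Any]) -> Dict[str, Any]:
    """Drop evidence entries whose key looks like it carries a secret."""
    return {
        key: value
        for key, value in evidence.items()
        if not any(marker in key.lower() for marker in SENSITIVE_MARKERS)
    }


def apply_reconciliation(run: Dict[str, Any], record: Dict[str, Any]) -> bool:
    """Force-resolve a non-terminal run once; return False if already terminal."""
    if run["status"] == "completed":
        # Idempotency: terminal runs are never mutated by reconciliation.
        return False

    safe_record = {**record, "evidence": sanitize_evidence(record.get("evidence", {}))}
    run["context"] = {**(run.get("context") or {}), "reconciliation": safe_record}
    run["failure_summary"] = [
        *(run.get("failure_summary") or []),
        {"code": record["reasonCode"], "message": record["reasonMessage"]},
    ]
    run["status"], run["outcome"] = "completed", "failed"
    return True
```

Guarding on the terminal status is what makes a repeated reconciler pass a no-op instead of a second, possibly conflicting, failure entry.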

### OperationQueueFailureBridge

**Purpose**: Derived mapping between a queued job failure and the owning `OperationRun`.

**Fields**:
- `operationRunId`
- `jobClass`
- `bridgeSource` such as `failed_callback` or `reconciler`
- `exceptionClass`
- `reasonCode`
- `terminalOutcome`

**Behavior**:
- Exists conceptually as a design contract, not necessarily as a standalone stored table.
- Bridges queue truth into service-owned `OperationRun` terminal transitions.
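
To make the contract concrete, the sketch below builds a bridge value from a job failure. It is a Python illustration only, not the job's `failed()` callback itself. The helper name `bridge_from_failure` is hypothetical, and the rule that timeouts map to `run.infrastructure_timeout_or_abandonment` while everything else maps to `run.queue_failure_bridge` is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationQueueFailureBridge:
    operation_run_id: int
    job_class: str
    bridge_source: str
    exception_class: str
    reason_code: str
    terminal_outcome: str


def bridge_from_failure(
    operation_run_id: int,
    job_class: str,
    exception: BaseException,
    bridge_source: str = "failed_callback",
) -> OperationQueueFailureBridge:
    """Map a queue-level failure onto the terminal transition for its run."""
    # Assumption: timeouts get the infrastructure reason code.
    if isinstance(exception, TimeoutError):
        reason_code = "run.infrastructure_timeout_or_abandonment"
    else:
        reason_code = "run.queue_failure_bridge"

    return OperationQueueFailureBridge(
        operation_run_id=operation_run_id,
        job_class=job_class,
        bridge_source=bridge_source,
        exception_class=type(exception).__name__,
        reason_code=reason_code,
        terminal_outcome="failed",
    )
```

Only the exception class name is carried across the bridge in this sketch, which keeps raw exception messages out of operator-facing state.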

## Supporting Catalogs

### Reconciliation Reason Codes

**Purpose**: Stable reason-code catalog for lifecycle healing.

**Initial values**:
- `run.stale_queued`
- `run.stale_running`
- `run.infrastructure_timeout_or_abandonment`
- `run.queue_failure_bridge`
- `run.adapter_out_of_sync`

**Validation rules**:
- Operator-facing text must be derived from centralized presenters or reason translation helpers.
- Codes must remain stable so that regression assertions and audit review can rely on them.

### Monitoring Freshness State

**Purpose**: Derived presentation state for Operations surfaces.

**Initial values**:
- `fresh_active`
- `likely_stale`
- `reconciled_failed`
- `terminal_normal`

**Behavior**:
- Not stored as a new top-level database enum in V1.
- Derived centrally so tables, detail pages, and notifications do not drift.
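
The derivation could be a single function shared by every surface, as in the Python sketch below. The mapping rules are assumptions made for illustration: a completed, failed run with a reconciliation record is shown as `reconciled_failed`, any other completed run as `terminal_normal`, and an active run with an `unknown` assessment is treated as `fresh_active`. The function name is hypothetical.

```python
def monitoring_freshness_state(
    status: str,
    outcome: str,
    freshness_state: str,
    has_reconciliation_record: bool,
) -> str:
    """Derive the presentation state for Operations surfaces."""
    if status == "completed":
        if outcome == "failed" and has_reconciliation_record:
            return "reconciled_failed"
        return "terminal_normal"
    if freshness_state == "likely_stale":
        return "likely_stale"
    # Assumption: fresh and unknown assessments both display as active.
    return "fresh_active"
```

Because tables, detail pages, and notifications would all call the same function, a change to the mapping is applied everywhere at once.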

## Consumer Mapping

| Consumer | Primary data it needs |
|---|---|
| Generic lifecycle reconciler | Covered operation policy, active non-terminal runs, freshness assessment, standardized reconciliation transition |
| Covered queued jobs | Owning `OperationRun`, timeout behavior, direct `failed()` bridge path |
| Operations index | Current status, outcome, freshness assessment, reconciliation evidence summary |
| Operation run detail | Full reconciliation record, translated reason, run timing, summary counts, failure details |
| Runtime invariant validation | Queue connection `retry_after`, effective job timeout, covered operation lifecycle policy |

## Migration Notes

- No schema migration is required for the first implementation slice.
- Existing `context` and `failure_summary` structures should be normalized for reconciliation evidence rather than replaced.
- If later observability needs require indexed reconciliation metrics, a follow-up slice can promote reconciliation metadata into first-class columns or projections.