# Phase 1 Data Model: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation

## Overview

This feature does not require a new database table in the first implementation slice. The primary data-model work is the formalization of existing `OperationRun` persistence plus new derived lifecycle-policy and freshness concepts that make queue truth, domain truth, and operator-visible truth converge deterministically.

## Persistent Domain Entities

### OperationRun

**Purpose**: Canonical workspace-scoped operational record for long-running, queued, scheduled, or otherwise operator-visible work.

**Key fields**:
- `id`
- `workspace_id`
- `tenant_id` nullable
- `user_id` nullable
- `type`
- `status` with canonical values `queued`, `running`, `completed`
- `outcome` with canonical values including `pending`, `succeeded`, `partially_succeeded`, `blocked`, `failed`
- `run_identity_hash`
- `summary_counts` JSONB
- `failure_summary` JSONB array
- `context` JSONB
- `started_at`
- `completed_at`
- `created_at`
- `updated_at`

**Relationships**:
- Belongs to one workspace
- Optionally belongs to one tenant
- Optionally belongs to one initiating user

**Validation rules relevant to this feature**:
- `status` and `outcome` transitions remain service-owned via `OperationRunService`.
- Non-terminal active runs remain constrained by the existing active-run unique index semantics.
- `summary_counts` keys remain from the canonical summary key catalog and values remain numeric-only.
- Reconciliation metadata must be stored in a standardized structure inside `context` and `failure_summary` without persisting secrets.

**State transitions relevant to this feature**:
- `queued/pending` → `running/pending`
- `queued/pending` → `completed/failed` when stale queued reconciliation or direct queue-failure bridging resolves an orphaned queued run
- `running/pending` → `completed/succeeded|partially_succeeded|blocked|failed`
- `running/pending` → `completed/failed` when stale running reconciliation resolves an orphaned active run
- `completed/*` is terminal and must never be mutated by reconciliation

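The state transitions listed above can be read as a small guard over `status/outcome` pairs. The following is a minimal sketch, assuming PHP 8.1 or later and no framework classes. `OperationRunTransitionGuard` and `isAllowed()` are hypothetical names used for illustration; in the real system the transition logic stays inside `OperationRunService`.

```php
<?php

declare(strict_types=1);

// Hypothetical helper for illustration; not part of the existing codebase.
final class OperationRunTransitionGuard
{
    // Terminal outcomes named in the transition list above.
    private const TERMINAL_OUTCOMES = ['succeeded', 'partially_succeeded', 'blocked', 'failed'];

    /**
     * Returns true when moving from ($fromStatus, $fromOutcome) to
     * ($toStatus, $toOutcome) matches one of the listed transitions.
     */
    public static function isAllowed(
        string $fromStatus,
        string $fromOutcome,
        string $toStatus,
        string $toOutcome,
    ): bool {
        // completed/* is terminal: nothing may leave it, including reconciliation.
        if ($fromStatus === 'completed') {
            return false;
        }

        // Every non-terminal state in the list carries the pending outcome.
        if ($fromOutcome !== 'pending') {
            return false;
        }

        if ($fromStatus === 'queued') {
            // queued/pending may start running, or be force-failed when orphaned.
            return ($toStatus === 'running' && $toOutcome === 'pending')
                || ($toStatus === 'completed' && $toOutcome === 'failed');
        }

        if ($fromStatus === 'running') {
            // running/pending may only complete, with one of the terminal outcomes.
            return $toStatus === 'completed'
                && in_array($toOutcome, self::TERMINAL_OUTCOMES, true);
        }

        return false;
    }
}
```

A reconciler built on this guard would check `isAllowed()` before force-failing a run, so a run that completed in the meantime is left untouched.
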
### Failed Job Record (`failed_jobs`)

**Purpose**: Infrastructure-level evidence that a queued job exhausted attempts, timed out, or otherwise failed.

**Key fields used conceptually**:
- UUID or failed-job identifier
- connection
- queue
- payload
- exception
- failed_at

**Relationships**:
- Not directly related through a foreign key to `OperationRun`
- Linked back to `OperationRun` through job-owned identity resolution or reconciliation evidence

**Validation rules relevant to this feature**:
- A failed-job record is evidence, not operator-facing truth by itself.
- Evidence may inform reconciliation or diagnostics but must not replace the domain transition on `OperationRun`.

### Queue Job Definition

**Purpose**: The queued class that owns or advances a covered `OperationRun`.

**Key lifecycle-relevant properties**:
- `operationRun` reference or `getOperationRun()` contract
- optional `$timeout`
- optional `$failOnTimeout`
- optional `$tries` or `retryUntil()`
- `middleware()` including `TrackOperationRun` and other queue middleware
- optional `failed(Throwable $e)` callback

**Validation rules relevant to this feature**:
- Covered jobs must provide a credible path to terminal truth through direct failure bridging, fallback reconciliation, or both.
- Covered long-running jobs must have intentional timeout behavior and must be compatible with queue timing invariants.

## New Derived Domain Objects

### OperationLifecyclePolicy

**Purpose**: Configuration-backed policy describing which operation types are in scope for V1 and how their lifecycle should be evaluated.

**Fields**:
- `operationType`
- `covered` boolean
- `queuedStaleAfterSeconds`
- `runningStaleAfterSeconds`
- `expectedMaxRuntimeSeconds`
- `requiresDirectFailedBridge` boolean
- `supportsReconciliation` boolean

**Validation rules**:
- `queuedStaleAfterSeconds` and `runningStaleAfterSeconds` must be positive integers.
- `expectedMaxRuntimeSeconds` must stay below effective queue `retry_after` with safety margin.
- Only covered operation types participate in the generic reconciler for V1.

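The policy fields and validation rules above could be carried by a small immutable value object. This is a sketch under stated assumptions, not the project's implementation: it assumes PHP 8.1 or later, the method names `fitsWithinRetryAfter()` and `participatesInGenericReconciler()` are hypothetical, the 30-second default margin is an illustrative placeholder (the source only asks for a safety margin), and combining `covered` with `supportsReconciliation` is one possible reading of the third rule.

```php
<?php

declare(strict_types=1);

final class OperationLifecyclePolicy
{
    public function __construct(
        public readonly string $operationType,
        public readonly bool $covered,
        public readonly int $queuedStaleAfterSeconds,
        public readonly int $runningStaleAfterSeconds,
        public readonly int $expectedMaxRuntimeSeconds,
        public readonly bool $requiresDirectFailedBridge,
        public readonly bool $supportsReconciliation,
    ) {
        // Rule 1: both stale thresholds must be positive integers.
        if ($queuedStaleAfterSeconds <= 0 || $runningStaleAfterSeconds <= 0) {
            throw new InvalidArgumentException('Stale thresholds must be positive integers.');
        }
    }

    /**
     * Rule 2: expected runtime plus a margin must stay below the queue's retry_after.
     * The default margin is an assumed placeholder value.
     */
    public function fitsWithinRetryAfter(int $retryAfterSeconds, int $safetyMarginSeconds = 30): bool
    {
        return $this->expectedMaxRuntimeSeconds + $safetyMarginSeconds < $retryAfterSeconds;
    }

    /**
     * Rule 3: only covered types take part in the generic reconciler.
     * Requiring supportsReconciliation as well is an assumption of this sketch.
     */
    public function participatesInGenericReconciler(): bool
    {
        return $this->covered && $this->supportsReconciliation;
    }
}

// Usage with a hypothetical operation type name.
$policy = new OperationLifecyclePolicy(
    operationType: 'example.sync',
    covered: true,
    queuedStaleAfterSeconds: 300,
    runningStaleAfterSeconds: 900,
    expectedMaxRuntimeSeconds: 300,
    requiresDirectFailedBridge: true,
    supportsReconciliation: true,
);
```

A runtime invariant check could call `fitsWithinRetryAfter()` with the `retry_after` value of the connection the job runs on and fail fast when a policy violates it.
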
### OperationRunFreshnessAssessment

**Purpose**: Derived classification used by reconcilers and Monitoring surfaces to determine whether a non-terminal run is still trustworthy as active.

**Fields**:
- `operationRunId`
- `status`
- `freshnessState` with canonical values `fresh`, `likely_stale`, `terminal`, `unknown`
- `evaluatedAt`
- `thresholdSeconds`
- `evidence` key-value map

**Behavior**:
- For `queued` runs, freshness is typically derived from `created_at`, absence of `started_at`, and policy threshold.
- For `running` runs, freshness is derived from `started_at`, last meaningful update evidence available in persisted state, and policy threshold.
- `completed` runs always assess as terminal and are excluded from stale reconciliation.

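The freshness behavior above can be illustrated with a single pure function. This is a minimal sketch, assuming PHP 8.1 or later: `assessFreshness()` is a hypothetical name, the run is passed as a plain array rather than a model, timestamps are ISO 8601 strings with an explicit offset, and `updated_at` is used as a stand-in for the "last meaningful update evidence" the source leaves open.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical assessor; $run mirrors a few OperationRun columns as a plain array.
 * Returns an array shaped like the OperationRunFreshnessAssessment fields.
 */
function assessFreshness(
    array $run,
    int $queuedStaleAfterSeconds,
    int $runningStaleAfterSeconds,
    DateTimeImmutable $now,
): array {
    // Converts a nullable timestamp string into nullable Unix seconds.
    $toTimestamp = static fn (?string $value): ?int =>
        $value === null ? null : (new DateTimeImmutable($value))->getTimestamp();

    $createdAt = $toTimestamp($run['created_at'] ?? null);
    $startedAt = $toTimestamp($run['started_at'] ?? null);
    $updatedAt = $toTimestamp($run['updated_at'] ?? null);

    $state = 'unknown';
    $threshold = null;
    $evidence = [];

    if ($run['status'] === 'completed') {
        // Completed runs are terminal and excluded from stale reconciliation.
        $state = 'terminal';
    } elseif ($run['status'] === 'queued' && $startedAt === null && $createdAt !== null) {
        // Queued: age since creation, given that the run never started.
        $threshold = $queuedStaleAfterSeconds;
        $age = $now->getTimestamp() - $createdAt;
        $evidence = ['basis' => 'created_at', 'ageSeconds' => $age];
        $state = $age > $threshold ? 'likely_stale' : 'fresh';
    } elseif ($run['status'] === 'running' && $startedAt !== null) {
        // Running: age since the latest of started_at and updated_at (assumed evidence).
        $threshold = $runningStaleAfterSeconds;
        $lastActivity = max($startedAt, $updatedAt ?? $startedAt);
        $age = $now->getTimestamp() - $lastActivity;
        $evidence = ['basis' => 'last_activity', 'ageSeconds' => $age];
        $state = $age > $threshold ? 'likely_stale' : 'fresh';
    }

    return [
        'operationRunId' => $run['id'],
        'status' => $run['status'],
        'freshnessState' => $state,
        'evaluatedAt' => $now->format(DATE_ATOM),
        'thresholdSeconds' => $threshold,
        'evidence' => $evidence,
    ];
}
```

Inconsistent rows, such as a `queued` run that already has `started_at`, fall through to `unknown` in this sketch rather than being guessed at.
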
### LifecycleReconciliationRecord

**Purpose**: Structured reconciliation evidence stored in `OperationRun.context` and mirrored in `failure_summary`.

**Fields**:
- `reconciledAt`
- `reconciliationKind` such as `stale_queued`, `stale_running`, `queue_failure_bridge`, `adapter_sync`
- `reasonCode`
- `reasonMessage`
- `evidence` key-value map
- `source` such as `failed_callback`, `scheduled_reconciler`, `adapter_reconciler`

**Validation rules**:
- Must only be added when the feature force-resolves or directly bridges a run.
- Must be idempotent; repeat reconciliation must not append conflicting terminal truth.
- Must be operator-safe and sanitized.

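The idempotency rule above can be made concrete with a helper that attaches a record at most once. This is a sketch under assumptions, written for PHP 8.1 or later: `attachReconciliation()` is a hypothetical name, the `reconciliation` key inside `context` and the `code`/`message` shape of the `failure_summary` entry are illustrative choices that the source does not specify.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: attaches a reconciliation record to a run's context
 * and mirrors it into failure_summary. Returns [context, failureSummary].
 *
 * @param array<string,mixed>       $context
 * @param list<array<string,mixed>> $failureSummary
 * @param array<string,mixed>       $record
 */
function attachReconciliation(array $context, array $failureSummary, array $record): array
{
    // Idempotency: the first record wins, so a repeat reconciliation is a no-op.
    if (isset($context['reconciliation'])) {
        return [$context, $failureSummary];
    }

    // Keep only the documented top-level fields. Sanitizing the values inside
    // `evidence` is left to the caller in this sketch.
    $allowed = ['reconciledAt', 'reconciliationKind', 'reasonCode', 'reasonMessage', 'evidence', 'source'];
    $clean = array_intersect_key($record, array_flip($allowed));

    $context['reconciliation'] = $clean; // assumed context key
    $failureSummary[] = [                 // assumed failure_summary entry shape
        'code' => $clean['reasonCode'] ?? null,
        'message' => $clean['reasonMessage'] ?? null,
    ];

    return [$context, $failureSummary];
}
```

Because the second call returns its inputs unchanged, a scheduled reconciler and a `failed()` callback racing on the same run cannot leave two conflicting terminal explanations behind, provided the write itself is serialized by the owning service.
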
### OperationQueueFailureBridge

**Purpose**: Derived mapping between a queued job failure and the owning `OperationRun`.

**Fields**:
- `operationRunId`
- `jobClass`
- `bridgeSource` such as `failed_callback` or `reconciler`
- `exceptionClass`
- `reasonCode`
- `terminalOutcome`

**Behavior**:
- Exists conceptually as a design contract, not necessarily as a standalone stored table.
- Bridges queue truth into service-owned `OperationRun` terminal transitions.

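Since the bridge is a design contract rather than a table, one way to picture it is as an in-memory value object built inside a job's `failed(Throwable $e)` callback. The sketch below assumes PHP 8.1 or later and an integer run identifier; the factory name `fromFailedCallback()` is hypothetical, and always choosing `run.queue_failure_bridge` with the `failed` outcome is a simplification.

```php
<?php

declare(strict_types=1);

final class OperationQueueFailureBridge
{
    public function __construct(
        public readonly int $operationRunId, // assumed integer identifier
        public readonly string $jobClass,
        public readonly string $bridgeSource,
        public readonly string $exceptionClass,
        public readonly string $reasonCode,
        public readonly string $terminalOutcome,
    ) {
    }

    /**
     * Hypothetical factory for the direct path: called from a job's
     * failed(Throwable $e) callback with the owning run's id.
     */
    public static function fromFailedCallback(int $operationRunId, string $jobClass, Throwable $e): self
    {
        return new self(
            operationRunId: $operationRunId,
            jobClass: $jobClass,
            bridgeSource: 'failed_callback',
            exceptionClass: $e::class,
            reasonCode: 'run.queue_failure_bridge',
            terminalOutcome: 'failed',
        );
    }
}

// Usage with a hypothetical job class name.
$bridge = OperationQueueFailureBridge::fromFailedCallback(
    42,
    'App\\Jobs\\ExampleCoveredJob',
    new RuntimeException('Worker timed out'),
);
```

The resulting object would then be handed to `OperationRunService`, which remains the only place that performs the terminal transition.
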
## Supporting Catalogs

### Reconciliation Reason Codes

**Purpose**: Stable reason-code catalog for lifecycle healing.

**Initial values**:
- `run.stale_queued`
- `run.stale_running`
- `run.infrastructure_timeout_or_abandonment`
- `run.queue_failure_bridge`
- `run.adapter_out_of_sync`

**Validation rules**:
- Operator-facing text must be derived from centralized presenters or reason translation helpers.
- Codes remain stable enough for regression assertions and audit review.

### Monitoring Freshness State

**Purpose**: Derived presentation state for Operations surfaces.

**Initial values**:
- `fresh_active`
- `likely_stale`
- `reconciled_failed`
- `terminal_normal`

**Behavior**:
- Not stored as a new top-level database enum in V1.
- Derived centrally so tables, detail pages, and notifications do not drift.

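Deriving the presentation state centrally could be as small as one function shared by every Operations surface. The following is a sketch, assuming PHP 8.1 or later; `monitoringFreshnessState()` is a hypothetical name, and two mappings are assumptions the source does not settle: every reconciled run is treated as `completed/failed`, and an `unknown` freshness assessment is shown as `fresh_active`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical central derivation of the Monitoring freshness state.
 *
 * @param string $status         OperationRun status: queued, running, or completed
 * @param string $freshnessState freshnessState from the freshness assessment
 * @param bool   $wasReconciled  whether a reconciliation record is attached to the run
 */
function monitoringFreshnessState(string $status, string $freshnessState, bool $wasReconciled): string
{
    if ($status === 'completed') {
        // Assumption: reconciled runs always ended as completed/failed.
        return $wasReconciled ? 'reconciled_failed' : 'terminal_normal';
    }

    // Assumption: anything not flagged as likely stale is presented as active.
    return $freshnessState === 'likely_stale' ? 'likely_stale' : 'fresh_active';
}
```

Tables, detail pages, and notifications would all call the same function instead of re-deriving the state from raw columns.
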
## Consumer Mapping

| Consumer | Primary data it needs |
|---|---|
| Generic lifecycle reconciler | Covered operation policy, active non-terminal runs, freshness assessment, standardized reconciliation transition |
| Covered queued jobs | Owning `OperationRun`, timeout behavior, direct `failed()` bridge path |
| Operations index | Current status, outcome, freshness assessment, reconciliation evidence summary |
| Operation run detail | Full reconciliation record, translated reason, run timing, summary counts, failure details |
| Runtime invariant validation | Queue connection `retry_after`, effective job timeout, covered operation lifecycle policy |

## Migration Notes

- No schema migration is required for the first implementation slice.
- Existing `context` and `failure_summary` structures should be normalized for reconciliation evidence rather than replaced.
- If later observability needs require indexed reconciliation metrics, a follow-up slice can promote reconciliation metadata into first-class columns or projections.