TenantAtlas/specs/160-operation-lifecycle-guarantees/data-model.md
2026-03-23 22:52:37 +01:00

Phase 1 Data Model: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation

Overview

This feature does not require a new database table in the first implementation slice. The primary data-model work is the formalization of existing OperationRun persistence plus new derived lifecycle-policy and freshness concepts that make queue truth, domain truth, and operator-visible truth converge deterministically.

Persistent Domain Entities

OperationRun

Purpose: Canonical workspace-scoped operational record for long-running, queued, scheduled, or otherwise operator-visible work.

Key fields:

  • id
  • workspace_id
  • tenant_id nullable
  • user_id nullable
  • type
  • status with canonical values queued, running, completed
  • outcome with canonical values including pending, succeeded, partially_succeeded, blocked, failed
  • run_identity_hash
  • summary_counts JSONB
  • failure_summary JSONB array
  • context JSONB
  • started_at
  • completed_at
  • created_at
  • updated_at

Relationships:

  • Belongs to one workspace
  • Optionally belongs to one tenant
  • Optionally belongs to one initiating user

Validation rules relevant to this feature:

  • status and outcome transitions remain service-owned via OperationRunService.
  • Non-terminal active runs remain constrained by the existing active-run unique index semantics.
  • summary_counts keys must come from the canonical summary key catalog, and values must be numeric.
  • Reconciliation metadata must be stored in a standardized structure inside context and failure_summary without persisting secrets.

State transitions relevant to this feature:

  • queued/pending → running/pending
  • queued/pending → completed/failed when stale queued reconciliation or direct queue-failure bridging resolves an orphaned queued run
  • running/pending → completed/succeeded|partially_succeeded|blocked|failed
  • running/pending → completed/failed when stale running reconciliation resolves an orphaned active run
  • completed/* is terminal and must never be mutated by reconciliation
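
The transitions listed above can be read as a small lookup table over (status, outcome) pairs. The following is a minimal, language-neutral sketch in Python, not the project's OperationRunService; the names ALLOWED_TRANSITIONS and can_transition are hypothetical, and only the transitions named in this section are encoded.

```python
# Hypothetical sketch: (status, outcome) transitions named in this section.
# Keys are the current state, values are the states that may follow it.
ALLOWED_TRANSITIONS = {
    ("queued", "pending"): {
        ("running", "pending"),
        # Stale-queued reconciliation or direct queue-failure bridging.
        ("completed", "failed"),
    },
    ("running", "pending"): {
        ("completed", "succeeded"),
        ("completed", "partially_succeeded"),
        ("completed", "blocked"),
        # Normal failure or stale-running reconciliation.
        ("completed", "failed"),
    },
}


def can_transition(current, target):
    """Return True if moving from `current` to `target` is permitted."""
    status, _outcome = current
    if status == "completed":
        # completed/* is terminal and must never be mutated.
        return False
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

Keeping the terminal check ahead of the table lookup means a later change to the table cannot accidentally reopen a completed run.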

Failed Job Record (failed_jobs)

Purpose: Infrastructure-level evidence that a queued job exhausted attempts, timed out, or otherwise failed.

Key fields used conceptually:

  • UUID or failed-job identifier
  • connection
  • queue
  • payload
  • exception
  • failed_at

Relationships:

  • Not directly related through a foreign key to OperationRun
  • Linked back to OperationRun through job-owned identity resolution or reconciliation evidence

Validation rules relevant to this feature:

  • A failed-job record is evidence, not operator-facing truth by itself.
  • Evidence may inform reconciliation or diagnostics but must not replace the domain transition on OperationRun.

Queue Job Definition

Purpose: The queued class that owns or advances a covered OperationRun.

Key lifecycle-relevant properties:

  • operationRun reference or getOperationRun() contract
  • optional $timeout
  • optional $failOnTimeout
  • optional $tries or retryUntil()
  • middleware() including TrackOperationRun and other queue middleware
  • optional failed(Throwable $e) callback

Validation rules relevant to this feature:

  • Covered jobs must provide a credible path to terminal truth through direct failure bridging, fallback reconciliation, or both.
  • Covered long-running jobs must have intentional timeout behavior and must be compatible with queue timing invariants.

New Derived Domain Objects

OperationLifecyclePolicy

Purpose: Configuration-backed policy describing which operation types are in scope for V1 and how their lifecycle should be evaluated.

Fields:

  • operationType
  • covered boolean
  • queuedStaleAfterSeconds
  • runningStaleAfterSeconds
  • expectedMaxRuntimeSeconds
  • requiresDirectFailedBridge boolean
  • supportsReconciliation boolean

Validation rules:

  • queuedStaleAfterSeconds and runningStaleAfterSeconds must be positive integers.
  • expectedMaxRuntimeSeconds must stay below the effective queue retry_after by a safety margin.
  • Only covered operation types participate in the generic reconciler for V1.
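
The policy fields and their validation rules could be sketched as follows. This is an illustrative Python model, not the project's implementation: the snake_case field names mirror the fields above, while policy_violations, its safety_margin_seconds parameter, and the 30-second default are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationLifecyclePolicy:
    operation_type: str
    covered: bool
    queued_stale_after_seconds: int
    running_stale_after_seconds: int
    expected_max_runtime_seconds: int
    requires_direct_failed_bridge: bool
    supports_reconciliation: bool


def policy_violations(policy, retry_after_seconds, safety_margin_seconds=30):
    """Return a list of rule violations; an empty list means the policy is valid."""
    problems = []
    for name in ("queued_stale_after_seconds", "running_stale_after_seconds"):
        value = getattr(policy, name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{name} must be a positive integer")
    # The expected runtime plus the (assumed) margin must fit under retry_after.
    if policy.expected_max_runtime_seconds + safety_margin_seconds >= retry_after_seconds:
        problems.append(
            "expected_max_runtime_seconds must stay below retry_after minus the safety margin"
        )
    return problems
```

Returning a list of violations rather than raising on the first one lets a runtime invariant check report every misconfigured policy in a single pass.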

OperationRunFreshnessAssessment

Purpose: Derived classification used by reconcilers and Monitoring surfaces to determine whether a non-terminal run is still trustworthy as active.

Fields:

  • operationRunId
  • status
  • freshnessState with canonical values fresh, likely_stale, terminal, unknown
  • evaluatedAt
  • thresholdSeconds
  • evidence key-value map

Behavior:

  • For queued runs, freshness is typically derived from created_at, absence of started_at, and policy threshold.
  • For running runs, freshness is derived from started_at, last meaningful update evidence available in persisted state, and policy threshold.
  • completed runs always assess as terminal and are excluded from stale reconciliation.
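
The behavior above can be sketched as a pure function over timestamps. This is a hedged Python illustration: assess_freshness is a hypothetical name, updated_at stands in for "last meaningful update evidence", and returning unknown when no usable reference timestamp exists is an assumption rather than something the spec states.

```python
from datetime import datetime, timedelta, timezone


def assess_freshness(status, created_at, started_at, updated_at, threshold_seconds, now):
    """Classify a run as fresh, likely_stale, terminal, or unknown."""
    if status == "completed":
        return "terminal"
    if status == "queued":
        # A queued run is measured from creation, provided it never started.
        reference = created_at if started_at is None else None
    elif status == "running":
        # Use the most recent evidence of activity that is available.
        candidates = [t for t in (started_at, updated_at) if t is not None]
        reference = max(candidates) if candidates else None
    else:
        reference = None
    if reference is None:
        return "unknown"
    age_seconds = (now - reference).total_seconds()
    return "likely_stale" if age_seconds > threshold_seconds else "fresh"


# Example: a run queued ten minutes ago against a five-minute threshold.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
state = assess_freshness("queued", now - timedelta(minutes=10), None, None, 300, now)
```

Passing now in explicitly keeps the assessment deterministic, which makes it easy to exercise in regression tests.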

LifecycleReconciliationRecord

Purpose: Structured reconciliation evidence stored in OperationRun.context and mirrored in failure_summary.

Fields:

  • reconciledAt
  • reconciliationKind such as stale_queued, stale_running, queue_failure_bridge, adapter_sync
  • reasonCode
  • reasonMessage
  • evidence key-value map
  • source such as failed_callback, scheduled_reconciler, adapter_reconciler

Validation rules:

  • Must only be added when the feature force-resolves or directly bridges a run.
  • Must be idempotent; repeat reconciliation must not append conflicting terminal truth.
  • Must be operator-safe and sanitized.
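
One way to meet the idempotency rule is to make the write conditional on the run still being non-terminal. The Python sketch below assumes a run is a plain dictionary, stores the record under a hypothetical context["reconciliation"] key, and uses an assumed shape for the failure_summary entry; it also assumes the record has already been sanitized upstream.

```python
def apply_reconciliation(run, record):
    """Attach `record` to `run` at most once; return True if the run changed."""
    if run["status"] == "completed":
        # Terminal runs are never mutated, so a repeat call is a no-op.
        return False
    run["status"] = "completed"
    run["outcome"] = "failed"
    # Store the full evidence in context (key name is an assumption).
    run["context"]["reconciliation"] = dict(record)
    # Mirror a compact entry into failure_summary (shape is an assumption).
    run["failure_summary"].append(
        {"code": record["reasonCode"], "message": record["reasonMessage"]}
    )
    return True


run = {"status": "running", "outcome": "pending", "context": {}, "failure_summary": []}
record = {
    "reconciledAt": "2026-01-01T12:00:00Z",
    "reconciliationKind": "stale_running",
    "reasonCode": "run.stale_running",
    "reasonMessage": "Run exceeded its running threshold.",
    "evidence": {"thresholdSeconds": 900},
    "source": "scheduled_reconciler",
}
first = apply_reconciliation(run, record)
second = apply_reconciliation(run, record)  # no-op: run is already terminal
```

In a real system the check and the write would need to happen atomically, for example inside a transaction with a conditional update, so that two reconcilers cannot both pass the guard.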

OperationQueueFailureBridge

Purpose: Derived mapping between a queued job failure and the owning OperationRun.

Fields:

  • operationRunId
  • jobClass
  • bridgeSource such as failed_callback or reconciler
  • exceptionClass
  • reasonCode
  • terminalOutcome

Behavior:

  • Exists conceptually as a design contract, not necessarily as a standalone stored table.
  • Bridges queue truth into service-owned OperationRun terminal transitions.
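
As a design contract, the bridge can be pictured as a function that turns a job failure into the fields listed above. The Python sketch below is illustrative only: bridge_from_failure is a hypothetical name, and mapping timeouts to run.infrastructure_timeout_or_abandonment while mapping every other exception to run.queue_failure_bridge is an assumption, not a rule from this spec.

```python
def bridge_from_failure(operation_run_id, job_class, exception, bridge_source="failed_callback"):
    """Build the bridge fields for a failed job that owns an OperationRun."""
    # Assumption: timeouts get the infrastructure reason code, the rest the generic one.
    timed_out = isinstance(exception, TimeoutError)
    return {
        "operationRunId": operation_run_id,
        "jobClass": job_class,
        "bridgeSource": bridge_source,
        # Only the class name is kept; the raw message may contain secrets.
        "exceptionClass": type(exception).__name__,
        "reasonCode": (
            "run.infrastructure_timeout_or_abandonment"
            if timed_out
            else "run.queue_failure_bridge"
        ),
        "terminalOutcome": "failed",
    }
```

The result is plain data; applying it to the run remains the job of the service that owns OperationRun transitions.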

Supporting Catalogs

Reconciliation Reason Codes

Purpose: Stable reason-code catalog for lifecycle healing.

Initial values:

  • run.stale_queued
  • run.stale_running
  • run.infrastructure_timeout_or_abandonment
  • run.queue_failure_bridge
  • run.adapter_out_of_sync

Validation rules:

  • Operator-facing text must be derived from centralized presenters or reason translation helpers.
  • Codes remain stable enough for regression assertions and audit review.
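
A stable catalog with centralized translation could look like the following Python sketch. The code values are the initial values listed above; the ReasonCode enum, the translate helper, and all operator-facing messages are hypothetical placeholders.

```python
from enum import Enum


class ReasonCode(str, Enum):
    STALE_QUEUED = "run.stale_queued"
    STALE_RUNNING = "run.stale_running"
    INFRASTRUCTURE_TIMEOUT_OR_ABANDONMENT = "run.infrastructure_timeout_or_abandonment"
    QUEUE_FAILURE_BRIDGE = "run.queue_failure_bridge"
    ADAPTER_OUT_OF_SYNC = "run.adapter_out_of_sync"


# Placeholder wording; real text would come from the project's presenters.
OPERATOR_MESSAGES = {
    ReasonCode.STALE_QUEUED: "The run was never picked up by a worker.",
    ReasonCode.STALE_RUNNING: "The run stopped reporting progress.",
    ReasonCode.INFRASTRUCTURE_TIMEOUT_OR_ABANDONMENT: "The worker timed out or was lost.",
    ReasonCode.QUEUE_FAILURE_BRIDGE: "The queued job failed.",
    ReasonCode.ADAPTER_OUT_OF_SYNC: "The run state was out of sync with its adapter.",
}


def translate(code):
    """Return operator-facing text; raises ValueError for codes outside the catalog."""
    return OPERATOR_MESSAGES[ReasonCode(code)]
```

Because unknown codes raise instead of falling back to free text, a regression test can assert on the exact code string without depending on message wording.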

Monitoring Freshness State

Purpose: Derived presentation state for Operations surfaces.

Initial values:

  • fresh_active
  • likely_stale
  • reconciled_failed
  • terminal_normal

Behavior:

  • Not stored as a new top-level database enum in V1.
  • Derived centrally so tables, detail pages, and notifications do not drift.
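
Deriving the presentation state in one place might be sketched as below. This Python function is an illustration under assumptions: monitoring_state and the was_reconciled flag (for example, whether a reconciliation record exists in context) are hypothetical, and treating any non-fresh, non-terminal run as likely_stale is a conservative choice the spec does not prescribe.

```python
def monitoring_state(status, freshness_state, was_reconciled):
    """Map run status and freshness to one of the four presentation states."""
    if status == "completed":
        # Terminal runs split on whether lifecycle healing resolved them.
        return "reconciled_failed" if was_reconciled else "terminal_normal"
    if freshness_state == "fresh":
        return "fresh_active"
    # Assumption: unknown freshness is surfaced as likely_stale, not hidden.
    return "likely_stale"
```

Tables, detail pages, and notifications would all call the same function, which is what keeps their labels from drifting apart.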

Consumer Mapping

| Consumer | Primary data it needs |
|---|---|
| Generic lifecycle reconciler | Covered operation policy, active non-terminal runs, freshness assessment, standardized reconciliation transition |
| Covered queued jobs | Owning OperationRun, timeout behavior, direct failed() bridge path |
| Operations index | Current status, outcome, freshness assessment, reconciliation evidence summary |
| Operation run detail | Full reconciliation record, translated reason, run timing, summary counts, failure details |
| Runtime invariant validation | Queue connection retry_after, effective job timeout, covered operation lifecycle policy |

Migration Notes

  • No schema migration is required for the first implementation slice.
  • Existing context and failure_summary structures should be normalized for reconciliation evidence rather than replaced.
  • If later observability needs require indexed reconciliation metrics, a follow-up slice can promote reconciliation metadata into first-class columns or projections.