Ahmed Darrazi 59fc90a4db feat: harden operation lifecycle monitoring

2026-03-23 22:52:37 +01:00

Phase 1 Data Model: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation

Overview

This feature does not require a new database table in the first implementation slice. The primary data-model work is to formalize the existing OperationRun persistence and to add derived lifecycle-policy and freshness concepts, so that queue truth, domain truth, and operator-visible truth converge deterministically.

Persistent Domain Entities

OperationRun

Purpose: Canonical workspace-scoped operational record for long-running, queued, scheduled, or otherwise operator-visible work.

Key fields:

id
workspace_id
tenant_id nullable
user_id nullable
type
status with canonical values queued, running, completed
outcome with canonical values including pending, succeeded, partially_succeeded, blocked, failed
run_identity_hash
summary_counts JSONB
failure_summary JSONB array
context JSONB
started_at
completed_at
created_at
updated_at

Relationships:

Belongs to one workspace
Optionally belongs to one tenant
Optionally belongs to one initiating user

Validation rules relevant to this feature:

status and outcome transitions remain service-owned via OperationRunService.
Non-terminal active runs remain constrained by the existing active-run unique index semantics.
summary_counts keys must come from the canonical summary key catalog, and values must be numeric.
Reconciliation metadata must be stored in a standardized structure inside context and failure_summary without persisting secrets.

State transitions relevant to this feature:

queued/pending → running/pending
queued/pending → completed/failed when stale queued reconciliation or direct queue-failure bridging resolves an orphaned queued run
running/pending → completed/succeeded|partially_succeeded|blocked|failed
running/pending → completed/failed when stale running reconciliation resolves an orphaned active run
completed/* is terminal and must never be mutated by reconciliation
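
The transitions listed above can be read as a small lookup table keyed by (status, outcome) pairs. The following is a minimal sketch in Python, used purely for illustration. The names ALLOWED_TRANSITIONS and can_transition are hypothetical, the outcome set only covers the values named in this document, and the real transitions remain owned by OperationRunService.

```python
# Hypothetical sketch of the transition rules listed above.
# Each state is a (status, outcome) pair.
TERMINAL_OUTCOMES = {"succeeded", "partially_succeeded", "blocked", "failed"}

ALLOWED_TRANSITIONS = {
    ("queued", "pending"): {
        ("running", "pending"),
        # Stale-queued reconciliation or direct queue-failure bridging.
        ("completed", "failed"),
    },
    # Normal completion, or stale-running reconciliation (completed/failed).
    ("running", "pending"): {("completed", o) for o in TERMINAL_OUTCOMES},
}


def can_transition(current, target):
    """Return True if current -> target is one of the listed transitions."""
    if current[0] == "completed":
        # completed/* is terminal and must never be mutated.
        return False
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

Keeping the rules in one table means a reconciler and a direct failure bridge would consult the same definition of what counts as a legal terminal transition.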

Failed Job Record (`failed_jobs`)

Purpose: Infrastructure-level evidence that a queued job exhausted attempts, timed out, or otherwise failed.

Key fields used conceptually:

UUID or failed-job identifier
connection
queue
payload
exception
failed_at

Relationships:

Not directly related through a foreign key to OperationRun
Linked back to OperationRun through job-owned identity resolution or reconciliation evidence

Validation rules relevant to this feature:

A failed-job record is evidence, not operator-facing truth by itself.
Evidence may inform reconciliation or diagnostics but must not replace the domain transition on OperationRun.

Queue Job Definition

Purpose: The queued class that owns or advances a covered OperationRun.

Key lifecycle-relevant properties:

operationRun reference or getOperationRun() contract
optional $timeout
optional $failOnTimeout
optional $tries or retryUntil()
middleware() including TrackOperationRun and other queue middleware
optional failed(Throwable $e) callback

Validation rules relevant to this feature:

Covered jobs must provide a credible path to terminal truth through direct failure bridging, fallback reconciliation, or both.
Covered long-running jobs must have intentional timeout behavior and must be compatible with queue timing invariants.

New Derived Domain Objects

OperationLifecyclePolicy

Purpose: Configuration-backed policy describing which operation types are in scope for V1 and how their lifecycle should be evaluated.

Fields:

operationType
covered boolean
queuedStaleAfterSeconds
runningStaleAfterSeconds
expectedMaxRuntimeSeconds
requiresDirectFailedBridge boolean
supportsReconciliation boolean

Validation rules:

queuedStaleAfterSeconds and runningStaleAfterSeconds must be positive integers.
expectedMaxRuntimeSeconds must stay below the effective queue retry_after, with a safety margin.
Only covered operation types participate in the generic reconciler for V1.
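
The policy fields and validation rules above could be modeled roughly as follows. This is an illustrative Python sketch, not the project's implementation. The validate method, its retry_after_seconds parameter, and the 30-second default safety margin are assumptions; the document does not specify a margin value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationLifecyclePolicy:
    operation_type: str
    covered: bool
    queued_stale_after_seconds: int
    running_stale_after_seconds: int
    expected_max_runtime_seconds: int
    requires_direct_failed_bridge: bool
    supports_reconciliation: bool

    def validate(self, retry_after_seconds, safety_margin_seconds=30):
        """Return a list of violations; an empty list means the policy is valid.

        safety_margin_seconds is an assumed default, not a value from the spec.
        """
        errors = []
        for name in ("queued_stale_after_seconds", "running_stale_after_seconds"):
            value = getattr(self, name)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                errors.append(f"{name} must be a positive integer")
        if self.expected_max_runtime_seconds + safety_margin_seconds >= retry_after_seconds:
            errors.append(
                "expected_max_runtime_seconds must stay below retry_after "
                "minus the safety margin"
            )
        return errors
```

Returning a list of violations rather than raising on the first one lets a runtime invariant check report every misconfigured policy in a single pass.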

OperationRunFreshnessAssessment

Purpose: Derived classification used by reconcilers and Monitoring surfaces to determine whether a non-terminal run is still trustworthy as active.

Fields:

operationRunId
status
freshnessState with canonical values fresh, likely_stale, terminal, unknown
evaluatedAt
thresholdSeconds
evidence key-value map

Behavior:

For queued runs, freshness is typically derived from created_at, the absence of started_at, and the policy threshold.
For running runs, freshness is derived from started_at, the last meaningful update evidence available in persisted state, and the policy threshold.
completed runs always assess as terminal and are excluded from stale reconciliation.
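
The behavior described above can be sketched as a single classification function. This Python sketch is illustrative only. It assumes updated_at stands in for the "last meaningful update evidence", and it assumes that inconsistent inputs (for example a queued run that already has started_at) assess as unknown; neither choice is stated in this document.

```python
from datetime import datetime, timedelta, timezone


def assess_freshness(status, created_at, started_at, updated_at, now,
                     queued_stale_after, running_stale_after):
    """Classify a run as fresh, likely_stale, terminal, or unknown."""
    if status == "completed":
        # Completed runs are always terminal and never reconciled as stale.
        return "terminal"
    if status == "queued":
        # Assumption: a queued run with started_at set is inconsistent.
        if started_at is not None or created_at is None:
            return "unknown"
        age = (now - created_at).total_seconds()
        return "likely_stale" if age > queued_stale_after else "fresh"
    if status == "running":
        if started_at is None:
            return "unknown"
        # Assumption: updated_at is the latest persisted progress evidence.
        last_evidence = max(started_at, updated_at or started_at)
        age = (now - last_evidence).total_seconds()
        return "likely_stale" if age > running_stale_after else "fresh"
    return "unknown"
```

Passing now in explicitly keeps the assessment deterministic, which makes it straightforward to pin in regression tests.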

LifecycleReconciliationRecord

Purpose: Structured reconciliation evidence stored in OperationRun.context and mirrored in failure_summary.

Fields:

reconciledAt
reconciliationKind such as stale_queued, stale_running, queue_failure_bridge, adapter_sync
reasonCode
reasonMessage
evidence key-value map
source such as failed_callback, scheduled_reconciler, adapter_reconciler

Validation rules:

Must only be added when the feature force-resolves or directly bridges a run.
Must be idempotent; repeat reconciliation must not append conflicting terminal truth.
Must be operator-safe and sanitized.
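
The idempotency rule above could look roughly like the following Python sketch. The function name apply_reconciliation, the "reconciliation" key inside context, and the shape of the failure_summary entry are all hypothetical; this document only states that the record is stored in context and mirrored in failure_summary.

```python
import copy


def apply_reconciliation(context, failure_summary, record):
    """Return updated (context, failure_summary); a no-op if already reconciled."""
    if "reconciliation" in context:
        # Idempotent: keep the first terminal truth, never add a conflicting one.
        return context, failure_summary
    new_context = copy.deepcopy(context)
    new_context["reconciliation"] = dict(record)
    # Mirror a compact, operator-safe entry into failure_summary.
    new_summary = list(failure_summary) + [{
        "code": record["reasonCode"],
        "message": record["reasonMessage"],
    }]
    return new_context, new_summary
```

A second reconciliation attempt with a different reason code leaves both structures unchanged, so a scheduled reconciler and a failed-job callback racing on the same run cannot produce two competing explanations.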

OperationQueueFailureBridge

Purpose: Derived mapping between a queued job failure and the owning OperationRun.

Fields:

operationRunId
jobClass
bridgeSource such as failed_callback or reconciler
exceptionClass
reasonCode
terminalOutcome

Behavior:

Exists conceptually as a design contract, not necessarily as a standalone stored table.
Bridges queue truth into service-owned OperationRun terminal transitions.
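
As a rough illustration of the bridge contract above, the following Python sketch builds the bridge fields from a job failure. The function build_failure_bridge is hypothetical, and the choice of reason code (timeout errors map to run.infrastructure_timeout_or_abandonment, everything else to run.queue_failure_bridge) is an assumption made for the example, not a rule from this document.

```python
def build_failure_bridge(operation_run_id, job_class, exception,
                         bridge_source="failed_callback"):
    """Map a queued job failure to the fields of OperationQueueFailureBridge."""
    # Assumption: timeouts get the infrastructure reason code.
    timed_out = isinstance(exception, TimeoutError)
    return {
        "operationRunId": operation_run_id,
        "jobClass": job_class,
        "bridgeSource": bridge_source,
        # Only the class name is kept, so no exception message is persisted.
        "exceptionClass": type(exception).__name__,
        "reasonCode": (
            "run.infrastructure_timeout_or_abandonment"
            if timed_out else "run.queue_failure_bridge"
        ),
        "terminalOutcome": "failed",
    }
```

The resulting mapping would then be handed to the service-owned transition rather than written to the run directly.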

Supporting Catalogs

Reconciliation Reason Codes

Purpose: Stable reason-code catalog for lifecycle healing.

Initial values:

run.stale_queued
run.stale_running
run.infrastructure_timeout_or_abandonment
run.queue_failure_bridge
run.adapter_out_of_sync

Validation rules:

Operator-facing text must be derived from centralized presenters or reason translation helpers.
Codes remain stable enough for regression assertions and audit review.

Monitoring Freshness State

Purpose: Derived presentation state for Operations surfaces.

Initial values:

fresh_active
likely_stale
reconciled_failed
terminal_normal

Behavior:

Not stored as a new top-level database enum in V1.
Derived centrally so tables, detail pages, and notifications do not drift.
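
One way to derive the presentation state centrally is a single pure function that every surface calls. The Python sketch below is illustrative only. The function name and the was_reconciled flag are hypothetical, and because this document does not say how an unknown freshness assessment should be presented, the sketch raises an error for it instead of guessing.

```python
def monitoring_freshness_state(status, freshness_state, was_reconciled):
    """Derive the Operations-surface state from run status and freshness."""
    if status == "completed":
        # A force-resolved run is shown differently from a normal completion.
        return "reconciled_failed" if was_reconciled else "terminal_normal"
    if freshness_state == "fresh":
        return "fresh_active"
    if freshness_state == "likely_stale":
        return "likely_stale"
    # The mapping for other assessments (e.g. unknown) is not specified.
    raise ValueError(f"no presentation state for freshness {freshness_state!r}")
```

Because tables, detail pages, and notifications would all go through the same function, a change to the mapping lands everywhere at once.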

Consumer Mapping

| Consumer | Primary data it needs |
| --- | --- |
| Generic lifecycle reconciler | Covered operation policy, active non-terminal runs, freshness assessment, standardized reconciliation transition |
| Covered queued jobs | Owning `OperationRun`, timeout behavior, direct `failed()` bridge path |
| Operations index | Current status, outcome, freshness assessment, reconciliation evidence summary |
| Operation run detail | Full reconciliation record, translated reason, run timing, summary counts, failure details |
| Runtime invariant validation | Queue connection `retry_after`, effective job timeout, covered operation lifecycle policy |

Migration Notes

No schema migration is required for the first implementation slice.
Existing context and failure_summary structures should be normalized for reconciliation evidence rather than replaced.
If later observability needs require indexed reconciliation metrics, a follow-up slice can promote reconciliation metadata into first-class columns or projections.
