## Summary

- harden operation-run lifecycle handling with explicit reconciliation policy, stale-run healing, failed-job bridging, and monitoring visibility
- refactor audit log event inspection into a Filament slide-over and remove the stale inline detail/header-action coupling
- align panel theme asset resolution and supporting Filament UI updates, including the rounded 2xl theme token regression fix

## Testing

- ran focused Pest coverage for the affected audit-log inspection flow and related visibility tests
- ran formatting with `vendor/bin/sail bin pint --dirty --format agent`
- manually verified the updated audit-log slide-over flow in the integrated browser

## Notes

- branch includes the Spec 160 artifacts under `specs/160-operation-lifecycle-guarantees/`
- the full test suite was not rerun as part of this final commit/PR step

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #190
# Phase 0 Research: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation
**Decision**: Extend the existing `OperationRunService` and reconciliation seams instead of creating a new orchestration subsystem.

**Rationale**: The repo already treats `OperationRunService` as the canonical owner of lifecycle transitions and already contains two partial healing patterns: stale-queued reconciliation in `OperationRunService`, plus type-specific reconciliation for backup schedule runs and restore adapter-backed runs. Extending these seams keeps lifecycle truth service-owned, avoids a second state machine, and aligns with the constitution rule that `OperationRun.status` and `OperationRun.outcome` transitions remain centralized.

**Alternatives considered**:
- Create a new orchestration or workflow engine for all queued operations: rejected because Spec 160 is a reliability-hardening feature, not a queue-platform redesign.
- Let each operation type keep bespoke reconcile code forever: rejected because the current type-by-type pattern already left major gaps, including the Run-126 class of failure.
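The service-owned transition rule above can be sketched as a small guard that every lifecycle change funnels through. This is a minimal illustration under assumed names; the class, constant, and allowed-transition map below are hypothetical, not the repo's actual `OperationRunService` API:

```php
<?php

// Hypothetical sketch: centralize allowed lifecycle transitions in one guard
// so reconciliation paths reuse the same rules as normal execution.
final class LifecycleGuard
{
    /** @var array<string, string[]> allowed status transitions (illustrative) */
    private const ALLOWED = [
        'queued'    => ['running', 'completed'],
        'running'   => ['completed'],
        'completed' => [],
    ];

    public function transition(string $from, string $to): string
    {
        if (!in_array($to, self::ALLOWED[$from] ?? [], true)) {
            throw new DomainException("Illegal transition: {$from} -> {$to}");
        }

        return $to;
    }
}

$guard = new LifecycleGuard();
echo $guard->transition('queued', 'running'), PHP_EOL;    // prints "running"
echo $guard->transition('running', 'completed'), PHP_EOL; // prints "completed"
```

Keeping one guard like this is what lets a reconciler mark a stale run terminal without introducing a second state machine.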
**Decision**: Use a layered truth-bridge strategy: direct `failed()` callbacks for covered jobs, plus scheduled stale-run reconciliation as the safety net.

**Rationale**: Laravel's queue documentation confirms that `failed()` is invoked for jobs that exhaust their attempts or time out, even when the final exception is `MaxAttemptsExceededException` or `TimeoutExceededException`. That makes job-owned `failed()` bridging the cleanest direct path for covered jobs that can resolve their owning `OperationRun`. However, process death, a worker kill, or other infrastructure interruption can still prevent any callback from running. Scheduled stale-run reconciliation therefore remains required as the final guarantee.

**Alternatives considered**:
- Rely only on the `TrackOperationRun` middleware: rejected because middleware cannot handle failures that occur before the middleware pipeline is entered or after the infrastructure has already declared the job failed.
- Rely only on stale-run reconciliation: rejected because direct failure bridging gives faster truth convergence and better structured failure reasons when the queue does provide a terminal failure callback.
- Parse `failed_jobs` records as the primary bridge: rejected for V1 because payload parsing and job-class introspection are more brittle than job-owned identity plus domain reconciliation.
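The direct bridging path can be sketched as a `failed()` hook on a covered job. The job class and service method below are hypothetical placeholders, not the repo's actual classes; only the `failed(Throwable $exception)` hook itself is the Laravel-provided seam:

```php
// Hypothetical sketch of job-owned failure bridging. Laravel invokes failed()
// when a job exhausts its attempts or times out; the hook resolves the owning
// run via job-owned identity and delegates the terminal transition to the
// canonical lifecycle service instead of mutating state itself.
class CoveredOperationJob implements ShouldQueue
{
    public function __construct(public readonly int $operationRunId) {}

    public function failed(Throwable $exception): void
    {
        // reconcileFailedJob() is an assumed service method, shown for shape only.
        app(OperationRunService::class)->reconcileFailedJob(
            $this->operationRunId,
            'queue_reported_failure',            // stable, domain-safe reason code
            ['exception' => $exception::class],  // infrastructure detail stays diagnostics-only
        );
    }
}
```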
**Decision**: Preserve the existing queued / running / completed lifecycle model and add derived freshness semantics instead of new top-level statuses.

**Rationale**: The current domain model already separates lifecycle state from execution outcome. The real gap is not missing statuses but missing convergence and missing freshness interpretation. Preserving the existing model avoids broad downstream breakage while still allowing the UI and contracts to distinguish fresh active work, likely-stale work, and reconciled failure through centralized presenters and reason codes.

**Alternatives considered**:
- Add new top-level statuses such as `stale` or `reconciled`: rejected because this would spread persistence and presentation changes across the codebase for limited benefit.
- Leave stale interpretation implicit: rejected because operators need explicit liveness truth and tests need stable semantics.
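Derived freshness can be computed from existing status timestamps rather than persisted as a new status. A minimal sketch, with an assumed function name and illustrative threshold values:

```php
<?php

// Sketch of derived freshness semantics: staleness is interpreted from the
// time spent in an active status against a per-status threshold, never stored
// as a new top-level status. Function name and thresholds are assumptions.
function isLikelyStale(string $status, int $secondsInStatus, array $thresholds): bool
{
    // Only active lifecycle states can go stale; terminal runs never do.
    if (!in_array($status, ['queued', 'running'], true)) {
        return false;
    }

    return $secondsInStatus > ($thresholds[$status] ?? PHP_INT_MAX);
}

$thresholds = ['queued' => 900, 'running' => 3600]; // illustrative seconds

var_dump(isLikelyStale('running', 4000, $thresholds));     // bool(true)
var_dump(isLikelyStale('queued', 120, $thresholds));       // bool(false)
var_dump(isLikelyStale('completed', 999999, $thresholds)); // bool(false)
```

Because the interpretation is pure and centralized, the UI, contracts, and tests can all consume the same liveness truth without any schema change.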
**Decision**: Define V1 coverage and stale thresholds through a configuration-backed lifecycle policy, not ad hoc hardcoded checks.

**Rationale**: The repo already has one hardcoded stale-queued default and one type-specific backup-schedule reconcile command. Spec 160 needs explicit V1 coverage, status-specific thresholds, and timing expectations across multiple run types. A configuration-backed lifecycle policy keeps the first slice schema-light, auditable, and easier to validate in tests, while preventing the logic from being scattered across multiple jobs and commands.

**Alternatives considered**:
- Keep a single global stale threshold for every operation type: rejected because long-running baseline or report jobs and shorter sync jobs have different legitimate runtime envelopes.
- Store lifecycle policy in a new database table: rejected for V1 because rollout speed and deterministic config review matter more than runtime mutability.
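One possible shape for such a policy file; the file name, keys, and values below are all illustrative assumptions about what a V1 policy could hold, not the shipped configuration:

```php
<?php

// Illustrative configuration-backed lifecycle policy, e.g. a hypothetical
// config/operation-lifecycle.php. Every key and value here is an assumption.
return [
    // Per-status stale thresholds in seconds, with per-type overrides so
    // long-running baseline/report jobs and shorter sync jobs get different
    // legitimate runtime envelopes instead of one global threshold.
    'stale_thresholds' => [
        'default'   => ['queued' => 900, 'running' => 3600],
        'overrides' => [
            'report' => ['running' => 14400],
            'sync'   => ['running' => 1800],
        ],
    ],

    // Operation types covered by direct failed() bridging in V1; everything
    // else relies solely on scheduled stale-run reconciliation.
    'covered_types' => ['backup', 'restore', 'sync'],
];
```

A config file like this is reviewable in diffs and trivially assertable in tests, which is exactly the auditability the decision calls for.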
**Decision**: Store reconciliation evidence in the existing `context` and `failure_summary` structures for the first slice.

**Rationale**: `OperationRun` already stores structured JSONB context and failure arrays, and existing triage flows already record structured metadata under `context['triage']`. Reusing these structures keeps the first slice migration-free while still allowing operator-safe explanation, auditability, and later observability extraction. The key requirement is to standardize the reconciliation metadata shape and reason codes.

**Alternatives considered**:
- Add top-level `reconciled_at` and `reconciliation_reason` columns immediately: rejected for V1 because the repo already has a structured metadata pattern that can support the feature without schema churn.
- Store only a free-text failure message: rejected because stable reason codes and timestamps are required for auditability and future metrics.
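The standardized reconciliation metadata could mirror the existing `context['triage']` pattern. The field names and values below are assumed conventions shown for shape only, not the shipped schema:

```php
<?php

// Illustrative reconciliation evidence stored under the existing JSONB
// context column; every field name and value here is an assumption.
$context['reconciliation'] = [
    'reason_code'   => 'stale_running_exceeded_threshold', // stable, testable code
    'reconciled_at' => '2025-11-30T06:15:00Z',             // convergence timestamp
    'source'        => 'scheduled_reconciler',             // or 'job_failed_callback'
    'diagnostics'   => [
        // Infrastructure detail stays diagnostics-only, never operator-facing copy.
        'exception' => 'TimeoutExceededException',
    ],
];
```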
**Decision**: Reuse the centralized Ops-UX and badge presentation seams for stale and reconciled semantics.

**Rationale**: The repo already centralizes run-facing language through `OperationUxPresenter`, `ReasonPresenter`, and the badge domain helpers for status and outcome. Extending those seams keeps stale and reconciled semantics consistent across the Operations index, run detail, and notifications while honoring BADGE-001 and UI-NAMING-001. It also avoids ad hoc table-only mappings that would drift.

**Alternatives considered**:
- Add custom inline labels only on the Operations table: rejected because run detail, widgets, and notifications would then drift semantically.
- Surface low-level queue exceptions directly in primary badges: rejected because operator-facing copy must stay domain-safe and infrastructure details must remain diagnostics-only.
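Centralized presentation can reduce to one mapping shared by the table, run detail, and notifications. The helper name, labels, and colors below are illustrative, not the repo's `OperationUxPresenter` API:

```php
<?php

// Sketch of a single centralized badge mapping for stale/reconciled semantics
// so every surface renders the same operator-safe copy. Names are assumptions.
function lifecycleBadge(string $status, bool $likelyStale, bool $reconciled): array
{
    return match (true) {
        $reconciled  => ['label' => 'Failed (reconciled)', 'color' => 'danger'],
        $likelyStale => ['label' => 'Running (liveness unknown)', 'color' => 'warning'],
        default      => ['label' => ucfirst($status), 'color' => 'info'],
    };
}

print_r(lifecycleBadge('running', true, false)); // shows the warning-variant badge
```

Note that the mapping never leaks queue exception classes into the label; those stay in diagnostics, keeping primary badges domain-safe.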
**Decision**: Treat queue timing alignment as product correctness and validate it explicitly.

**Rationale**: The current queue connections use `retry_after = 600`, while at least some covered jobs explicitly set `$timeout = 300`. Laravel's documentation is explicit that a job's timeout must remain shorter than `retry_after`; otherwise, the same job may be re-dispatched while a worker is still processing it. Spec 160 is driven by exactly this truth-divergence problem, so timing alignment must become a documented and testable lifecycle invariant rather than an informal deployment note.

**Alternatives considered**:
- Document timing rules only in deployment notes: rejected because silent drift in job or worker settings would recreate the same incident class.
- Validate only global worker timeout and ignore per-job timeout: rejected because covered jobs already override timeout values and the invariant has to hold at the job-policy level.
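The invariant can be made executable as a simple policy check. The 600/300 values mirror the numbers cited in the rationale, while the helper function and the violating `report` entry are assumptions added for illustration:

```php
<?php

// Sketch of the timing invariant as a testable check: each covered job's
// timeout must stay strictly below the connection's retry_after, or the queue
// can re-dispatch a job that a worker is still executing.
function timeoutRespectsRetryAfter(int $jobTimeout, int $retryAfter): bool
{
    return $jobTimeout < $retryAfter;
}

$retryAfter  = 600;                              // queue connection retry_after
$jobTimeouts = ['sync' => 300, 'report' => 900]; // per-job $timeout values (illustrative)

foreach ($jobTimeouts as $type => $timeout) {
    echo $type, ': ',
        timeoutRespectsRetryAfter($timeout, $retryAfter) ? 'ok' : 'VIOLATION',
        PHP_EOL;
}
// sync: ok
// report: VIOLATION
```

Running a check like this inside the test suite turns the deployment note into a lifecycle invariant that fails loudly when a job or worker setting drifts.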