ahmido 845d21db6d feat: harden operation lifecycle monitoring (#190)

## Summary
- harden operation-run lifecycle handling with explicit reconciliation policy, stale-run healing, failed-job bridging, and monitoring visibility
- refactor audit log event inspection into a Filament slide-over and remove the stale inline detail/header-action coupling
- align panel theme asset resolution and supporting Filament UI updates, including the rounded 2xl theme token regression fix

## Testing
- ran focused Pest coverage for the affected audit-log inspection flow and related visibility tests
- ran formatting with `vendor/bin/sail bin pint --dirty --format agent`
- manually verified the updated audit-log slide-over flow in the integrated browser

## Notes
- branch includes the Spec 160 artifacts under `specs/160-operation-lifecycle-guarantees/`
- the full test suite was not rerun as part of this final commit/PR step

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #190

2026-03-23 21:53:19 +00:00

Quickstart: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation

Goal

Validate that every covered queued OperationRun execution eventually reaches a trustworthy terminal state, and that Monitoring surfaces no longer present orphaned runs as ordinary, indefinitely active work.

Prerequisites

Start Sail.
Ensure the queue worker is running through Sail.
Ensure the database contains at least one workspace with operator-visible operation runs for covered types.
Ensure test fixtures or factories can create OperationRun records in queued, running, and completed states.

Implementation Validation Order

1. Run focused lifecycle service tests

vendor/bin/sail artisan test --compact --filter=OperationRunService

Expected outcome:

Stale queued reconciliation still works.
Stale running reconciliation is added and service-owned.
Terminal runs are not mutated.

2. Run focused reconciler tests

vendor/bin/sail artisan test --compact --filter=LifecycleReconciler
vendor/bin/sail artisan test --compact --filter=stale

Expected outcome:

Stale queued runs are force-resolved to completed/failed.
Stale running runs are force-resolved to completed/failed.
Fresh runs remain untouched.
Reconciliation is idempotent across repeated execution.
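
The expected outcomes for the reconciler can be read as a small state-transition rule. The Python sketch below is a language-neutral illustration, not the project's PHP implementation. The `Run` fields, the `reconciled` flag, and the threshold values are assumptions made for the example, and "completed/failed" is modeled here as status `completed` with outcome `failed`.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical stale thresholds; the real lifecycle policy values are not
# stated in this document.
STALE_AFTER = {
    "queued": timedelta(minutes=15),
    "running": timedelta(minutes=30),
}


@dataclass(frozen=True)
class Run:
    status: str                  # "queued", "running", or "completed"
    outcome: Optional[str]       # None until terminal, then e.g. "failed"
    last_activity_at: datetime
    reconciled: bool = False     # assumed marker for lifecycle reconciliation


def reconcile(run: Run, now: datetime) -> Run:
    """Return the run unchanged unless it is non-terminal and stale."""
    threshold = STALE_AFTER.get(run.status)
    if threshold is None:
        return run  # terminal runs are never mutated
    if now - run.last_activity_at <= threshold:
        return run  # fresh runs remain untouched
    # Stale queued or running run: force-resolve to a terminal failure.
    return replace(run, status="completed", outcome="failed", reconciled=True)
```

Because a reconciled run is already terminal, passing it through `reconcile` again returns it unchanged, which is the idempotency property the tests in this step are expected to assert.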

3. Run focused failed-job bridge tests

vendor/bin/sail artisan test --compact --filter=failed
vendor/bin/sail artisan test --compact --filter=MaxAttempts
vendor/bin/sail artisan test --compact --filter=TimeoutExceeded

Expected outcome:

Covered jobs with direct failed() bridges map queue failure truth back to OperationRun.
Queue failures that never complete normal middleware finalization still converge through reconciliation.
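
As a rough illustration of what a direct failed() bridge does, the Python sketch below models a worker that calls a job's `failed()` hook once attempts are exhausted, and a hook that writes the failure back to the run. All names here (`RunStore`, `CoveredJob`, `run_on_worker`, `failure_reason`) are hypothetical stand-ins, not the project's classes.

```python
from typing import Callable, Dict, Optional


class RunStore:
    """In-memory stand-in for OperationRun persistence (hypothetical)."""

    def __init__(self) -> None:
        self.runs: Dict[int, dict] = {}

    def fail_run(self, run_id: int, reason: str) -> None:
        run = self.runs[run_id]
        if run["status"] == "completed":
            return  # terminal runs are not mutated
        run.update(status="completed", outcome="failed", failure_reason=reason)


class CoveredJob:
    def __init__(self, run_id: int, store: RunStore, work: Callable[[], None]) -> None:
        self.run_id = run_id
        self.store = store
        self.work = work

    def handle(self) -> None:
        self.work()

    def failed(self, exc: Exception) -> None:
        # The bridge: translate the queue-level failure into domain state.
        self.store.fail_run(self.run_id, type(exc).__name__)


def run_on_worker(job: CoveredJob, max_attempts: int = 1) -> bool:
    """Retry up to max_attempts (>= 1), then invoke the job's failed() hook."""
    last: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            job.handle()
            return True
        except Exception as exc:
            last = exc
    assert last is not None
    job.failed(last)
    return False
```

If the worker process dies before the hook can run, nothing in this sketch updates the run, which is the gap that stale-run reconciliation is meant to close.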

4. Run the Run-126 regression scenario

vendor/bin/sail artisan test --compact --filter=Run126
vendor/bin/sail artisan test --compact --filter=orphaned

Expected outcome:

A run left in running without completeRun() or failRun() is marked terminal failed once the stale threshold is exceeded.
The operator-facing state no longer implies normal active work.

5. Run focused Monitoring UX tests

vendor/bin/sail artisan test --compact tests/Feature/Operations
vendor/bin/sail artisan test --compact --filter=Operations

Expected outcome:

The Operations index distinguishes fresh activity from stale or reconciled failure semantics.
The run detail distinguishes normal failure from reconciled lifecycle failure.
Canonical Monitoring authorization semantics remain intact.
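
One way to picture the distinction the Monitoring tests look for is a pure function from run state to a display label. The Python sketch below is illustrative only: the label strings, the `reconciled` flag, and the 30-minute threshold are assumptions, not the project's actual presentation logic.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical threshold after which an active run stops looking "normal".
STALE_AFTER = timedelta(minutes=30)


def display_state(
    status: str,
    outcome: Optional[str],
    reconciled: bool,
    last_activity_at: datetime,
    now: datetime,
) -> str:
    """Map a run to the label an operator would see in the list or detail view."""
    if status == "completed":
        if outcome == "failed":
            # Separate lifecycle-reconciled failures from ordinary ones.
            return "failed (reconciled)" if reconciled else "failed"
        return outcome or "completed"
    if now - last_activity_at > STALE_AFTER:
        # Non-terminal but silent for too long: do not imply normal activity.
        return "likely stale"
    return "in progress" if status == "running" else "queued"
```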

6. Run runtime timing guard tests

vendor/bin/sail artisan test --compact --filter=retry_after
vendor/bin/sail artisan test --compact --filter=timeout

Expected outcome:

Covered lifecycle policy timeouts stay safely below effective retry_after.
Misaligned timing assumptions fail validation instead of remaining implicit.

Runtime notes

Covered lifecycle jobs now declare explicit timeout values and set failOnTimeout = true.
The lifecycle validator expects covered job timeouts and expected runtimes to stay below queue retry_after with a safety margin.
If queue worker settings change during rollout, run vendor/bin/sail artisan queue:restart so workers pick up the new lifecycle contract.
Production and staging stop-wait settings (for example, Supervisor's stopwaitsecs, if workers run under Supervisor) must stay above the longest covered timeout so workers can exit cleanly instead of orphaning in-flight runs.
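
The timing relationships in these notes can be checked mechanically. The Python sketch below is a minimal illustration under assumed names and an assumed 30-second safety margin. The real validator's margin and configuration keys are not given in this document.

```python
from typing import List


def timing_violations(
    job: str,
    timeout: int,
    expected_runtime: int,
    retry_after: int,
    stop_wait: int,
    safety_margin: int = 30,  # hypothetical margin, in seconds
) -> List[str]:
    """Return human-readable problems; an empty list means the timings align."""
    problems: List[str] = []
    ceiling = retry_after - safety_margin
    if timeout >= ceiling:
        problems.append(f"{job}: timeout {timeout}s is not below retry_after minus margin ({ceiling}s)")
    if expected_runtime >= ceiling:
        problems.append(f"{job}: expected runtime {expected_runtime}s is not below {ceiling}s")
    if stop_wait <= timeout:
        problems.append(f"{job}: stop-wait {stop_wait}s does not exceed timeout {timeout}s")
    return problems
```

For example, `timing_violations("sync", timeout=300, expected_runtime=240, retry_after=600, stop_wait=330)` returns an empty list, while lowering `retry_after` to 90 reports both the timeout and the expected runtime as violations.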

7. Manual smoke-check in the browser

Open /admin/operations and inspect a fresh active run.
Inspect a deliberately stale or reconciled run and confirm the list no longer presents it as ordinary in-progress work.
Open /admin/operations/{run} for a reconciled run and confirm the detail page shows an operator-safe lifecycle explanation plus secondary diagnostics.
Confirm the existing "View run" navigation remains canonical and that no new destructive action is introduced.

Non-Goals For This Slice

No resumable execution or checkpoint recovery.
No queue backend replacement or Horizon adoption.
No new manual retry or re-drive UI.
No new OperationRun status enum.
