TenantAtlas/specs/160-operation-lifecycle-guarantees/quickstart.md
2026-03-23 22:52:37 +01:00

Quickstart: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation

Goal

Validate that covered queued OperationRun executions always converge to trustworthy terminal truth and that Monitoring surfaces no longer imply indefinite normal activity for orphaned runs.

Prerequisites

  1. Start Sail.
  2. Ensure the queue worker is running through Sail.
  3. Ensure the database contains at least one workspace with operator-visible operation runs for covered types.
  4. Ensure test fixtures or factories can create OperationRun records in queued, running, and completed states.

Implementation Validation Order

1. Run focused lifecycle service tests

vendor/bin/sail artisan test --compact --filter=OperationRunService

Expected outcome:

  • Stale queued reconciliation still works.
  • Stale running reconciliation is added and service-owned.
  • Terminal runs are not mutated.

2. Run focused reconciler tests

vendor/bin/sail artisan test --compact --filter=LifecycleReconciler
vendor/bin/sail artisan test --compact --filter=stale

Expected outcome:

  • Stale queued runs are force-resolved to completed/failed.
  • Stale running runs are force-resolved to completed/failed.
  • Fresh runs remain untouched.
  • Reconciliation is idempotent across repeated execution.
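
The expected outcomes above amount to a small decision rule. The sketch below is illustrative only: `reconcile_decision`, its arguments, and the `force-fail` / `leave` labels are hypothetical names, not the project's reconciler API. It assumes a run counts as stale once its age strictly exceeds the threshold.

```shell
# Hypothetical helper, not the project's implementation.
# Arguments: run status, seconds since last activity, stale threshold in seconds.
reconcile_decision() {
  rd_status=$1
  rd_age=$2
  rd_threshold=$3
  case "$rd_status" in
    queued|running)
      # Only non-terminal runs past the threshold are force-resolved.
      if [ "$rd_age" -gt "$rd_threshold" ]; then
        echo "force-fail"
      else
        echo "leave"
      fi
      ;;
    *)
      # Terminal runs are never mutated, whatever their age.
      echo "leave"
      ;;
  esac
}
```

Because a force-resolved run is terminal afterwards, a second pass over the same run falls into the last branch and leaves it alone, which is the idempotence property the tests check.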

3. Run focused failed-job bridge tests

vendor/bin/sail artisan test --compact --filter=failed
vendor/bin/sail artisan test --compact --filter=MaxAttempts
vendor/bin/sail artisan test --compact --filter=TimeoutExceeded

Expected outcome:

  • Covered jobs with direct failed() bridges map queue failure truth back to OperationRun.
  • Queue failures that never complete normal middleware finalization still converge through reconciliation.

4. Run the Run-126 regression scenario

vendor/bin/sail artisan test --compact --filter=Run126
vendor/bin/sail artisan test --compact --filter=orphaned

Expected outcome:

  • A run left in the running state with no completeRun() or failRun() call is marked as terminally failed once the stale threshold is exceeded.
  • The operator-facing state no longer implies normal active work.

5. Run focused Monitoring UX tests

vendor/bin/sail artisan test --compact tests/Feature/Operations
vendor/bin/sail artisan test --compact --filter=Operations

Expected outcome:

  • The Operations index distinguishes fresh activity from stale or reconciled failure semantics.
  • The run detail distinguishes normal failure from reconciled lifecycle failure.
  • Canonical Monitoring authorization semantics remain intact.

6. Run runtime timing guard tests

vendor/bin/sail artisan test --compact --filter=retry_after
vendor/bin/sail artisan test --compact --filter=timeout

Expected outcome:

  • Covered lifecycle policy timeouts stay safely below effective retry_after.
  • Misaligned timing assumptions fail validation instead of remaining implicit.

Runtime notes

  • Covered lifecycle jobs now declare explicit timeout values and set failOnTimeout = true.
  • The lifecycle validator requires covered job timeouts and expected runtimes to stay below the queue retry_after value by a safety margin.
  • If queue worker settings change during rollout, run vendor/bin/sail artisan queue:restart so workers pick up the new lifecycle contract.
  • Production and staging stop-wait windows (the grace period the process supervisor gives a worker before killing it) must stay above the longest covered timeout so workers can exit cleanly instead of orphaning in-flight runs.
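
The two timing constraints in these notes can be written as plain integer comparisons. This is a minimal sketch, not the lifecycle validator itself: the function names are hypothetical, all values are in seconds, and the strict inequalities are an assumption about what "safely below" and "above" mean here.

```shell
# Hypothetical checks; supply values from your own job and queue configuration.

# Succeeds only when timeout plus margin is strictly below retry_after.
timing_is_safe() {
  ts_timeout=$1
  ts_retry_after=$2
  ts_margin=$3
  [ $((ts_timeout + ts_margin)) -lt "$ts_retry_after" ]
}

# Succeeds only when the stop-wait window exceeds the longest covered timeout.
stop_wait_is_safe() {
  sw_stop_wait=$1
  sw_longest_timeout=$2
  [ "$sw_stop_wait" -gt "$sw_longest_timeout" ]
}
```

For example, `timing_is_safe 300 600 30` succeeds, while `timing_is_safe 580 600 30` fails because the margin pushes the timeout past retry_after.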

7. Manual smoke-check in the browser

  1. Open /admin/operations and inspect a fresh active run.
  2. Inspect a deliberately stale or reconciled run and confirm the list no longer presents it as ordinary in-progress work.
  3. Open /admin/operations/{run} for a reconciled run and confirm the detail page shows an operator-safe lifecycle explanation plus secondary diagnostics.
  4. Confirm the existing "View run" navigation remains canonical and no new destructive action is introduced.
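
The automated steps (1 through 6) can also be chained into one fail-fast pass. The sketch below reuses the exact test invocations listed in this quickstart; the `SAIL_TEST` variable and the `run_lifecycle_validation` function are hypothetical conveniences, not part of the repository.

```shell
# Assumption: SAIL_TEST holds the test command used throughout this quickstart.
# It is read when the function is called, so it can be overridden at any time.
SAIL_TEST="${SAIL_TEST:-vendor/bin/sail artisan test --compact}"

# Hypothetical helper: run steps 1 to 6 in document order, stop at the first failure.
run_lifecycle_validation() {
  for test_args in \
    --filter=OperationRunService \
    --filter=LifecycleReconciler \
    --filter=stale \
    --filter=failed \
    --filter=MaxAttempts \
    --filter=TimeoutExceeded \
    --filter=Run126 \
    --filter=orphaned \
    tests/Feature/Operations \
    --filter=Operations \
    --filter=retry_after \
    --filter=timeout
  do
    echo "==> $SAIL_TEST $test_args"
    # SAIL_TEST is left unquoted on purpose so the command string splits into words.
    $SAIL_TEST "$test_args" || return 1
  done
}
```

Calling `run_lifecycle_validation` from the project root runs the twelve invocations in order. Setting `SAIL_TEST=echo` first gives a dry run that only prints the arguments.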

Non-Goals For This Slice

  • No resumable execution or checkpoint recovery.
  • No queue backend replacement or Horizon adoption.
  • No new manual retry or re-drive UI.
  • No new OperationRun status enum.