TenantAtlas/specs/160-operation-lifecycle-guarantees/plan.md

Implementation Plan: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation

Branch: 160-operation-lifecycle-guarantees | Date: 2026-03-23 | Spec: specs/160-operation-lifecycle-guarantees/spec.md
Input: Feature specification from /specs/160-operation-lifecycle-guarantees/spec.md

Note: This template is filled in by the /speckit.plan command. See .specify/scripts/ for helper scripts.

Summary

Guarantee eventual terminal truth for covered queued OperationRun executions by hardening three seams that already exist but are incomplete: service-owned lifecycle transitions, queue-failure bridging, and stale-run healing. The implementation will keep the existing queued → running → completed lifecycle model, extend OperationRunService with generic stale-running reconciliation and structured reconciliation metadata, introduce a configuration-backed lifecycle policy for covered operation types and timing thresholds, add reusable failed-job bridging for priority queued jobs, and expose freshness and reconciliation semantics on the existing Monitoring surfaces without changing the overall Operations information architecture.

This is a reliability-hardening feature, not a queue-platform rewrite. The plan therefore avoids a new orchestration subsystem, avoids a second run-state model, and avoids a resumability design. Instead it builds on current repo seams such as OperationRunService, TrackOperationRun, the existing restore adapter reconciler, the backup-schedule reconcile command, OperationUxPresenter, centralized badge rendering, and the canonical Monitoring pages.

Technical Context

Language/Version: PHP 8.4.15
Primary Dependencies: Laravel 12, Filament 5, Livewire 4, Pest 4, Laravel queue workers, existing OperationRunService, TrackOperationRun, OperationUxPresenter, ReasonPresenter, BadgeCatalog domain badges, and current Operations Monitoring pages
Storage: PostgreSQL for operation_runs, jobs, and failed_jobs; JSONB-backed context, summary_counts, and failure_summary; configuration in config/queue.php and config/tenantpilot.php
Testing: Pest 4 feature, unit, and Filament/Livewire-focused tests, plus focused queue-interaction tests, all run through Laravel Sail
Target Platform: Laravel Sail web application serving the Filament admin panel and queued worker processes
Project Type: Laravel monolith web application
Performance Goals: Keep Monitoring render DB-only; ensure reconciliation scans active runs without material query explosion; guarantee deterministic convergence to terminal truth within scheduler cadence; avoid regressions in active-run list responsiveness
Constraints: Preserve service-owned OperationRun transitions, existing queued/running/completed statuses, existing outcome enums, and exactly-three-surface Ops-UX feedback; keep retry_after greater than job timeouts with safety margin; no new external calls during Monitoring render; no new panel provider or asset pipeline changes; destructive operations remain confirmation-backed under their originating features
Scale/Scope: V1 covers the exact operator-visible queued run types baseline_capture, baseline_compare, inventory_sync, policy.sync, policy.sync_one, entra_group_sync, directory_role_definitions.sync, backup_schedule_run, restore.execute, tenant.review_pack.generate, tenant.review.compose, and tenant.evidence.snapshot.generate, plus the shared queue-lifecycle infrastructure those types depend on

Constitution Check

GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.

Pre-Phase 0 Gate: PASS

  • Inventory-first: PASS. The feature does not alter inventory-versus-snapshot ownership and only hardens operational truth for queued work that already exists.
  • Read/write separation: PASS. No new remote write workflow is introduced. Existing write-capable operations such as restore keep their preview, confirmation, audit, and authorization requirements from their originating specs.
  • Graph contract path: PASS. No new Microsoft Graph contract or endpoint is introduced by this feature.
  • Deterministic capabilities: PASS. Existing operation start permissions and any dangerous follow-up actions remain tied to the canonical capability registry.
  • RBAC-UX planes: PASS. This feature stays in the admin /admin plane. Cross-plane behavior remains unchanged. Canonical Monitoring routes remain tenant-safe.
  • Workspace isolation: PASS. OperationRun remains workspace-scoped with optional tenant linkage, and workspace-level Monitoring continues to authorize from the run and its related workspace.
  • Tenant isolation: PASS. Tenant-linked runs shown on canonical Monitoring routes still require tenant entitlement and must remain 404 for non-entitled users.
  • Destructive confirmation: PASS. No new destructive action is added in V1. Existing destructive actions such as restore remain ->action(...)->requiresConfirmation() and server-authorized.
  • Global search safety: PASS. This feature does not introduce or alter globally searchable resources. Existing global-search page contracts remain unchanged.
  • Run observability and Ops-UX: PASS WITH REQUIRED HARDENING. The feature strengthens the existing OperationRun contract instead of bypassing it. Start surfaces remain enqueue-only. Monitoring remains DB-only. Terminal truth remains service-owned.
  • Ops-UX lifecycle ownership: PASS. Research confirms OperationRunService is already the canonical transition path and the plan preserves that boundary.
  • Ops-UX summary counts: PASS. No new summary-count shape is introduced; numeric-only summary rules remain unchanged.
  • Ops-UX guards: PASS WITH REQUIRED TEST UPDATES. The feature must extend regression guards to cover reconciliation transitions and failed-job bridging without permitting direct status mutation.
  • Ops-UX system runs: PASS. Initiator-null behavior remains unchanged; reconciled system runs remain auditable via Monitoring without terminal DB notification fan-out.
  • Automation and idempotency: PASS WITH REQUIRED EXPANSION. Existing active-run dedupe and WithoutOverlapping patterns can be reused, but the plan adds generic stale reconciliation and queue-failure bridging for more run types.
  • Badge semantics (BADGE-001): PASS WITH REQUIRED CENTRALIZATION. Freshness and reconciliation display must be derived through centralized presenters or badges rather than ad hoc table mappings.
  • UI naming (UI-NAMING-001): PASS. Operator-facing copy will use domain terms such as stale, reconciled, and infrastructure failure; low-level queue exceptions remain diagnostics-only.
  • Operator surfaces (OPSURF-001): PASS. The Operations index and run detail remain the canonical operator-first surfaces. Diagnostics stay secondary. No new dangerous UI workflow is introduced.
  • Filament Action Surface Contract: PASS. The feature changes semantics on existing Monitoring surfaces, not their action inventory.
  • Filament UX-001: PASS. Existing Operations pages remain in place and only gain new status semantics and diagnostics.
  • Livewire and Filament version safety: PASS. The plan remains Livewire v4 compliant and does not change Filament v5 panel registration in bootstrap/providers.php.
  • Asset strategy: PASS. No new global or panel-specific assets are planned, so deployment steps for filament:assets remain unchanged.

Project Structure

Documentation (this feature)

specs/160-operation-lifecycle-guarantees/
├── plan.md
├── research.md
├── data-model.md
├── quickstart.md
├── contracts/
│   └── operation-run-lifecycle.openapi.yaml
└── tasks.md

Source Code (repository root)

app/
├── Console/
│   └── Commands/
├── Filament/
│   ├── Pages/
│   │   └── Monitoring/
│   ├── Resources/
│   │   └── OperationRunResource.php
│   └── Widgets/
│       └── Operations/
├── Jobs/
│   └── Middleware/
├── Models/
│   └── OperationRun.php
├── Notifications/
├── Policies/
│   └── OperationRunPolicy.php
├── Services/
│   ├── Operations/
│   ├── Providers/
│   ├── SystemConsole/
│   └── OperationRunService.php
└── Support/
    ├── Badges/
    ├── OpsUx/
    ├── OperationRun*.php
    └── ReasonTranslation/
config/
├── queue.php
└── tenantpilot.php
routes/
└── web.php
tests/
├── Feature/
│   ├── Operations/
│   ├── Filament/
│   └── Rbac/
└── Unit/
    ├── Jobs/
    ├── Operations/
    └── Support/

Structure Decision: Use the existing Laravel monolith and strengthen the current operational support layer rather than introducing a separate orchestration subsystem. The implementation seam stays centered on OperationRunService, queue job classes and middleware under app/Jobs, generic reconciliation services or commands under app/Services and app/Console/Commands, and semantic presentation on the existing Monitoring pages, presenters, and badges.

Phase 0 Research Summary

  • TrackOperationRun correctly transitions a run to running, catches in-handle exceptions, and marks success when the job returns cleanly, but it does not protect against failures that happen before middleware entry or after infrastructure has already declared the job dead.
  • OperationRunService already owns transitions and already has reusable stale-queued detection via isStaleQueuedRun() plus failStaleQueuedRun(), but there is no generic stale-running detection or stale-running fail helper.
  • The repo already has two partial reconciliation patterns: TenantpilotReconcileBackupScheduleOperationRuns for backup_schedule_run and AdapterRunReconciler plus ops:reconcile-adapter-runs for restore.execute. Both prove the repo accepts DB-driven healing, but both are too type-specific for Spec 160.
  • Research confirmed only one queued job currently implements a failed(Throwable $e) bridge: BulkBackupSetRestoreJob. Priority operation jobs such as baseline capture, baseline compare, inventory sync, policy sync, restore execution, and tenant review generation do not yet bridge failed queue truth back to OperationRun through failed().
  • Queue timing defaults currently use retry_after = 600 seconds for database, Redis, and Beanstalk connections, while at least some long-running jobs declare public int $timeout = 300;. The repo does not yet centralize or validate the timing invariant per covered operation type.
  • Laravel queue documentation for version 12 confirms the relevant design rules: failed() is invoked when a job exhausts attempts or times out, the exception may be MaxAttemptsExceededException or TimeoutExceededException, attempts can be consumed without handle() executing, and job timeout must remain shorter than retry_after to avoid duplicate or orphaned processing.
  • Existing operator-facing Monitoring already has reusable seams for safe UI semantics: OperationUxPresenter, ReasonPresenter, OperationRunStatusBadge, OperationRunOutcomeBadge, and canonical Operations pages. These should be extended rather than bypassed.
  • Existing OperationRunTriageService already proves the repo is willing to store triage metadata in context['triage'] and keep service-owned terminal transitions. The same pattern can be reused for reconciliation metadata without requiring a schema change in the first slice.
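Grounded in the Laravel 12 semantics above, the reusable failed-job bridge could look like the sketch below. The job class name and the OperationRunService::failFromQueue() helper are illustrative assumptions, not existing code; only the failed(Throwable $e) hook itself is guaranteed by the framework.

```php
<?php

namespace App\Jobs;

use App\Services\OperationRunService;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;
use Throwable;

// Illustrative covered job; the class name is an assumption for discussion.
class CaptureBaselineJob implements ShouldQueue
{
    use Queueable;

    // Job-owned identity: the bridge resolves the owning run deterministically,
    // with no failed_jobs payload parsing.
    public function __construct(public readonly int $operationRunId)
    {
    }

    public function handle(): void
    {
        // Normal path: TrackOperationRun middleware owns queued -> running -> completed.
    }

    /**
     * Laravel invokes failed() when the job exhausts attempts or times out,
     * even when handle() never executed. The exception may be
     * MaxAttemptsExceededException or TimeoutExceededException.
     */
    public function failed(Throwable $e): void
    {
        // failFromQueue() is a hypothetical service-owned terminal transition;
        // the point is that the bridge never mutates status directly.
        app(OperationRunService::class)->failFromQueue(
            runId: $this->operationRunId,
            reasonCode: 'queue.job_failed',
            diagnostics: ['exception' => $e::class, 'message' => $e->getMessage()],
        );
    }
}
```

Because attempts can be consumed without handle() ever executing, the bridge is deliberately paired with scheduled stale-run reconciliation rather than trusted alone.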

Phase 1 Design

Implementation Approach

  1. Extend the existing OperationRunService into a generic lifecycle-healing seam.

    • Add generic running-freshness assessment parallel to isStaleQueuedRun().
    • Add a service-owned stale-running failure transition parallel to failStaleQueuedRun().
    • Normalize reconciliation metadata and reason codes through the same service-owned transition path.
  2. Introduce a configuration-backed lifecycle policy for covered operation types.

    • Define which run types are in scope for V1.
    • Define status-specific stale thresholds and expected timeout bounds per type.
    • Record for each covered type whether terminal truth is guaranteed by a direct failed-job bridge, scheduled reconciliation, or both.
    • Keep this policy in configuration rather than a new table so rollout remains schema-light.
  3. Use a layered truth-bridge strategy instead of a single mechanism.

    • First-line bridge: reusable failed(Throwable $e) handling for covered jobs that can deterministically resolve their owning OperationRun.
    • Safety-net bridge: scheduled stale-run reconciliation that heals queued or running runs when direct failure handling never executes.
    • Reject a failed-jobs-parser-first design for V1 because payload parsing is brittle compared with job-owned identity plus stale reconciliation.
  4. Generalize current type-specific reconcile patterns into one lifecycle reconciler.

    • Keep restore adapter reconciliation and backup schedule reconciliation as precedents.
    • Introduce a generic active-run reconciliation service or command that scans covered queued and running runs, evaluates freshness, and force-resolves only when evidence justifies it.
    • Preserve idempotency so repeated runs do not overwrite already terminal records.
  5. Keep the current status model and enrich semantics through derived freshness state.

    • Preserve queued, running, and completed plus existing outcomes.
    • Introduce derived freshness states such as fresh, likely_stale, and reconciled_failed at the presenter and contract layer, not as new top-level database statuses.
    • Centralize that mapping through existing Ops-UX and badge helpers.
  6. Validate runtime timing as part of product correctness.

    • Capture the invariant that worker timeout and per-job timeout must stay below queue retry_after with safety margin.
    • Add focused validation or tests that fail when a covered lifecycle policy violates that invariant.
    • Keep deployment documentation explicit about queue worker restart and stop-wait expectations.
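As a concrete illustration of steps 2 and 6, the configuration-backed policy could take a shape like the fragment below, assuming config/tenantpilot.php as its home; every key name and threshold value here is an illustrative assumption, not existing configuration.

```php
// config/tenantpilot.php (fragment; key names and values are assumptions)
'operation_lifecycle' => [
    // Reconciliation cadence must satisfy the deterministic-convergence goal.
    'reconcile_cadence_seconds' => 300,

    'covered_types' => [
        'baseline_capture' => [
            'job_timeout_seconds' => 300,          // must stay below retry_after (600) with margin
            'stale_queued_after_seconds' => 600,   // queued longer than this is suspect
            'stale_running_after_seconds' => 900,  // exceeds timeout plus retry_after headroom
            'truth_bridges' => ['failed_job', 'scheduled_reconciliation'],
        ],
        'backup_schedule_run' => [
            'job_timeout_seconds' => 300,
            'stale_queued_after_seconds' => 600,
            'stale_running_after_seconds' => 900,
            'truth_bridges' => ['scheduled_reconciliation'],
        ],
        // ...one entry per covered type from the Scale/Scope list.
    ],
],
```

Keeping the registry in configuration keeps rollout schema-light and gives the timing-invariant validation a single source of truth to check.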

Planned Workstreams

  • Workstream A: Lifecycle policy core

    • Add a configuration-backed coverage and threshold registry for V1 operation types.
    • Extend OperationRunService with generic stale-running assessment and reconciliation helpers.
  • Workstream B: Queue-failure bridges

    • Introduce a reusable failed-job bridge pattern for priority covered jobs.
    • Normalize the existing restore bridge, apply the direct bridge pattern first to the baseline capture, baseline compare, inventory sync, policy sync, and tenant review composition paths, and have the lifecycle policy explicitly mark which covered types rely on scheduled reconciliation instead.
  • Workstream C: Generic stale-run reconciliation

    • Create a lifecycle reconciler service and scheduled command for covered queued and running runs.
    • Fold current type-specific healing logic into reusable primitives where possible.
  • Workstream D: Monitoring semantics

    • Extend existing presenter and badge seams to distinguish fresh active, likely stale, and reconciled-failed semantics.
    • Update the Operations index and run detail to surface those meanings without changing route authority or action inventory.
    • Expose minimal aggregate reconciliation visibility through Monitoring filters or queryable summaries instead of a new dashboard.
  • Workstream E: Runtime invariant enforcement

    • Add validation for timeout and retry_after alignment and document deployment expectations for workers.
    • Keep the validation focused on covered operation types rather than every queued class in the repo.
  • Workstream F: Regression hardening

    • Add focused Pest coverage for stale queued reconciliation, stale running reconciliation, idempotency, failed-job bridging, UI truth semantics, and queue timing guards.
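The generic reconciler in Workstream C could follow the shape below. The class name, the isStaleRunningRun()/failStaleRunningRun() helpers (the planned running-side parallels of the existing stale-queued pair), and the config keys are assumptions sketched for discussion, not existing code.

```php
<?php

namespace App\Services\Operations;

use App\Models\OperationRun;
use App\Services\OperationRunService;

class OperationLifecycleReconciler
{
    public function __construct(private readonly OperationRunService $runs)
    {
    }

    /**
     * Scan covered active runs and force-resolve only those whose policy
     * thresholds have elapsed. Idempotent: terminal runs are never selected,
     * so repeat executions cannot overwrite terminal truth.
     */
    public function reconcile(): int
    {
        $coveredTypes = array_keys(
            config('tenantpilot.operation_lifecycle.covered_types', [])
        );
        $healed = 0;

        OperationRun::query()
            ->whereIn('status', ['queued', 'running'])
            ->whereIn('type', $coveredTypes)
            ->each(function (OperationRun $run) use (&$healed): void {
                if ($run->status === 'queued' && $this->runs->isStaleQueuedRun($run)) {
                    $this->runs->failStaleQueuedRun($run);
                    $healed++;
                } elseif ($run->status === 'running' && $this->runs->isStaleRunningRun($run)) {
                    // Planned helpers mirroring the existing stale-queued pair.
                    $this->runs->failStaleRunningRun($run);
                    $healed++;
                }
            });

        return $healed;
    }
}
```

A thin scheduled artisan command (a name like ops:reconcile-operation-runs is hypothetical) would invoke reconcile() at the policy cadence, mirroring the existing ops:reconcile-adapter-runs precedent.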

Testing Strategy

  • Add unit tests for OperationRunService stale-running assessment and stale-running failure transitions alongside existing stale-queued behavior.
  • Add unit or feature tests for the generic lifecycle reconciler covering stale queued runs, stale running runs, fresh queued runs, fresh running runs, already completed runs, and idempotent repeat execution.
  • Add focused queue-lifecycle tests proving that a covered job with a failed() callback transitions the owning run to terminal failed truth via OperationRunService.
  • Add a Run-126-style regression test where a run is left running, normal finalization never executes, time advances beyond threshold, and reconciliation marks it terminal failed.
  • Add Monitoring-focused Filament or feature tests proving the Operations index and run detail distinguish fresh activity from stale or reconciled failure semantics without implying indefinite active progress.
  • Add focused Monitoring coverage for minimal aggregate reconciliation visibility, such as filters or summaries that show how many runs were reconciled and which covered types are most affected.
  • Add focused authorization coverage to confirm canonical Monitoring access remains 404 for non-entitled users and capability denial remains 403 where relevant.
  • Add timing guard tests or assertions that fail when covered lifecycle policy timeouts are not safely below retry_after.
  • Add representative start-surface coverage proving queued intent still uses OperationUxPresenter, and all View run affordances still resolve to the canonical tenantless Monitoring run detail.
  • Run the minimum focused Pest suite through Sail; full-suite execution is not required for planning artifacts.
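The timing guard could be expressed as a focused Pest test along these lines; the operation_lifecycle config keys are the same illustrative assumptions used elsewhere in this plan.

```php
<?php

// Illustrative Pest guard: fail the suite when any covered type's job timeout
// is not safely below the queue connection's retry_after.

it('keeps covered job timeouts safely below retry_after', function (): void {
    $retryAfter = (int) config('queue.connections.database.retry_after'); // 600 today
    $margin = 60; // required safety headroom

    $covered = config('tenantpilot.operation_lifecycle.covered_types', []);

    expect($covered)->not->toBeEmpty();

    foreach ($covered as $type => $policy) {
        expect($policy['job_timeout_seconds'])->toBeLessThanOrEqual($retryAfter - $margin);
    }
});
```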

Post-Phase 1 Re-check: PASS

  • The design extends existing service, presenter, and command seams instead of introducing a second orchestration stack.
  • No new panel provider, no Livewire version change, and no Filament asset change is required; provider registration remains unchanged in bootstrap/providers.php.
  • No new global-search behavior is introduced. Existing globally searchable resources remain governed by their current View or Edit page contracts.
  • Destructive actions remain confirmation-backed under existing feature flows; this feature only hardens eventual run truth.

Complexity Tracking

No constitution violations or exceptional complexity are planned at this stage.