Ahmed Darrazi 59fc90a4db feat: harden operation lifecycle monitoring

2026-03-23 22:52:37 +01:00

Feature Specification: Operation Lifecycle Guarantees & Queue-to-Domain Failure Reconciliation

Feature Branch: 160-operation-lifecycle-guarantees
Created: 2026-03-23
Status: Draft
Input: User description: "Introduce explicit lifecycle guarantees for queued OperationRun executions so no operation can remain indefinitely ambiguous, orphaned, or misleadingly active without the platform eventually forcing a deterministic terminal truth."

Spec Scope Fields (mandatory)

Scope: workspace
Primary Routes:
- /admin/operations
- /admin/operations/{run}
- Existing operator-facing start surfaces that enqueue covered OperationRun work, including baseline capture, baseline compare, inventory sync, policy sync, Entra group sync, directory role-definition sync, backup schedule execution, restore execution, review pack generation, tenant review composition, and evidence snapshot generation flows
Data Ownership:
- OperationRun remains the canonical operational record for queued, operator-visible work.
- OperationRun remains a workspace-scoped monitoring record with optional tenant linkage for tenant-bound runs.
- Queue infrastructure evidence remains operational support data and must be mappable back to the owning OperationRun without changing workspace or tenant ownership boundaries.
- Reconciliation outcomes, failure reasons, and freshness semantics remain part of the same operational truth model rather than a separate shadow state.
RBAC:
- Authorization plane involved: the admin plane (/admin), including workspace-level Monitoring and tenant-context entry points that can start or inspect covered operations.
- Non-members or actors lacking workspace or tenant entitlement for a run receive deny-as-not-found behavior.
- Members who are allowed to view a run but lack capability to start or re-trigger a dangerous operation receive forbidden behavior.
- Existing capability registry remains canonical for operation start permissions and any dangerous follow-up actions.
- This feature does not broaden /system access and does not weaken tenant-safe run visibility.
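The deny-as-not-found versus forbidden split above can be sketched as a single decision function. This is an illustrative sketch only, written in Python for brevity; `Actor`, `RunScope`, `Decision`, `authorize`, and the capability name `restore.execute` are hypothetical names and do not come from the platform's capability registry, Gates, or Policies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Decision(Enum):
    ALLOW = 200
    FORBIDDEN = 403   # in scope, but missing a capability
    NOT_FOUND = 404   # deny-as-not-found: not entitled to the run's scope


@dataclass(frozen=True)
class Actor:
    workspace_ids: FrozenSet[int]
    tenant_ids: FrozenSet[int]
    capabilities: FrozenSet[str]


@dataclass(frozen=True)
class RunScope:
    workspace_id: int
    tenant_id: Optional[int] = None  # None for workspace-level runs


def is_entitled(actor: Actor, run: RunScope) -> bool:
    """Entitlement is resolved from the run record, not from UI tenant context."""
    if run.workspace_id not in actor.workspace_ids:
        return False
    return run.tenant_id is None or run.tenant_id in actor.tenant_ids


def authorize(actor: Actor, run: RunScope, capability: Optional[str] = None) -> Decision:
    # Scope is checked first so a missing capability never reveals
    # that a run exists outside the actor's entitlement.
    if not is_entitled(actor, run):
        return Decision.NOT_FOUND
    if capability is not None and capability not in actor.capabilities:
        return Decision.FORBIDDEN
    return Decision.ALLOW
```

Checking entitlement before capability is the design choice that matters here: an actor outside the workspace or tenant always sees not-found, even when the request also names a capability the actor lacks.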

For canonical-view specs, the spec MUST define:

Default filter behavior when tenant-context is active: /admin/operations may continue to prefilter to the selected tenant as a convenience, but lifecycle legitimacy, stale detection, and terminal truth must not depend on selected tenant context. /admin/operations/{run} remains record-authoritative even when tenant context is mismatched, stale, or empty.
Explicit entitlement checks preventing cross-tenant leakage: The canonical run viewer must authorize from the resolved OperationRun, its workspace relationship, and tenant entitlement for any referenced tenant. Lifecycle reconciliation or stale labeling must not reveal tenant-linked run details to non-members or non-entitled actors.

Operator Surface Contract (mandatory when operator-facing surfaces are changed)

Surface	Primary Persona	Surface Type	Primary Operator Question	Default-visible Information	Diagnostics-only Information	Status Dimensions Used	Mutation Scope	Primary Actions	Dangerous Actions
Operations index	Workspace operator	List	Which operations are truly active, which are stale, and which require attention?	Run type, initiator, tenant or workspace scope, current lifecycle state, freshness signal, terminal outcome when available	Failure reason codes, reconciliation timestamps, queue-origin evidence, stale-threshold rationale	lifecycle, execution outcome, freshness, tenant scope	TenantPilot only	View run, filter active or stale runs	None on the index in V1
Operation run detail	Workspace operator	Detail	Did this run finish normally, fail normally, or get force-reconciled after losing lifecycle truth?	Run identity, started or completed timing, outcome, plain-language failure explanation, whether the platform reconciled the run automatically	Structured infrastructure reason, stale classification evidence, queue failure linkage, reconciliation audit context	lifecycle, execution outcome, freshness, auditability	TenantPilot only	Back to Operations, inspect run history and failure detail	Any future retry or re-drive affordance remains out of scope for V1
Existing operation start surfaces	Tenant operator or workspace operator	Action entry	If I start work now, will the platform later tell me the truth even when infrastructure goes wrong?	Existing queued intent messaging, run title, mutation scope, confirmation language where already required	Lifecycle contract diagnostics stay secondary to avoid cluttering start surfaces	lifecycle readiness, mutation scope	TenantPilot only, Microsoft tenant, or simulation only depending on the initiating feature	Start operation, view run	Existing destructive operations such as restore remain confirmation-protected and auditable

User Scenarios & Testing (mandatory)

User Story 1 - Force Terminal Truth For Orphaned Runs (Priority: P1)

As an operator, I need every covered queued operation to end in a trustworthy terminal state even when the queue infrastructure fails before normal execution cleanup happens, so that no run remains indefinitely ambiguous.

Why this priority: This is the core reliability failure. If terminal truth is not guaranteed, every higher-level governance workflow inherits false operational state.

Independent Test: Can be fully tested by creating a covered OperationRun, preventing the normal completion or failure path from finalizing it, advancing time past the stale threshold, and verifying that the system reconciles it to a terminal failed truth with an operator-safe reason.

Acceptance Scenarios:

Given a covered queued run moves to running and its normal lifecycle completion path never executes, When the platform detects that the run is stale, Then the run is forced to terminal failed truth with an explicit infrastructure-oriented reason.
Given a covered queued run is failed by queue infrastructure before normal middleware finalization can update the domain record, When reconciliation or a direct failure bridge runs, Then the queue failure is reflected in the owning run instead of remaining only in infrastructure evidence.
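The first scenario could be exercised against a reconciliation rule along these lines. This is a minimal sketch under stated assumptions: the threshold values, the `Run` fields, and the reason code format `lifecycle.stale_<status>` are hypothetical, and terminal failed truth is assumed to mean status `completed` with outcome `failed`. In the platform itself the transition would go through OperationRunService rather than a free function.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical freshness bounds per non-terminal status.
STALE_AFTER = {
    "queued": timedelta(minutes=15),
    "running": timedelta(minutes=60),
}


@dataclass(frozen=True)
class Run:
    id: int
    status: str                      # "queued", "running", or "completed"
    outcome: Optional[str]
    last_progress_at: datetime
    failure_reason: Optional[str] = None
    reconciled_at: Optional[datetime] = None


def reconcile(run: Run, now: datetime) -> Run:
    """Return the run unchanged unless it is non-terminal and stale."""
    threshold = STALE_AFTER.get(run.status)
    if threshold is None:
        return run  # terminal runs are never touched
    if now - run.last_progress_at <= threshold:
        return run  # still inside the freshness window
    return replace(
        run,
        status="completed",
        outcome="failed",
        failure_reason=f"lifecycle.stale_{run.status}",
        reconciled_at=now,
    )
```

Because a reconciled run is already `completed`, passing it through `reconcile` again returns it unchanged, which is one simple way to get idempotency.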

User Story 2 - Show Honest Liveness In Monitoring (Priority: P2)

As an operator viewing Monitoring, I need the UI to stop implying that obviously stale work is still normally active, so that I can decide whether to investigate, retry manually later, or move on.

Why this priority: The current misleading spinner is the visible symptom of the underlying integrity bug. Honest operator truth is required for trust and supportability.

Independent Test: Can be fully tested by presenting fresh active runs, stale non-terminal runs, and reconciled failed runs on the Operations surfaces and verifying that each state is represented distinctly without indefinite optimistic activity cues.

Acceptance Scenarios:

Given a run is fresh and plausibly progressing, When the operator opens the Operations index or run detail, Then the UI presents it as legitimately active.
Given a run has exceeded the accepted freshness window without credible progress evidence, When the operator views it before or after reconciliation, Then the UI no longer implies normal active progress and surfaces stale or reconciled failure semantics clearly.
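One way to keep these scenarios testable is to derive the display state from stored fields only, with lifecycle, outcome, and freshness kept as separate dimensions. The sketch below is illustrative: the function `present`, the window values, and the labels `fresh`, `likely_stale`, `reconciled`, and `not_applicable` are hypothetical and are not the platform's badge vocabulary.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Hypothetical freshness windows per non-terminal status.
FRESHNESS_WINDOW = {
    "queued": timedelta(minutes=15),
    "running": timedelta(minutes=60),
}


def present(
    status: str,
    outcome: Optional[str],
    last_progress_at: datetime,
    reconciled_at: Optional[datetime],
    now: datetime,
) -> Dict[str, object]:
    """Derive three separate display dimensions from stored run fields only."""
    if status in FRESHNESS_WINDOW:
        age = now - last_progress_at
        freshness = "fresh" if age <= FRESHNESS_WINDOW[status] else "likely_stale"
    else:
        freshness = "reconciled" if reconciled_at is not None else "not_applicable"
    return {
        "lifecycle": status,
        "outcome": outcome or "pending",
        "freshness": freshness,
        # Spinner-style cues are shown only for fresh, non-terminal runs.
        "show_activity_cue": freshness == "fresh",
    }
```

A stale run that has not been reconciled yet still reports lifecycle `running`, but the separate freshness value lets the UI drop the activity cue without inventing a new top-level status.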

User Story 3 - Prevent Repeat Incidents Through Lifecycle Contracts (Priority: P3)

As a platform owner, I need covered queued jobs and runtime defaults to satisfy explicit lifecycle and timing guarantees, so that reservation expiry, timeout misalignment, and silent lifecycle divergence stop being accepted as normal behavior.

Why this priority: Reconciliation heals truth after failure, but the platform also needs guardrails that reduce the chance of creating orphaned runs in the first place.

Independent Test: Can be fully tested by verifying that covered jobs declare explicit lifecycle bounds, that runtime timing relationships are documented and validated, and that misaligned or ambiguous lifecycle settings are caught by automated checks.

Acceptance Scenarios:

Given a covered long-running operation type is introduced or updated, When its lifecycle contract is reviewed, Then the job has an explicit timeout strategy and a credible path to terminal failure truth.
Given deployment or worker timing values would allow legitimate work to outlive queue reservation semantics, When the platform validates lifecycle runtime assumptions, Then the mismatch is detected or documented as invalid rather than silently accepted.
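The second scenario amounts to an automated check over declared job timeouts and the queue reservation window. The following is a sketch under assumptions: the job names, the 30-second safety margin, and the function `timing_violations` are hypothetical, and `retry_after` stands for whatever reservation expiry value the deployment actually configures.

```python
from typing import Dict, List, Optional


def timing_violations(
    job_timeouts: Dict[str, Optional[int]],
    retry_after: int,
    safety_margin: int = 30,
) -> List[str]:
    """List jobs whose timing could let work outlive the queue reservation.

    All durations are in seconds. A timeout of None means the job relies
    on implicit worker defaults, which is reported as a violation.
    """
    problems: List[str] = []
    for job, timeout in sorted(job_timeouts.items()):
        if timeout is None:
            problems.append(f"{job}: no explicit timeout declared")
        elif timeout + safety_margin > retry_after:
            problems.append(
                f"{job}: timeout {timeout}s + margin {safety_margin}s "
                f"exceeds retry_after {retry_after}s"
            )
    return problems
```

Running a check like this in the test suite or at deploy time turns a silent misalignment into a visible failure; an empty list means every covered job fits inside the reservation window with the chosen margin.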

Edge Cases

A run remains in queued because a worker never starts it, the worker crashes before status handoff, or dispatch-level failure evidence exists without a domain transition.
A run remains in running because the process dies, is killed, or times out after setting active state but before normal finalization.
Queue infrastructure records a decisive terminal failure but no matching middleware or in-job cleanup path executes.
A run becomes terminal through the normal lifecycle path shortly before reconciliation inspects it; reconciliation must not overwrite or flap a legitimately completed run.
A long-running but healthy job approaches the stale threshold; thresholds must be conservative enough to avoid false terminal failure for legitimate work.
A workspace-level run with no tenant reference must still converge to terminal truth without being mistaken for a tenant-leakage exception.
Scheduled or system-initiated runs with no initiator must still become terminally truthful while respecting the initiator-null notification rule.

Requirements (mandatory)

Constitution alignment (required): This feature governs long-running, queued, and scheduled OperationRun work. It introduces no new Microsoft Graph contracts by itself, but it does require explicit lifecycle guarantees for covered OperationRun types, run observability on Monitoring surfaces, audit-safe failure semantics, and regression tests. Existing operation start surfaces must preserve their current safety gates, previews, confirmations, and audit requirements. If a covered flow performs a dangerous mutation such as restore, that mutation remains confirmation-protected and auditable; this feature only guarantees that its run truth cannot remain orphaned.

Constitution alignment (OPS-UX): This feature reuses existing OperationRun records and remains fully subject to the three-surface Ops-UX contract. Queued intent feedback remains toast-only. Active awareness remains limited to the active-operations widget and Monitoring run views. Terminal truth remains represented by the canonical run record and terminal notification policy. OperationRun.status and OperationRun.outcome transitions remain service-owned and must continue to occur only through OperationRunService, including reconciliation transitions. summary_counts remain numeric-only and keyed from the canonical registry. Scheduled and system runs continue to omit terminal DB notifications when there is no initiator, while Monitoring remains the authoritative audit surface. Regression coverage must include service-owned lifecycle transitions, stale-run reconciliation, and failure bridging without reintroducing direct status mutation.

Constitution alignment (RBAC-UX): This feature does not broaden authorization scope but does change the truth semantics shown on /admin/operations and /admin/operations/{run}. The admin plane (/admin) remains the only plane involved. Cross-plane access remains deny-as-not-found. For this feature, 404 means the actor is not entitled to the workspace or tenant scope of the run; 403 means the actor is in scope but lacks a capability for an operation-start or dangerous follow-up action. Authorization remains server-side through existing Gates, Policies, and the canonical capability registry. Global search and linked Monitoring access remain tenant-safe. Existing destructive actions such as restore remain confirmation-required. Validation must include at least one positive Monitoring access test and one negative tenant-entitlement or capability test alongside the lifecycle tests.

Constitution alignment (OPS-EX-AUTH-001): Not applicable beyond reaffirming that lifecycle bridging and stale reconciliation belong to Monitoring and queued operation execution only. Authentication handshake exceptions on /auth/* remain unrelated and must not be used as an exception path here.

Constitution alignment (BADGE-001): This feature changes the meaning shown by lifecycle and outcome badges on Operations surfaces. Badge semantics for queued, running, completed, failed, and any stale or reconciled indicators must remain centralized so the same run state is not mapped differently across the index, detail view, and widgets. Tests must cover any newly exposed stale or reconciled display values.

Constitution alignment (UI-NAMING-001): The target object is the OperationRun. Primary operator verbs remain View run and existing start verbs from initiating surfaces. New operator-facing copy must favor domain language such as stale, reconciled, infrastructure failure, and no longer active over implementation-first phrasing such as MaxAttemptsExceededException or retry_after mismatch in primary labels. Low-level queue or reservation details may appear only as secondary diagnostics. The same lifecycle vocabulary must be preserved across run titles, status presentation, notifications, and audit prose.

Constitution alignment (OPSURF-001): This feature materially refactors the meaning of the Operations index and run detail without replacing their overall layout. Default-visible content on /admin must remain operator-first: what ran, whether it is still trustworthy as active, how it ended, and whether the platform reconciled it automatically. Raw queue evidence and stale-threshold reasoning remain diagnostics-only. Status dimensions must remain distinct: execution outcome, lifecycle state, and freshness or reconciliation state must not collapse into one misleading badge. Mutating start surfaces continue to disclose mutation scope before execution through their originating specs. Dangerous actions continue to follow the existing safe-execution pattern; this feature does not introduce new dangerous actions.

Constitution alignment (Filament Action Surfaces): This feature modifies existing Filament Monitoring surfaces and therefore includes the UI Action Matrix below. The Action Surface Contract is satisfied. No exemption is needed because the feature changes state semantics and diagnostics, not the basic action inventory.

Constitution alignment (UX-001 — Layout & Information Architecture): Existing Operations list and run detail layouts remain in place. UX-001 stays satisfied so long as the run detail continues to present grouped operational sections instead of raw dumps, and the Operations table preserves search, sort, and filtering over core dimensions such as type, status, outcome, freshness, and tenant scope. Empty states remain single-CTA and explanatory. This feature changes semantic truth, not form layout.

Functional Requirements

FR-160-001: The system MUST guarantee that every covered queued OperationRun reaches eventual terminal domain truth or remains demonstrably fresh and legitimately active.
FR-160-002: The system MUST define the covered OperationRun operation types for V1 as baseline capture, baseline compare, inventory sync, policy sync including single-policy sync, Entra group sync, directory role-definition sync, backup schedule execution, restore execution, review pack generation, tenant review composition, and evidence snapshot generation.
FR-160-003: The system MUST provide at least one deterministic bridge from queue or infrastructure terminal failure evidence back to the owning OperationRun for every covered run type.
FR-160-004: A queue-level terminal failure for a covered run MUST eventually be reflected on the owning OperationRun as terminal failed truth rather than remaining only in infrastructure records or logs.
FR-160-005: The system MUST detect stale covered runs in queued or running states using explicit freshness bounds that distinguish fresh active work from orphaned or abandoned work.
FR-160-006: The system MUST reconcile stale covered queued runs to terminal failed truth when no credible evidence exists that the run is still legitimately progressing.
FR-160-007: The system MUST reconcile stale covered running runs to terminal failed truth when no credible evidence exists that the run is still legitimately progressing.
FR-160-008: Reconciliation MUST be conservative, idempotent, and limited to non-terminal runs; it MUST never mutate a legitimately completed run.
FR-160-009: Reconciled terminal failures MUST preserve operator-safe, infrastructure-oriented reason semantics that distinguish normal execution failure from stale or orphaned lifecycle healing.
FR-160-010: Covered queued jobs MUST satisfy a minimum lifecycle contract that provides either a direct terminal-failure bridge, a shared inherited failure bridge, or a documented fallback reconciliation path to eventual terminal truth.
FR-160-011: Covered long-running jobs MUST declare explicit runtime bounds rather than relying solely on implicit worker defaults.
FR-160-012: The timeout behavior for covered long-running jobs MUST be intentionally defined so that timeout-related failure semantics are not ambiguous.
FR-160-013: The platform MUST preserve traceable linkage between a failed or reconciled infrastructure event and the owning OperationRun.
FR-160-014: The platform MUST establish and document a runtime timing invariant that keeps queue reservation timing safely above legitimate covered job execution duration.
FR-160-015: Lifecycle runtime validation MUST make misaligned timing relationships detectable rather than allowing silent divergence between queue truth and domain truth.
FR-160-016: The Operations index and run detail MUST distinguish legitimately active runs from likely stale runs and reconciled failures without implying indefinite normal activity for obviously stale work.
FR-160-017: Operator-facing Monitoring surfaces MUST communicate whether a run ended normally or was force-resolved by lifecycle reconciliation.
FR-160-018: Existing happy-path lifecycle handling such as middleware-based run tracking MUST remain valid but MUST no longer be treated as the only mechanism that can guarantee terminal truth.
FR-160-019: The system MUST support evidence-based reconciliation using available operational signals such as run freshness, queue failure evidence, and other lifecycle evidence that is collected outside render time and requires no external calls while a Monitoring page renders.
FR-160-020: The system MUST preserve the current top-level lifecycle model of queued, running, and completed while enriching failure reason and freshness interpretation instead of introducing a second terminal-state model for V1.
FR-160-021: Reconciliation and direct failure bridging MUST remain auditable, including when reconciliation happened, why the run was judged stale or orphaned, and whether the normal lifecycle path was bypassed.
FR-160-022: The system SHOULD make it possible to recover aggregate visibility into how many runs were force-reconciled and which operation types are most affected, even if V1 does not add a dedicated observability dashboard.
FR-160-023: Manual database or Tinker intervention MUST no longer be the normal recovery path for orphaned covered runs.
FR-160-024: Validation coverage MUST include stale queued reconciliation, stale running reconciliation, idempotency, fresh-run non-interference, direct failure bridging, normal failure-path coexistence, a Run-126-style orphaned running regression, and runtime timing guard coverage where practical.
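The direct failure bridge described in FR-160-003, FR-160-004, FR-160-008, FR-160-009, and FR-160-013 can be sketched as a mapping from queue failure evidence to the owning run. This is illustrative only: `QueueFailure`, `Run`, `bridge_failure`, the reason codes, and the `failed_job_ref` field are hypothetical, and terminal failed truth is assumed to mean status `completed` with outcome `failed`. In the platform the transition would be owned by OperationRunService.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional

# Hypothetical mapping from infrastructure exception names to
# operator-safe reason codes; unknown exceptions use the default.
REASON_BY_EXCEPTION = {
    "MaxAttemptsExceededException": "infrastructure.attempts_exhausted",
}
DEFAULT_REASON = "infrastructure.queue_failure"


@dataclass(frozen=True)
class QueueFailure:
    run_id: int
    exception_class: str
    failed_job_ref: str  # pointer back to the infrastructure record


@dataclass(frozen=True)
class Run:
    id: int
    status: str                      # "queued", "running", or "completed"
    outcome: Optional[str] = None
    failure_reason: Optional[str] = None
    evidence_ref: Optional[str] = None


def bridge_failure(runs: Dict[int, Run], failure: QueueFailure) -> Optional[Run]:
    """Reflect a queue-level terminal failure on the owning run."""
    run = runs.get(failure.run_id)
    if run is None or run.status == "completed":
        return run  # unknown run, or already terminal: never overwrite
    reason = REASON_BY_EXCEPTION.get(failure.exception_class, DEFAULT_REASON)
    healed = replace(
        run,
        status="completed",
        outcome="failed",
        failure_reason=reason,
        evidence_ref=failure.failed_job_ref,
    )
    runs[run.id] = healed
    return healed
```

The raw exception name stays in the evidence record, while the run carries only the operator-safe reason code and a reference back to that evidence.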

UI Action Matrix (mandatory when Filament is changed)

If this feature adds/modifies any Filament Resource / RelationManager / Page, fill out the matrix below.

For each surface, list the exact action labels, whether they are destructive (confirmation? typed confirmation?), RBAC gating (capability + enforcement helper), and whether the mutation writes an audit log.

Surface	Location	Header Actions	Inspect Affordance (List/Table)	Row Actions (max 2 visible)	Bulk Actions (grouped)	Empty-State CTA(s)	View Header Actions	Create/Edit Save+Cancel	Audit log?	Notes / Exemptions
Operations index	`/admin/operations`	Existing filter and navigation actions remain	Existing linked rows to run detail remain the inspect affordance	`View run` remains the primary inspect action	Existing grouped bulk actions unchanged if present	Existing empty-state CTA remains	Not applicable	Not applicable	No direct mutation from index in V1	Table semantics change to show freshness and reconciliation truth without changing the core action surface.
Operation run detail	`/admin/operations/{run}`	`Back to Operations` and existing contextual navigation remain	Route-resolved record inspection	No new row actions	None	Not applicable	Existing view actions remain; no new destructive action is introduced	Not applicable	No direct mutation from the page in V1	Detail page exposes reconciled or stale semantics and diagnostics but does not add retry or re-drive actions in V1.
Existing operation start surfaces	Existing baseline, restore, backup schedule, inventory sync, and review generation surfaces	Existing start or preview actions remain	Existing route or view affordances remain	Existing `View run` or equivalent inspect affordance remains where already present	Existing grouped actions remain	Existing empty-state CTA remains	Existing view header actions remain	Existing save and cancel behavior unchanged	Yes where the originating feature already writes audit logs	Exemption: this spec changes lifecycle guarantees behind existing start actions, not the visible action inventory. Existing destructive operations remain confirmation-required under their originating specs.

Key Entities (include if feature involves data)

Covered Operation Run: An operator-visible queued run whose lifecycle truth must converge even when queue infrastructure fails outside the normal happy path.
Queue Failure Evidence: Any infrastructure-originated signal that indicates a covered queued job terminally failed, timed out, exhausted attempts, or became otherwise non-viable.
Lifecycle Reconciliation Decision: The platform decision that a non-terminal run is stale or orphaned and must be force-resolved to terminal failed truth.
Freshness Window: The accepted age or progress boundary that separates plausibly active queued or running work from stale work.
Lifecycle Contract: The minimum expectations a covered queued job must satisfy so the platform can map infrastructure truth back to domain truth.
Runtime Timing Invariant: The deployment and worker relationship that prevents legitimate covered work from outliving queue reservation semantics.

Success Criteria (mandatory)

Measurable Outcomes

SC-160-001: In focused lifecycle regression coverage, 100% of covered stale queued runs and stale running runs are force-resolved to terminal failed truth within the configured reconciliation window.
SC-160-002: In focused lifecycle regression coverage, 0 fresh covered runs are incorrectly reconciled while still within their accepted freshness window.
SC-160-003: In focused Run-126-style regression coverage, 100% of simulated orphaned runs that never receive normal completion or failure callbacks stop appearing as indefinitely active work.
SC-160-004: In focused Monitoring UX coverage, operators can distinguish normal failure from reconciled lifecycle failure on 100% of covered scenarios exercised by the spec tests.
SC-160-005: In focused lifecycle contract coverage, 100% of V1-covered queued operation types demonstrate a credible terminal-truth path through either direct failure bridging or reconciliation.
SC-160-006: In focused runtime guard coverage, timing relationships that would allow legitimate covered work to outlive queue reservation semantics are detected or documented rather than silently accepted.

Assumptions

Existing OperationRun top-level statuses and outcomes remain sufficient for V1 if failure reasons and freshness semantics are enriched.
Existing Monitoring pages remain the canonical operator-facing surfaces for run truth and do not perform external calls during render.
Existing happy-path middleware and service-based lifecycle handling remain useful and are preserved as first-line execution handling.
V1 prioritizes deterministic terminal truth over resumability, automatic re-drive, or advanced checkpoint-based recovery.
Covered operation types are limited to operator-visible runs that materially own or advance business-relevant work.

Dependencies

Existing OperationRun domain model and OperationRunService remain the canonical ownership boundary for lifecycle transitions.
Existing Monitoring and Operations surfaces remain the canonical render surfaces for run truth.
Existing initiating specs for restore, baseline, backup scheduling, inventory synchronization, and review generation remain the source of truth for their mutation scope and confirmation policy.

Risks

Stale thresholds that are too aggressive could force-fail legitimate long-running work.
A reconciliation safety net could hide recurring infrastructure weaknesses if aggregate visibility is not preserved.
Partial adoption across covered operation types could leave some orphaned paths untreated and create a false sense of completion.
Teams may mistake lifecycle reconciliation for full resiliency even though resumability and advanced retry orchestration remain out of scope.

Summary

Run 126 showed that queue truth, domain truth, and operator-visible truth can diverge when infrastructure-level failure happens before the normal lifecycle path finishes its cleanup. This feature closes that gap by establishing a platform rule: no covered queued OperationRun may remain indefinitely ambiguous. If the system cannot prove that a run is still legitimately active, it must eventually reconcile that run to deterministic terminal truth.

V1 delivers that guarantee through three coordinated changes: a minimum lifecycle contract for covered queued jobs, a deterministic bridge from queue or infrastructure failure back to the owning run, and conservative stale-run reconciliation that heals orphaned queued or running runs. Monitoring surfaces then present that truth honestly by distinguishing fresh activity, likely stale work, and reconciled lifecycle failure without adding a second status model or promising resumability.
