TenantAtlas/specs/178-ops-truth-alignment/spec.md
2026-04-05 23:40:45 +02:00


# Feature Specification: Operations Lifecycle Alignment & Cross-Surface Truth Consistency
**Feature Branch**: `178-ops-truth-alignment`
**Created**: 2026-04-05
**Status**: Proposed
**Input**: User description: "Spec 178 - Operations Lifecycle Alignment & Cross-Surface Truth Consistency"
## Spec Scope Fields *(mandatory)*
- **Scope**: tenant + workspace + canonical-view + platform
- **Primary Routes**:
  - `/admin` as the workspace overview surface where workspace attention and workspace recent operations appear
  - `/admin/t/{tenant}` as the tenant dashboard surface where tenant attention, recent operations, and active progress affordances appear
  - `/admin/operations` as the canonical monitoring hub and drill-through destination from admin-plane summaries
  - `/admin/operations/{run}` as the canonical run detail surface
  - `/system/ops/runs`, `/system/ops/failures`, and `/system/ops/stuck` as the platform-plane monitoring registry surfaces
  - `/system/ops/runs/{run}` as the platform-plane operation detail surface
- **Data Ownership**:
  - Existing `OperationRun` records remain the only canonical lifecycle source of truth for queued, running, completed, stale, and automatically reconciled runs
  - Existing workspace-owned monitoring truth with optional tenant linkage remains in place; the feature does not add a second summary record, mirror lifecycle store, or notification-specific state model
  - Freshness interpretation, stale or reconciled visibility, terminal follow-up grouping, and cross-surface drill-through continuity remain derived views over existing `OperationRun` truth
  - No schema migration, no new persisted lifecycle state, and no enum rewrite are introduced
- **RBAC**:
  - Admin-plane summary and canonical-view surfaces continue to require workspace membership, and any tenant-bound summary or run detail continues to require tenant entitlement for the referenced tenant
  - Platform-plane system surfaces continue to rely on existing system operations view and manage capabilities without broadening `/system` access
  - Non-members or users outside the relevant workspace or tenant scope remain `404`; in-scope users lacking a capability for a guarded follow-up affordance remain `403`
  - Cross-plane navigation must remain explicit and must not leak tenant truth from admin surfaces into system surfaces or vice versa

For canonical-view specs, the spec MUST define:
- **Default filter behavior when tenant-context is active**: `/admin/operations` may continue to prefilter to the active tenant, but dashboard, attention, recent-operations, and notification drill-throughs MUST also preserve the originating problem class so operators land on the same issue family they clicked. Operators may broaden filters only within already entitled scope.
- **Explicit entitlement checks preventing cross-tenant leakage**: Every admin-plane summary claim, pre-applied filter, run detail page, and related drill-through MUST resolve only after workspace membership and tenant entitlement checks against the referenced run. Reconciled, stale, terminal-failure, and follow-up states must not reveal another tenant's existence or activity to unauthorized users.
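
The `404` versus `403` rule and the entitlement ordering described above can be illustrated with a minimal sketch. It is written in Python purely for illustration; the `Run` and `User` shapes, the `resolve_run_access` helper, and the capability string are hypothetical and are not taken from the codebase.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Run:
    id: int
    workspace_id: int
    tenant_id: Optional[int]  # tenant linkage is optional on workspace-owned runs


@dataclass(frozen=True)
class User:
    workspace_ids: FrozenSet[int]
    tenant_ids: FrozenSet[int]
    capabilities: FrozenSet[str] = frozenset()


def resolve_run_access(user: User, run: Run, required_capability: Optional[str] = None) -> int:
    """Return an HTTP-style status for an admin-plane request against a run."""
    # Scope checks come first so an out-of-scope user cannot learn the run exists.
    if run.workspace_id not in user.workspace_ids:
        return 404
    if run.tenant_id is not None and run.tenant_id not in user.tenant_ids:
        return 404
    # Only an in-scope user can receive 403 for a capability-guarded affordance.
    if required_capability is not None and required_capability not in user.capabilities:
        return 403
    return 200
```

The ordering is the point of the sketch: the capability check runs only after both scope checks pass, so a `403` never confirms the existence of a run in another tenant.
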
## UI/UX Surface Classification *(mandatory when operator-facing surfaces are changed)*
| Surface | Surface Type | Primary Inspect/Open Model | Row Click | Secondary Actions Placement | Destructive Actions Placement | Canonical Collection Route | Canonical Detail Route | Scope Signals | Canonical Noun | Critical Truth Visible by Default | Exception Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Tenant dashboard operations attention | Embedded attention summary | One explicit problem-class CTA per summary bucket | forbidden | none | none | `/admin/operations` | `/admin/operations/{run}` | Active tenant context and tenant-preserving destination state | Operations / Operation | Separate terminal issues from stale active issues | Multi-bucket summary surface |
| Tenant dashboard recent operations | Diagnostic recency table | Row open to canonical operation detail | required | header link only | none | `/admin/operations` with tenant prefilter | `/admin/operations/{run}` | Active tenant context and tenant-scoped recent activity | Operations / Operation | Fresh active, likely stale, and terminal follow-up states remain distinguishable per row | none |
| Bulk operation progress | Live progress indicator | Compact item open to canonical operation detail plus collection fallback | compact item link only | collection link only | none | `/admin/operations` with tenant prefilter | `/admin/operations/{run}` | Active tenant context and active-run-only framing | Operations / Operation | Only truly active or still-problematic runs remain visible | Compact progress surface |
| Workspace operations attention | Embedded attention summary | One explicit problem-class CTA per summary bucket | forbidden | none | none | `/admin/operations` | `/admin/operations/{run}` | Workspace scope plus tenant counts where relevant | Operations / Operation | Separate terminal issues from stale active issues across the workspace | Multi-bucket summary surface |
| Workspace recent operations | Diagnostic recency table | Row open to canonical operation detail | required | header link only | none | `/admin/operations` | `/admin/operations/{run}` | Workspace scope with tenant identity per row | Operations / Operation | Recent operations do not hide stale or reconciled truth behind generic recency language | none |
| Operations hub | Read-only Registry / Report | Full-row click to canonical operation detail | required | filters, tabs, and header-level context only | none on the list | `/admin/operations` | `/admin/operations/{run}` | Workspace scope, tenant filter state, and problem-class filter state | Operations / Operation | Lifecycle truth, freshness truth, and problem class are visible before opening detail | none |
| Canonical operation detail | Detail-first operational surface | Dedicated detail page | forbidden | detail header links only | none introduced by this spec | `/admin/operations` | `/admin/operations/{run}` | Workspace context, tenant context when relevant, and run identity | Operations / Operation | Decision-zone lifecycle truth and next step are visible without opening diagnostics | none |
| System failed operations | Read-only Registry / Report | Full-row click to system operation detail | required | header CTA only | none | `/system/ops/failures` | `/system/ops/runs/{run}` | Platform scope only | Operations / Operation | Terminal-problem truth remains aligned with admin-plane canonical truth | none |
| System stuck operations | Read-only Registry / Report | Full-row click to system operation detail | required | header CTA only | none | `/system/ops/stuck` | `/system/ops/runs/{run}` | Platform scope only | Operations / Operation | Active stale or stuck truth and reconciled visibility remain operator-legible | none |
## Operator Surface Contract *(mandatory when operator-facing surfaces are changed)*
| Surface | Primary Persona | Surface Type | Primary Operator Question | Default-visible Information | Diagnostics-only Information | Status Dimensions Used | Mutation Scope | Primary Actions | Dangerous Actions |
|---|---|---|---|---|---|---|---|---|---|
| Tenant dashboard operations attention | Tenant operator | Embedded attention summary | Do I have a terminal issue to follow up or an active run that is likely stale? | Separate counts or labels for terminal follow-up and stale active attention, with one matching destination each | Detailed failure payloads, count internals, and infrastructure evidence | problem class, urgency, tenant scope | none | Open terminal issues, open stale active issues | none |
| Tenant dashboard recent operations | Tenant operator | Diagnostic recency table | Which recent tenant operation should I inspect next? | Operation label, lifecycle truth, outcome, freshness truth, and recency | Failure internals, raw summary counts, extended diagnostics | execution status, execution outcome, freshness | none | Open operation detail, open operations list | none |
| Bulk operation progress | Tenant operator | Live progress indicator | Is this run really still active, or has its truth changed since enqueue time? | Active run identity, current visible lifecycle truth, and quick path to detail | Low-level progress internals and failure metadata | execution status, freshness, active visibility | none | Open operation detail, open operations list | none |
| Workspace operations attention | Workspace operator | Embedded attention summary | Which operation problem class needs workspace-level follow-up first? | Separate terminal issues from stale active issues, with workspace-safe destination semantics | Deep diagnostics remain on operations views | problem class, urgency, workspace spread | none | Open terminal issues, open stale active issues | none |
| Workspace recent operations | Workspace operator | Diagnostic recency table | Which operation across the workspace changed meaning recently? | Run identity, tenant, lifecycle truth, and recency | Deeper failure and reconciliation detail remain secondary | execution status, freshness, tenant scope | none | Open operation detail, open operations list | none |
| Operations hub | Workspace operator | Read-only Registry / Report | Is this run fresh active, likely stale, reconciled, or a terminal issue, and which bucket am I looking at? | Explicit problem-class framing, lifecycle truth, freshness truth, outcome, tenant or workspace scope | Queue internals, raw context, and extended traces | execution status, execution outcome, freshness, problem class | none | Open operation detail, adjust filter or tab | none |
| Canonical operation detail | Workspace operator | Detail-first operational surface | What happened, is the run still active, was it automatically reconciled, and what do I do next? | Primary decision zone with lifecycle assessment, active or not-active answer, reconciliation state, and one primary next step | Raw payloads, detailed failure arrays, and artifact-deep diagnostics | execution status, execution outcome, freshness, operator next action | none | Return to operations, open related artifact or follow-up destination | none introduced by this spec |
| System failed operations | Platform operator | Read-only Registry / Report | Which terminal operation issue needs platform investigation first? | Terminal problem class, operation identity, workspace, tenant, and recency | Deep diagnostics remain on system detail | execution outcome, terminal problem class, recency | none | Open operation detail, show all operations | none |
| System stuck operations | Platform operator | Read-only Registry / Report | Which active run crossed the stuck threshold or was recently auto-reconciled for that reason? | Stuck or stale class, operation identity, workspace, tenant, and recency | Deep diagnostics remain on system detail | freshness, lifecycle stall state, recency | none | Open operation detail, show all operations | none |
## Proportionality Review *(mandatory when structural complexity is introduced)*
- **New source of truth?**: No
- **New persisted entity/table/artifact?**: No
- **New abstraction?**: No
- **New enum/state/reason family?**: No
- **New cross-domain UI framework/taxonomy?**: Yes, but only as a narrow derived monitoring split between `terminal follow-up` and `active stale/stuck attention`, built from existing lifecycle truth rather than new stored state
- **Current operator problem**: Operators can currently see the same run framed as normal active progress on one surface, terminal or reconciled on another, invisible on system stuck surfaces, and mixed into a generic follow-up bucket elsewhere
- **Existing structure is insufficient because**: Existing surfaces already have valid local logic, but their aggregation, drill-through, and attention language do not consistently tell the same operator story for stale, reconciled, and terminal problem runs
- **Narrowest correct implementation**: Reuse the current `OperationRun`, status, outcome, freshness, and reconciliation model, then align summary buckets, filters, drill-throughs, and decision-zone emphasis across existing surfaces without adding persistence or a new lifecycle engine
- **Ownership cost**: The codebase takes on shared cross-surface classification rules, copy alignment, and regression coverage to keep dashboard, recent, bulk, admin monitoring, and system monitoring semantics locked together
- **Alternative intentionally rejected**: A new persisted problem-state model, an enum rewrite, a notification redesign, and a full operations architecture refactor were each rejected because the present issue is truth drift between existing surfaces, not missing core domain structure
- **Release truth**: Current-release truth. The feature hardens already shipped lifecycle semantics before more triage or monitoring slices depend on them
## User Scenarios & Testing *(mandatory)*
### User Story 1 - Recover The Same Truth From Every Entry Point (Priority: P1)
As an operator, I want dashboard, attention, recent-operations, monitoring, and system surfaces to describe the same run with the same problem class, so that I do not have to guess which screen is telling the truth.
**Why this priority**: Cross-surface truth drift is the core trust problem. If the same run reads differently across entry points, every later triage decision becomes suspect.
**Independent Test**: Can be fully tested by seeding fresh active, likely stale, reconciled-failed, and terminal problem runs, then verifying that tenant, workspace, canonical, and system surfaces classify the same run consistently and drill through into matching destinations.
**Acceptance Scenarios**:
1. **Given** a run is canonically `likely_stale`, **When** an operator sees it on tenant attention, workspace attention, recent operations, the operations hub, and canonical detail, **Then** none of those surfaces frame it as an unremarkable normal active run.
2. **Given** a run is terminal with a blocked, partial, or failed outcome, **When** an operator reaches it from dashboard or monitoring summaries, **Then** the destination confirms a terminal follow-up problem rather than an active stale issue.
3. **Given** a run was automatically reconciled after becoming stale, **When** an operator checks admin monitoring and system monitoring surfaces, **Then** the stale or reconciled history remains discoverable instead of disappearing from the truth chain.
---
### User Story 2 - Trust Live Progress Without Waiting For A New Event (Priority: P1)
As a tenant operator, I want local progress and recent-activity surfaces to stop implying that a finished or reconciled run is still active, even when no new enqueue event occurs, so that I can trust what is on screen.
**Why this priority**: Bulk progress and recent activity are the most immediate trust surfaces. If they lag behind canonical truth, operators see false liveness first.
**Independent Test**: Can be fully tested by opening active-progress and recent-operations surfaces, changing the underlying run to terminal or reconciled truth without dispatching a new enqueue event, and verifying that local surfaces update within the allowed refresh window and then stop behaving like live active surfaces.
**Acceptance Scenarios**:
1. **Given** BulkOperationProgress is open for an active run, **When** the run completes or is automatically reconciled, **Then** the surface stops presenting it as active within the next refresh cycle even if no new enqueue event fires.
2. **Given** Recent Operations is visible on a tenant or workspace surface, **When** a displayed run becomes likely stale or terminal, **Then** the row updates to the new truth instead of continuing to imply healthy progress.
3. **Given** no relevant active runs remain, **When** the surface reaches that state, **Then** live refresh stops or becomes inactive instead of polling indefinitely.
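
The refresh behaviour in these scenarios can be sketched as a pure function that re-reads canonical status on every cycle instead of waiting for an enqueue event. This is a minimal Python illustration under stated assumptions: `refresh_progress`, the `load_status` callback, and the 5-second interval are hypothetical, and treating `queued` and `running` as the only active statuses is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# Assumption: these are the only non-terminal statuses.
ACTIVE_STATUSES = frozenset({"queued", "running"})


def refresh_progress(
    tracked_ids: List[int],
    load_status: Callable[[int], str],
    interval_seconds: int = 5,  # hypothetical refresh window
) -> Tuple[List[int], Optional[int]]:
    """Re-read canonical status for each tracked run on every refresh cycle.

    Returns the runs that are still active and the next poll interval,
    or None when polling should stop because nothing active remains.
    """
    still_active = [run_id for run_id in tracked_ids if load_status(run_id) in ACTIVE_STATUSES]
    next_interval = interval_seconds if still_active else None
    return still_active, next_interval
```

Because the function consults the canonical status on each call, a run that completes or is reconciled drops out on the next cycle, and a `None` interval signals the surface to stop polling.
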
---
### User Story 3 - Decide What To Do From The Canonical Detail Surface (Priority: P2)
As an operator opening canonical run detail, I want the primary decision zone to tell me immediately whether the run is still active, likely stale, already reconciled, or terminal and what the next step is, so that I do not have to derive action from scattered diagnostics.
**Why this priority**: Canonical detail is the highest-trust surface. If it makes lifecycle attention secondary, summary surfaces cannot reliably inherit the right operator interpretation.
**Independent Test**: Can be fully tested by opening stale, reconciled, partial, failed, and healthy active runs and verifying that the decision zone makes lifecycle truth and next action visible without relying on banners or secondary panels alone.
**Acceptance Scenarios**:
1. **Given** a run is likely stale but not yet reconciled, **When** the canonical detail page loads, **Then** the primary decision zone states that the run is still non-terminal but likely unhealthy and names the next investigation step.
2. **Given** a run has already been automatically reconciled, **When** the canonical detail page loads, **Then** the primary decision zone states that the run is no longer active, that reconciliation already happened, and what follow-up is appropriate.
3. **Given** a run type has deeper artifact truth, **When** the canonical detail page loads, **Then** lifecycle truth and next action remain visible before artifact-deep diagnostics.
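
One way to picture the decision zone is as a fixed mapping from the derived lifecycle state to the three answers the scenarios require. The Python sketch below is illustrative only: the `DecisionZone` type, the `decision_zone` helper, and all next-step copy are hypothetical placeholders; only the four state names come from this spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionZone:
    still_active: bool      # is the run still non-terminal?
    auto_reconciled: bool   # did automatic reconciliation already happen?
    next_step: str          # one primary next step (placeholder copy)


_DECISIONS = {
    "fresh_active": DecisionZone(True, False, "No action needed while the run progresses."),
    "likely_stale": DecisionZone(True, False, "Investigate why the run stopped reporting progress."),
    "reconciled_failed": DecisionZone(False, True, "Review the reconciliation result and decide on follow-up."),
    "terminal_normal": DecisionZone(False, False, "Review the outcome."),
}


def decision_zone(lifecycle: str) -> DecisionZone:
    """Look up the decision-zone answers for a derived lifecycle state."""
    try:
        return _DECISIONS[lifecycle]
    except KeyError:
        # Unknown states fail loudly rather than falling back to a calm default.
        raise ValueError(f"unknown lifecycle state: {lifecycle!r}") from None
```

Raising on an unknown state, rather than defaulting to a healthy-looking answer, keeps the highest-trust surface from framing a run more calmly than its canonical truth.
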
---
### User Story 4 - Preserve Problem-Class Continuity In System And Notification Entry Points (Priority: P3)
As a system or workspace operator, I want notifications and platform monitoring entry points to confirm the same problem class that brought me there, so that I never land on a calmer or differently framed destination than the one I clicked.
**Why this priority**: Link continuity is where trust drift becomes obvious. If the destination tells a different story, operators stop trusting the product's routing and labels.
**Independent Test**: Can be fully tested by navigating from dashboard KPIs, attention items, recent operations, and operation notifications into admin and system monitoring destinations, then verifying that the originating problem class is visible and recoverable on arrival.
**Acceptance Scenarios**:
1. **Given** a notification frames a run as needing terminal follow-up, **When** the operator opens the linked destination, **Then** the destination visibly confirms that terminal-problem framing.
2. **Given** a dashboard or workspace attention link frames a run as stale active attention, **When** the operator opens the monitoring destination, **Then** the destination visibly confirms the stale active problem class instead of a generic mixed bucket.
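
Problem-class continuity can be sketched as a link builder that always carries the originating class into the destination. The Python below is a hedged illustration: the query parameter names `problem_class` and `tenant_id`, the two slug values, and the `operations_hub_url` helper are all hypothetical; only the `/admin/operations` route comes from this spec.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical slugs for the two problem classes named in this spec.
PROBLEM_CLASSES = frozenset({"terminal_follow_up", "active_stale_attention"})


def operations_hub_url(problem_class: str, tenant_id: Optional[int] = None) -> str:
    """Build a drill-through link that carries the originating problem class."""
    if problem_class not in PROBLEM_CLASSES:
        raise ValueError(f"unknown problem class: {problem_class!r}")
    params = {"problem_class": problem_class}
    # Tenant prefilter is added only when tenant context is present;
    # otherwise the link falls back to workspace scope with the same problem class.
    if tenant_id is not None:
        params["tenant_id"] = str(tenant_id)
    return "/admin/operations?" + urlencode(params)
```

A link built without a tenant keeps the same problem class, which mirrors the edge case where tenant context is stale or absent.
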
### Edge Cases
- A run may move from `likely_stale` to `reconciled_failed` while an operator keeps a local progress surface open; the UI must not continue showing healthy activity after reconciliation.
- A run may be removed from the active stuck list after reconciliation; the system truth chain must still expose that it was recently stale or auto-reconciled rather than making the issue disappear.
- A run may be terminal with a poor outcome and also belong to an artifact-heavy domain; the page must not bury the lifecycle answer behind artifact diagnostics.
- A tenant-scoped summary may link into the canonical operations hub while tenant context is stale or absent; the destination must preserve the correct tenant-safe problem filter or fall back to workspace-safe scope without changing the run's problem class.
- Notifications may be generated before a stale run is later reconciled; entry-point language and destinations must not stay calmer than the current run truth.
- Run types may differ in artifact richness, but none may diverge on the base question of fresh active, likely stale, reconciled, or terminal follow-up.
## Requirements *(mandatory)*
**Constitution alignment (required):** This feature introduces no new Microsoft Graph calls, no new write workflow, no new queued operation type, and no new persisted operations record. It hardens the truth alignment of existing operations and monitoring surfaces over existing `OperationRun`, freshness, and reconciliation semantics.
**Constitution alignment (PROP-001 / ABSTR-001 / PERSIST-001 / STATE-001 / BLOAT-001):** This feature stays deliberately narrow. It adds no new persistence, no new lifecycle table, no new orchestration layer, and no new enum family. The only new semantic split is a derived operator-facing distinction between terminal follow-up and active stale or stuck attention, built from existing status, outcome, freshness, and reconciliation truth.
**Constitution alignment (OPS-UX):** Existing `OperationRun` records remain subject to the three-surface feedback contract. Toasts remain intent-only. Active awareness remains on allowed progress and monitoring surfaces only. Terminal state transitions remain service-owned. This feature may change how active progress surfaces refresh and how summaries classify runs, but it must not add ad-hoc status mutation or a second terminal lifecycle model. Summary counts remain numeric-only and scheduled or system-run notification rules remain unchanged. Regression coverage MUST prove progress freshness, truth alignment, and reconciled visibility without reintroducing direct state mutation on render surfaces.
**Constitution alignment (RBAC-UX):** This feature spans the admin plane and the platform plane. Admin-plane tenant and workspace surfaces continue to use deny-as-not-found for non-members or non-entitled users, and canonical operation routes continue to authorize from workspace and tenant entitlement before revealing run truth. Platform-plane system monitoring continues to rely on platform capability checks. The feature adds no new mutation, no new destructive action, and no cross-plane bypass. Any in-scope destination affordance that is visible but capability-gated must be rendered disabled or accompanied by helper text rather than turning into a misleading dead-end link.
**Constitution alignment (OPS-EX-AUTH-001):** Not applicable. Authentication handshake exceptions remain unrelated to operations monitoring and cannot be used to justify stale or reconciled truth drift.
**Constitution alignment (BADGE-001):** Existing centralized semantics for operation status, outcome, freshness, and related attention labels remain authoritative. The feature MUST NOT allow dashboard widgets, recent-operation surfaces, operations hub rows, or notifications to invent page-local meanings for stale, reconciled, blocked, partial, or failed states.
**Constitution alignment (UI-FIL-001):** The feature reuses existing Filament widgets, tables, detail sections, alerts, tabs, and shared UI primitives. It should strengthen semantic emphasis through existing components and shared mappings, not through page-local markup or a new local status language.
**Constitution alignment (UI-NAMING-001):** The target objects are operations summary buckets, operation rows, run detail labels, and notification or entry-point copy. `Needs follow-up` may remain as an umbrella concept, but operator-facing copy MUST differentiate the two problem classes it currently mixes: terminal follow-up and active stale or stuck attention. Copy MUST NOT use a generic `blocked` or `needs follow-up` label for a mixed bucket unless the visible sub-class is also made explicit.
**Constitution alignment (UI-CONST-001 / UI-SURF-001 / UI-HARD-001 / UI-EX-001 / UI-REVIEW-001):** Each changed surface keeps one primary inspect or drill-through model. Attention summaries use explicit problem-class destinations. Recent-operation tables keep row-click inspection. The operations hub remains a scan-first registry with explicit problem-class filtering. The canonical detail page remains the highest-trust detail surface. System failed and system stuck lists remain row-click-only registry surfaces. No new destructive action is introduced, and no exception to the action-surface contract is required.
**Constitution alignment (OPSURF-001):** Default-visible content must stay operator-first. Summary surfaces answer whether the operator is dealing with terminal follow-up or active stale attention. The operations hub answers which bucket the operator is in and what it means. Canonical detail answers what happened, whether the run is still active, and what to do next before showing diagnostics. System surfaces answer which platform-visible failure or stuck class is being surfaced without requiring the operator to infer it from raw context.
**Constitution alignment (UI-SEM-001 / LAYER-001 / TEST-TRUTH-001):** Direct mapping from canonical run truth to UI remains preferred. The feature may add a thin derived problem-class split, but it must not create redundant truth across persisted records, presenters, summaries, notifications, and system surfaces. Tests MUST focus on operator-visible consequences: whether the same run tells the same story across surfaces and whether drill-through preserves that story.
**Constitution alignment (Filament Action Surfaces):** The Action Surface Contract remains satisfied. No new View actions, no empty action groups, and no list-level destructive controls are introduced. Changed dashboard and monitoring surfaces remain inspection or drill-through surfaces only. UI-FIL-001 remains satisfied with no exemption.
**Constitution alignment (UX-001 — Layout & Information Architecture):** The canonical run detail page keeps one primary decision zone and must elevate stale or reconciled lifecycle truth inside that decision zone rather than only in side banners or lower sections. Summary surfaces keep operator priority order: problem class first, recency and diagnostics second. Existing tables continue to support search, sort, and filtering on core lifecycle dimensions.
### Functional Requirements
- **FR-178-001**: The system MUST treat canonical `OperationRun` lifecycle and freshness truth as authoritative for every summary, list, detail, and notification surface covered by this feature.
- **FR-178-002**: The same run MUST NOT appear as fresh normal activity on one covered surface and as likely stale, reconciled, or terminal problem truth on another covered surface at the same time.
- **FR-178-003**: Covered admin and system monitoring surfaces MUST use one shared derived lifecycle interpretation that distinguishes at least `fresh_active`, `likely_stale`, `reconciled_failed`, and `terminal_normal` without introducing a new persisted state model.
- **FR-178-004**: Reconciliation behavior and system stuck monitoring MUST remain semantically aligned so stale runs do not disappear from operator truth once they are auto-reconciled.
- **FR-178-005**: Automatically reconciled stale runs MUST remain semantically discoverable for operators on admin monitoring or system monitoring surfaces within one navigation step.
- **FR-178-006**: Bulk operation progress surfaces MUST refresh while relevant active runs exist and MUST stop presenting a run as active once canonical truth becomes terminal or reconciled.
- **FR-178-007**: Bulk operation progress surfaces MUST remove or reclassify terminal or reconciled runs within one refresh cycle even when no new enqueue event occurs.
- **FR-178-008**: Recent Operations surfaces on tenant and workspace pages MUST distinguish fresh active runs, likely stale active runs, and terminal follow-up runs rather than flattening them into generic recency.
- **FR-178-009**: Tenant and workspace attention surfaces MUST separate terminal follow-up from active stale or stuck attention instead of mixing them into one undifferentiated bucket.
- **FR-178-010**: The operations hub MUST expose an explicit monitoring view, filter, or tab for active but likely stale runs and an explicit view, filter, or tab for terminal follow-up runs.
- **FR-178-011**: Dashboard, attention, KPI, and recent-operation drill-throughs into the operations hub MUST preserve the originating problem class in visible destination framing.
- **FR-178-012**: The canonical run detail page MUST present stale, reconciled, and terminal-problem lifecycle truth inside the primary decision zone rather than only in secondary banners, side panels, or lower diagnostic sections.
- **FR-178-013**: For likely stale and reconciled runs, the primary decision zone MUST answer whether the run is still active, whether automatic reconciliation already happened, and what the primary next step is.
- **FR-178-014**: Local summary and progress surfaces MUST reuse centralized status, outcome, freshness, and problem-class semantics rather than page-local mappings.
- **FR-178-015**: Notification and in-app entry-point language MUST NOT frame a run more calmly than its current lifecycle or freshness truth.
- **FR-178-016**: Cross-links from dashboard KPIs, attention surfaces, recent operations, and notifications MUST land on destination surfaces that visibly confirm the same problem class that initiated the navigation.
- **FR-178-017**: The feature MUST use the existing `OperationRun`, status, outcome, freshness, and reconciliation model without introducing a schema migration, a new persisted lifecycle artifact, or an enum rewrite.
- **FR-178-018**: Run-type differences MAY preserve deeper artifact truth, but they MUST NOT change the base lifecycle answers of fresh active, likely stale, reconciled, or terminal follow-up.
- **FR-178-019**: Regression coverage MUST prove that the same seeded runs are classified consistently across tenant dashboard, workspace overview, operations hub, canonical run detail, and system failed or stuck surfaces.
- **FR-178-020**: Regression coverage MUST prove bulk-progress freshness, reconciliation visibility, drill-through continuity, and decision-zone emphasis for stale or reconciled runs.
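
The shared derived interpretation from FR-178-003 and the problem-class split from FR-178-009 could look roughly like the following. This is a minimal Python sketch under stated assumptions, not the actual model: the `RunSnapshot` fields, the 15-minute `STALE_AFTER` threshold, the problem-class slugs, and the choice to file reconciled runs under terminal follow-up are all hypothetical. Nothing here is persisted; both values are derived on read.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RunSnapshot:
    status: str                   # "queued", "running", or "completed"
    outcome: Optional[str]        # e.g. "failed", "partial", "blocked"
    last_activity_at: datetime
    auto_reconciled: bool = False


STALE_AFTER = timedelta(minutes=15)  # hypothetical freshness threshold


def lifecycle(run: RunSnapshot, now: datetime) -> str:
    """Derive one of the four lifecycle states named in FR-178-003."""
    if run.status in {"queued", "running"}:
        stale = now - run.last_activity_at > STALE_AFTER
        return "likely_stale" if stale else "fresh_active"
    if run.auto_reconciled:
        return "reconciled_failed"
    return "terminal_normal"


def problem_class(run: RunSnapshot, now: datetime) -> Optional[str]:
    """Derive the operator-facing problem class, or None when no attention is needed."""
    state = lifecycle(run, now)
    if state == "likely_stale":
        return "active_stale_attention"
    # Assumption: reconciled runs and poor terminal outcomes share the terminal bucket.
    if state == "reconciled_failed" or run.outcome in {"blocked", "partial", "failed"}:
        return "terminal_follow_up"
    return None
```

If every covered surface calls the same two functions, a single seeded run cannot be classified differently on the dashboard, the operations hub, and the system lists, which is the property FR-178-019 asks regression coverage to prove.
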
## UI Action Matrix *(mandatory when Filament is changed)*
| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Tenant dashboard operations attention | `/admin/t/{tenant}` dashboard | none | Explicit problem-class CTA per bucket | none | none | Existing healthy fallback remains read-only reassurance only when no operations issue exists | n/a | n/a | no new audit behavior | Summary surface only; must not render mixed problem buckets |
| Tenant dashboard recent operations | `/admin/t/{tenant}` dashboard | `Open operations` | Row click to canonical operation detail | none | none | Existing diagnostic empty state remains non-primary | n/a | n/a | no new audit behavior | Recency surface; no destructive actions |
| Workspace operations attention | `/admin` workspace overview | none | Explicit problem-class CTA per bucket | none | none | Existing healthy fallback remains read-only reassurance only when no operations issue exists | n/a | n/a | no new audit behavior | Summary surface only; must not render mixed problem buckets |
| Workspace recent operations | `/admin` workspace overview | `Open operations` | Row click to canonical operation detail | none | none | Existing diagnostic empty state remains non-primary | n/a | n/a | no new audit behavior | Recency surface; no destructive actions |
| Operations hub | `/admin/operations` | Filter or tab controls only; no new destructive actions | Full-row click to canonical operation detail | none | none | Existing empty state remains explanatory and filter-aware | n/a | n/a | no new audit behavior | Scan-first registry surface; problem-class filters must align with summary entry points |
| Canonical operation detail | `/admin/operations/{run}` | `Back to operations` plus existing related navigation only | n/a | n/a | n/a | n/a | Existing related navigation only; no new destructive action introduced by this spec | n/a | no new audit behavior | Decision-zone truth is the hardening target |
| System failed operations | `/system/ops/failures` | `Show all operations` | Full-row click to system operation detail | none | none | `Show all operations` | n/a | n/a | no new audit behavior | Must confirm terminal-problem semantics, not generic follow-up |
| System stuck operations | `/system/ops/stuck` | `Show all operations` | Full-row click to system operation detail | none | none | `Show all operations` | n/a | n/a | no new audit behavior | Must preserve stale or reconciled visibility for platform operators |
### Key Entities *(include if feature involves data)*
- **Operation Run**: The canonical operational record whose status, outcome, freshness, and reconciliation context define the authoritative lifecycle truth.
- **Freshness State**: The derived lifecycle interpretation that distinguishes fresh active work, likely stale work, reconciled failure, and normal terminal completion without adding new persisted state.
- **Problem Class**: The operator-facing split between terminal follow-up and active stale or stuck attention, derived from existing lifecycle truth and used to align summary surfaces and drill-throughs.
- **Drill-through Contract**: The promise that a summary count, notification, or attention label can be visibly rediscovered on the destination surface it opens.
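
The derived entities above can be sketched as pure functions over an existing run record. This is a minimal illustration and not the project's implementation: the `Run` shape, the status and outcome strings, the `reconciled` flag, and the 15-minute `STALE_AFTER` threshold are all hypothetical, and grouping reconciled runs under terminal follow-up is an assumption made only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional

# Hypothetical threshold; the real freshness thresholds live in the
# existing lifecycle guarantees and are not defined by this spec.
STALE_AFTER = timedelta(minutes=15)


class Freshness(Enum):
    FRESH_ACTIVE = "fresh_active"
    LIKELY_STALE = "likely_stale"
    RECONCILED = "reconciled"
    TERMINAL = "terminal"


class ProblemClass(Enum):
    NONE = "none"
    TERMINAL_FOLLOW_UP = "terminal_follow_up"
    ACTIVE_STALE_ATTENTION = "active_stale_attention"


@dataclass(frozen=True)
class Run:
    """Illustrative stand-in for an OperationRun record (field names assumed)."""
    status: str                 # "queued", "running", or "completed"
    outcome: Optional[str]      # e.g. "succeeded" or "failed"; None while active
    last_activity_at: datetime
    reconciled: bool = False    # set when stale reconciliation closed the run


def freshness(run: Run, now: datetime) -> Freshness:
    """Derive the freshness state without persisting anything new."""
    if run.status == "completed":
        return Freshness.RECONCILED if run.reconciled else Freshness.TERMINAL
    if now - run.last_activity_at > STALE_AFTER:
        return Freshness.LIKELY_STALE
    return Freshness.FRESH_ACTIVE


def problem_class(run: Run, now: datetime) -> ProblemClass:
    """Map freshness plus outcome onto the operator-facing problem class."""
    state = freshness(run, now)
    if state is Freshness.LIKELY_STALE:
        return ProblemClass.ACTIVE_STALE_ATTENTION
    if state is Freshness.RECONCILED:
        return ProblemClass.TERMINAL_FOLLOW_UP  # assumption for this sketch
    if state is Freshness.TERMINAL and run.outcome == "failed":
        return ProblemClass.TERMINAL_FOLLOW_UP
    return ProblemClass.NONE
```

Because both functions read only existing fields and the current time, every surface that calls them would reach the same answer for the same run, which is the property the Problem Class and Drill-through Contract entities rely on.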
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-178-001**: In covered regression scenarios, 100% of runs seeded as `likely_stale` are shown as stale or otherwise problematic on every covered summary and monitoring surface, and 0 are shown as unremarkable fresh activity.
- **SC-178-002**: In covered regression scenarios, 100% of automatically reconciled stale runs remain identifiable as reconciled stale runs and reachable through admin monitoring or system monitoring within one navigation step.
- **SC-178-003**: In covered freshness regression scenarios, local progress surfaces stop showing terminal or reconciled runs as active within one refresh cycle and without requiring a new enqueue event.
- **SC-178-004**: In covered navigation regression scenarios, 100% of dashboard, attention, recent-operation, and notification drill-throughs land on destinations whose visible framing matches the originating problem class.
- **SC-178-005**: In operator review on seeded scenarios, an operator can determine within 10 seconds whether a run is fresh active, likely stale, reconciled, or terminal follow-up from every covered entry surface.
- **SC-178-006**: The feature ships without a schema migration, a new persisted lifecycle artifact, or a new status or outcome family.
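
SC-178-003 can be illustrated with a small sketch of a progress surface that re-derives its active list from current run truth on every refresh instead of waiting for a new enqueue event. The dictionary shape, the status strings, and the function names are hypothetical and chosen only for this example.

```python
# Hypothetical status values treated as "still in flight".
ACTIVE_STATUSES = {"queued", "running"}


def still_active(run: dict) -> bool:
    """A run counts as active only if it is in flight and not reconciled."""
    return run["status"] in ACTIVE_STATUSES and not run.get("reconciled", False)


def refresh_progress(displayed_ids: list, runs_by_id: dict) -> list:
    """Return the run IDs that should remain on the progress surface.

    The surface passes in what it currently shows; anything that has become
    terminal, reconciled, or missing is dropped in a single refresh cycle.
    """
    return [
        run_id
        for run_id in displayed_ids
        if run_id in runs_by_id and still_active(runs_by_id[run_id])
    ]
```

A regression test in this style would seed one running run, one completed run, and one reconciled run, call the refresh once, and assert that only the running run is still displayed.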
## Assumptions
- Existing lifecycle freshness and reconciliation semantics from the operation lifecycle guarantees work remain the authoritative base truth for this hardening slice.
- Existing run-detail decision-zone structure remains the correct place to elevate stale and reconciled lifecycle truth.
- Existing tenant and workspace dashboard truth alignment work remains the baseline grammar for admin-plane summary surfaces.
- Existing system operations surface alignment remains the baseline interaction model for `/system/ops/failures` and `/system/ops/stuck`.
## Non-Goals
- Introducing tenant-admin retry or cancel capabilities
- Rebuilding the operations domain, run schema, or lifecycle engine
- Adding a new persisted problem-state model, enum rewrite, or schema migration
- Redesigning all notification behavior across the product
- Performing deep non-governance result-quality analysis for every run type
- Replacing run-type-specific artifact truth with a uniform artifact model
## Dependencies
- Existing operations auto-refresh behavior and active-run polling patterns
- Existing operation lifecycle guarantees, freshness thresholds, and reconciliation behavior
- Existing canonical run detail hierarchy and decision-zone structure
- Existing tenant dashboard and workspace overview truth-alignment semantics
- Existing system operations surface alignment for row-click-only platform monitoring pages
## Risks
- If stale thresholds are too aggressive, legitimate long-running work could be surfaced as stale too early.
- If summary and monitoring surfaces share labels but not the same underlying filter meaning, operators will continue to mistrust drill-throughs.
- If reconciled stale visibility is over-corrected without hierarchy, system surfaces could become noisy instead of trustworthy.
- If local progress polling is too eager, the product could gain freshness at the cost of unnecessary load.
## Definition of Done
Spec 178 is complete when:
- `BulkOperationProgress` no longer leaves trust-damaging stale residue that keeps terminal or reconciled runs looking active.
- stale or stuck semantics are consistent between lifecycle reconciliation, tenant and workspace summaries, the operations hub, canonical run detail, and system stuck or failure surfaces.
- tenant and workspace summary surfaces visibly separate terminal problem runs from active stale or stuck runs.
- the operations hub no longer distorts dashboard semantics through mixed or misleading tabs, filters, or bucket names.
- the canonical run detail page prioritizes stale or reconciled lifecycle truth inside the primary decision zone.
- cross-surface links preserve the same operator-visible problem class from origin to destination.
- focused regression coverage proves truth alignment, stale visibility, drill-through continuity, and progress freshness.
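
The drill-through continuity item above could be checked with a regression helper along the following lines. This is a sketch under assumptions: the row shape, the `problem_class` labels, and both filter functions are hypothetical, and `problem_class` is treated as a derived label rather than a persisted column.

```python
from collections import Counter

# Hypothetical operator-facing problem classes.
PROBLEM_CLASSES = ("terminal_follow_up", "active_stale_attention")

# Hypothetical seeded rows with a pre-derived problem class.
RUNS = [
    {"id": 1, "problem_class": "terminal_follow_up"},
    {"id": 2, "problem_class": "active_stale_attention"},
    {"id": 3, "problem_class": "active_stale_attention"},
    {"id": 4, "problem_class": "none"},
]


def summary_counts(runs):
    """Counts a dashboard attention surface would show per problem class."""
    return Counter(
        r["problem_class"] for r in runs if r["problem_class"] in PROBLEM_CLASSES
    )


def aligned_filter(runs, problem_class):
    """Destination filter that uses the same meaning as the summary."""
    return [r["id"] for r in runs if r["problem_class"] == problem_class]


def drifted_filter(runs, problem_class):
    """Destination filter that mixes both problem classes under one label."""
    return [r["id"] for r in runs if r["problem_class"] in PROBLEM_CLASSES]


def contract_holds(runs, destination_filter):
    """True when every summary count can be rediscovered at its destination."""
    counts = summary_counts(runs)
    return all(
        len(destination_filter(runs, pc)) == counts[pc] for pc in PROBLEM_CLASSES
    )
```

With the seeded rows, `contract_holds(RUNS, aligned_filter)` passes while `contract_holds(RUNS, drifted_filter)` fails, which is the kind of shared-label, different-meaning drift the Risks section warns about.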
## Summary
This feature is a late-foundation hardening slice for the operations domain. The underlying lifecycle model is already strong: `OperationRun` is canonical, status and outcome are separated, stale reconciliation exists, system stuck surfaces exist, and canonical run detail already owns the deepest operational truth. The remaining problem is not missing architecture; it is trust drift between surfaces that summarize or relabel that truth.
Spec 178 closes that gap by making every covered surface tell the same story about whether a run is still active, likely stale, already reconciled, or terminal and in need of follow-up. It keeps the model narrow by reusing existing lifecycle and freshness truth, then aligning summaries, live progress, drill-throughs, and decision-zone emphasis so operators do not have to reconcile conflicting screens by hand.