TenantAtlas/specs/178-ops-truth-alignment/research.md
ahmido 1142d283eb feat: Spec 178 — Operations Lifecycle Alignment & Cross-Surface Truth Consistency (#209)
## Spec 178 — Operations Lifecycle Alignment & Cross-Surface Truth Consistency

Hardens run-lifecycle truth and cross-surface consistency across all central operator surfaces.

### Core changes

**Lifecycle Truth Alignment**
- Unified stale/stuck semantics across tenant, workspace, admin, and system surfaces
- `OperationRunFreshnessState` is propagated consistently across all widgets and pages
- Shared problem-class separation: `terminal_follow_up` vs. `active_stale_attention`

**BulkOperationProgress Freshness**
- Overlay now shows only `healthyActive()` runs instead of all active runs
- Likely-stale runs no longer keep polling artificially active
- Terminal runs disappear promptly from the progress overlay

**Decision Zone in Run Detail**
- Stale/reconciled attention sits in the primary decision hierarchy
- Clear answers: active? stale? reconciled? next step?
- Artifact-rich runs keep lifecycle truth ahead of deep diagnostics

**Cross-Surface Link-Continuity**
- Dashboard → Operations Hub → Run Detail tell the same story
- Notifications reference the correct problem class
- Workspace/tenant attention links according to problem class

**System-Plane Fixes**
- `/system/ops/failures` 500 error fixed (panel-safe artifact URLs)
- System Stuck/Failures show reconciled stale lineage

### Further fixes
- Inventory auth guard cleaned up (Gate instead of ad-hoc facades)
- Browser smoke tests stabilized (DOM assertions instead of fragile clicks)
- Test assertion drift for verification/lifecycle texts corrected

### Test result
Full Suite: **3269 passed**, 8 skipped, 0 failed

### Spec artifacts
- `specs/178-ops-truth-alignment/spec.md`
- `specs/178-ops-truth-alignment/plan.md`
- `specs/178-ops-truth-alignment/tasks.md`
- `specs/178-ops-truth-alignment/research.md`
- `specs/178-ops-truth-alignment/data-model.md`
- `specs/178-ops-truth-alignment/quickstart.md`
- `specs/178-ops-truth-alignment/contracts/operations-truth-alignment.openapi.yaml`

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #209
2026-04-05 22:42:24 +00:00

# Phase 0 Research: Operations Lifecycle Alignment & Cross-Surface Truth Consistency
## Decision: Reuse the existing freshness and reconciliation model as the lifecycle base
**Rationale**: The repo already has `OperationRunFreshnessState`, `OperationLifecyclePolicy`, `likelyStale()` model scope logic, and `context.reconciliation` metadata. The trust problem is not missing lifecycle truth. It is that different surfaces summarize or surface that truth differently. Reusing the current freshness/reconciliation model is narrower and keeps Spec 178 aligned with Spec 160 instead of creating a competing lifecycle layer.
**Alternatives considered**:
- Create a new persisted problem-state field on `operation_runs`: rejected because the feature is explicitly scoped away from schema change and because the current operator problem is cross-surface drift, not missing stored truth.
- Introduce a second enum family for `problem class`: rejected because the current freshness and outcome model already provides enough signal for a derived split.
## Decision: Introduce a thin derived split between `terminal follow-up` and `active stale/stuck attention`
**Rationale**: The current main mixing point is `OperationRun::dashboardNeedsFollowUp()`, which ORs terminal failures and `likelyStale()` runs into one bucket. The narrowest correction is to derive two explicit attention classes from existing outcome and freshness truth, not to create a dashboard or monitoring framework.
**Alternatives considered**:
- Keep `needs follow-up` as the only bucket and add more text: rejected because the spec requires operators to distinguish active stale from terminal problems without guesswork.
- Build a new cross-domain attention taxonomy: rejected as disproportionate for a monitoring-only hardening slice.
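
The derived split described in this decision can be illustrated with a small language-neutral sketch. It is written in Python for brevity and is not the repo's PHP implementation. The class names follow the `terminal_follow_up` and `active_stale_attention` labels; the `RunSnapshot` type, its fields, and the outcome values are hypothetical stand-ins for the existing outcome and freshness truth.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AttentionClass(Enum):
    TERMINAL_FOLLOW_UP = "terminal_follow_up"
    ACTIVE_STALE_ATTENTION = "active_stale_attention"


@dataclass(frozen=True)
class RunSnapshot:
    """Hypothetical read-only view of a run; field names are assumptions."""
    status: str         # assumed values: "queued", "running", "completed"
    outcome: str        # assumed values: "pending", "succeeded", "failed"
    likely_stale: bool  # stands in for the existing likelyStale() truth


def attention_class(run: RunSnapshot) -> Optional[AttentionClass]:
    """Derive the attention class from existing truth; nothing new is persisted."""
    is_active = run.status in ("queued", "running")
    if is_active and run.likely_stale:
        return AttentionClass.ACTIVE_STALE_ATTENTION
    if run.status == "completed" and run.outcome == "failed":
        return AttentionClass.TERMINAL_FOLLOW_UP
    return None  # healthy active or cleanly finished: no attention needed
```

Because the class is computed on read, every surface that calls the same derivation reports the same answer, which is the point of splitting the single follow-up bucket rather than adding text to it.
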
## Decision: Extend existing model/query and presenter seams instead of introducing a new helper framework
**Rationale**: Existing seams already own most of the relevant logic: `OperationRun` scopes own query truth, `OperationUxPresenter` owns operator-facing guidance and notification wording, `ActiveRuns` already controls polling on several surfaces, and `OperationRunLinks` already owns canonical navigation. Extending those seams keeps the change local and consistent with the repo's bias toward reusing existing seams over new frameworks.
**Alternatives considered**:
- Add a new reusable classification service or taxonomy registry: rejected because the slice only needs a thin derived split, and a new service or registry would risk introducing a semantic framework the constitution explicitly disfavors.
- Push all logic into widget-local queries: rejected because that would preserve or worsen the existing drift.
## Decision: Apply the repo's conditional polling pattern to `BulkOperationProgress`
**Rationale**: Dashboard and recent-operation widgets already poll only while active runs exist. `BulkOperationProgress` currently refreshes its run list but does not share the same disciplined poll gate. Extending the existing 10-second active-run polling pattern to that component is the narrowest way to remove stale live-progress residue.
**Alternatives considered**:
- Keep the overlay event-driven only: rejected because the spec explicitly identifies stale UI residue after canonical truth changes without a new enqueue event.
- Add aggressive always-on polling: rejected because other repo surfaces already established a conditional active-run-only approach.
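
A minimal sketch of the conditional poll gate, assuming the component can ask for the statuses of the runs it currently tracks. It is written in Python as a language-neutral illustration; `poll_interval` and the status names are hypothetical, and only the 10-second interval comes from the rationale above.

```python
from typing import Iterable, Optional

ACTIVE_STATUSES = ("queued", "running")  # assumed status names
POLL_INTERVAL = "10s"                    # the 10-second interval named above


def poll_interval(run_statuses: Iterable[str]) -> Optional[str]:
    """Return the poll interval while any run is active, else None (polling off)."""
    if any(status in ACTIVE_STATUSES for status in run_statuses):
        return POLL_INTERVAL
    return None
```

Returning `None` once no active run remains lets the surface stop refreshing on its own, without waiting for a new enqueue event.
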
## Decision: Keep `BulkOperationProgress` as an active-only surface
**Rationale**: The overlay's job is to tell the operator about currently active work. Once a run becomes terminal or reconciled, the canonical follow-up story belongs on recent operations, attention surfaces, the monitoring hub, and detail pages. Removing terminal/reconciled runs from the overlay is narrower and preserves surface roles.
**Alternatives considered**:
- Reclassify terminal or reconciled runs inside the overlay: rejected because it would turn the overlay into a second summary/attention widget with overlapping semantics.
- Leave terminal/reconciled runs visible until manual refresh: rejected because it directly violates the spec's trust objective.
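
The active-only rule can be sketched as a plain filter. This is a Python illustration under stated assumptions: `OverlayRun` and its fields are hypothetical, and "healthy active" is assumed to mean queued or running, not likely stale, and not reconciled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OverlayRun:
    """Hypothetical overlay row; field names are assumptions."""
    id: int
    status: str                # assumed values: "queued", "running", "completed"
    likely_stale: bool = False
    reconciled: bool = False


def overlay_runs(runs: List[OverlayRun]) -> List[OverlayRun]:
    """Keep only runs that are still healthy and active."""
    return [
        run
        for run in runs
        if run.status in ("queued", "running")
        and not run.likely_stale
        and not run.reconciled
    ]
```

Runs dropped by this filter are not lost; under the decision above, their follow-up story belongs to the recent-operations, attention, and detail surfaces.
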
## Decision: Keep `/admin/operations` as the sole canonical collection route and preserve problem class through filter state
**Rationale**: The repo already standardized canonical operations navigation on `/admin/operations` and `/admin/operations/{run}`. Dashboard, workspace, and notification entry points should carry tenant-safe problem-class filter or tab state into that route instead of opening new collection pages.
**Alternatives considered**:
- Add tenant-specific operations collection routes: rejected because the repo already treats canonical operations as workspace-context routes with tenant-safe filtering.
- Keep raw route links without filter continuity: rejected because that is the current trust-drift problem.
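
As an illustration of carrying problem class through filter state, the sketch below builds a link to the canonical collection route. It is Python, not the repo's `OperationRunLinks` code, and the query parameter names `tenant` and `problem_class` are hypothetical; the source does not say how the filter or tab state is encoded.

```python
from typing import Optional
from urllib.parse import urlencode

OPERATIONS_INDEX = "/admin/operations"  # canonical collection route


def operations_url(
    problem_class: Optional[str] = None,
    tenant_id: Optional[int] = None,
) -> str:
    """Build a canonical operations link that preserves filter state."""
    params = {}
    if tenant_id is not None:
        params["tenant"] = str(tenant_id)         # hypothetical parameter name
    if problem_class is not None:
        params["problem_class"] = problem_class   # hypothetical parameter name
    if not params:
        return OPERATIONS_INDEX
    return f"{OPERATIONS_INDEX}?{urlencode(params)}"
```

Every entry point then targets the same route and differs only in the state it passes along, so no additional collection pages are needed.
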
## Decision: Keep `/system/ops/stuck` focused on active stale candidates and surface reconciled stale lineage elsewhere in system monitoring
**Rationale**: The `Stuck` page is already a clean read-only registry of queued/running runs that crossed the lifecycle threshold. Widening it to include completed reconciled runs would blur that page's purpose. The narrower fix is to keep `Stuck` active-only while ensuring `/system/ops/runs`, `/system/ops/failures`, and system detail explicitly reveal when a failed or completed run was auto-reconciled from a stale lifecycle state.
**Alternatives considered**:
- Add completed reconciled runs directly to `/system/ops/stuck`: rejected because it would collapse active-stuck and terminal-history semantics into one list.
- Leave reconciled stale lineage visible only on admin surfaces: rejected because Spec 178 requires the system truth chain to remain consistent too.
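
One way to reveal reconciled stale lineage on the runs, failures, and detail views is to read the existing `context.reconciliation` metadata when rendering a row. The Python sketch below assumes the context is available as a dictionary; the `reason` key and the wording of the note are hypothetical.

```python
from typing import Any, Dict, Optional


def reconciled_stale_lineage(context: Dict[str, Any]) -> Optional[str]:
    """Return a short lineage note when the run context carries reconciliation metadata."""
    reconciliation = context.get("reconciliation")
    if not isinstance(reconciliation, dict):
        return None  # run was never auto-reconciled
    reason = reconciliation.get("reason", "unknown reason")  # hypothetical key
    return f"Auto-reconciled from stale lifecycle state ({reason})"
```

The `Stuck` page would not call this at all, since it stays limited to active candidates; only surfaces that list terminal runs need the note.
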
## Decision: Promote stale/reconciled lifecycle truth through the existing canonical decision-zone seam
**Rationale**: Spec 164 already established the canonical run detail as a decision-first surface. The right fix is to strengthen that existing decision/current-state zone so it answers active vs reconciled vs terminal clearly. A new banner framework or a separate detail card would just reintroduce duplicated truth.
**Alternatives considered**:
- Add more banners above the current page: rejected because it would duplicate lifecycle emphasis rather than integrating it into the canonical decision hierarchy.
- Push the lifecycle answer into diagnostics only: rejected because the spec explicitly disallows that.
## Decision: Keep notification hardening on the existing `OperationRunCompleted` path
**Rationale**: Terminal notifications already route through `OperationUxPresenter::terminalDatabaseNotification()`. Extending that path to preserve stale/reconciled problem-class wording is the narrowest way to keep entry points aligned with canonical truth.
**Alternatives considered**:
- Introduce a new notification class for reconciled or stale-derived terminal runs: rejected because the spec explicitly avoids a notification redesign.
- Leave notifications unchanged and rely on the destination page only: rejected because entry-point semantics must not be calmer than the current truth.