
# Phase 0 Research: Operations Lifecycle Alignment & Cross-Surface Truth Consistency

## Decision: Reuse the existing freshness and reconciliation model as the lifecycle base

**Rationale**: The repo already has `OperationRunFreshnessState`, `OperationLifecyclePolicy`, `likelyStale()` model scope logic, and `context.reconciliation` metadata. The trust problem is not missing lifecycle truth; it is that different surfaces summarize and present that truth differently. Reusing the current freshness/reconciliation model is the narrower option and keeps Spec 178 aligned with Spec 160 instead of creating a competing lifecycle layer.

**Alternatives considered**:
- Create a new persisted problem-state field on `operation_runs`: rejected because the feature is explicitly scoped away from schema change and because the current operator problem is cross-surface drift, not missing stored truth.
- Introduce a second enum family for `problem class`: rejected because the current freshness and outcome model already provides enough signal for a derived split.

## Decision: Introduce a thin derived split between `terminal follow-up` and `active stale/stuck attention`

**Rationale**: The current main mixing point is `OperationRun::dashboardNeedsFollowUp()`, which ORs terminal failures and `likelyStale()` runs into one bucket. The narrowest correction is to derive two explicit attention classes from existing outcome and freshness truth, not to create a dashboard or monitoring framework.

**Alternatives considered**:
- Keep `needs follow-up` as the only bucket and add more text: rejected because the spec requires operators to distinguish active stale from terminal problems without guesswork.
- Build a new cross-domain attention taxonomy: rejected as disproportionate for a monitoring-only hardening slice.
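
The derived split can be illustrated with a minimal, language-neutral sketch. It is written in Python rather than the repo's own code, and the `Run` shape, the status and outcome strings, and the `attention_class` function are all hypothetical. The `likely_stale` flag stands in for whatever the existing `likelyStale()` logic returns. The point is only that the two classes fall out of existing outcome and freshness truth without a new stored field.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AttentionClass(Enum):
    """Hypothetical names for the two derived attention classes."""
    TERMINAL_FOLLOW_UP = "terminal_follow_up"
    ACTIVE_STALE = "active_stale"


@dataclass(frozen=True)
class Run:
    status: str             # assumed values: "queued", "running", "completed"
    outcome: Optional[str]  # assumed values: "succeeded", "failed", or None while active
    likely_stale: bool      # stand-in for the result of the existing likelyStale() logic


def attention_class(run: Run) -> Optional[AttentionClass]:
    """Derive the attention class from existing truth; nothing is persisted."""
    active = run.status in ("queued", "running")
    if active and run.likely_stale:
        return AttentionClass.ACTIVE_STALE
    if not active and run.outcome == "failed":
        return AttentionClass.TERMINAL_FOLLOW_UP
    return None  # healthy active run or successful terminal run
```

Under these assumptions a run can only ever land in one class, which is what lets each surface label it the same way instead of folding both cases into a single "needs follow-up" bucket.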

## Decision: Extend existing model/query and presenter seams instead of introducing a new helper framework

**Rationale**: Existing seams already own most of the relevant logic: `OperationRun` scopes own query truth, `OperationUxPresenter` owns operator-facing guidance and notification wording, `ActiveRuns` already controls polling on several surfaces, and `OperationRunLinks` already owns canonical navigation. Extending those seams keeps the change local and consistent with the repo's bias toward extending existing seams over adding new frameworks.

**Alternatives considered**:
- Add a new reusable classification service or taxonomy registry: rejected because the slice only needs a thin derived split and would risk introducing a semantic framework the constitution explicitly disfavors.
- Push all logic into widget-local queries: rejected because that would preserve or worsen the existing drift.

## Decision: Apply the repo's conditional polling pattern to `BulkOperationProgress`

**Rationale**: Dashboard and recent-operation widgets already poll only while active runs exist. `BulkOperationProgress` currently refreshes its run list but does not apply the same conditional poll gate. Extending the existing 10-second active-run polling pattern to that component is the narrowest way to remove stale live-progress residue.

**Alternatives considered**:
- Keep the overlay event-driven only: rejected because the spec explicitly identifies stale UI residue after canonical truth changes without a new enqueue event.
- Add aggressive always-on polling: rejected because other repo surfaces already established a conditional active-run-only approach.
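
As a rough illustration of the conditional poll gate, the following Python sketch returns a polling interval only while at least one run is still active. The function name and the status strings are assumptions for illustration; only the 10-second interval comes from the text above. It does not reflect how `ActiveRuns` or `BulkOperationProgress` are actually implemented.

```python
from typing import Iterable, Optional

POLL_INTERVAL_SECONDS = 10  # the active-run interval named in the rationale

# Assumed status values for runs that still count as active.
ACTIVE_STATUSES = frozenset({"queued", "running"})


def poll_interval(run_statuses: Iterable[str]) -> Optional[int]:
    """Return the poll interval while any run is active, otherwise None.

    None means the surface should stop polling until a new run appears.
    """
    if any(status in ACTIVE_STATUSES for status in run_statuses):
        return POLL_INTERVAL_SECONDS
    return None
```

With a gate like this, the component keeps refreshing while work is in flight and goes quiet once every run is terminal, which avoids both stale residue and always-on polling.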

## Decision: Keep `BulkOperationProgress` as an active-only surface

**Rationale**: The overlay's job is to tell the operator about currently active work. Once a run becomes terminal or reconciled, the canonical follow-up story belongs on recent operations, attention surfaces, the monitoring hub, and detail pages. Removing terminal/reconciled runs from the overlay is the narrower change and preserves each surface's role.

**Alternatives considered**:
- Reclassify terminal or reconciled runs inside the overlay: rejected because it would turn the overlay into a second summary/attention widget with overlapping semantics.
- Leave terminal/reconciled runs visible until manual refresh: rejected because it directly violates the spec's trust objective.

## Decision: Keep `/admin/operations` as the sole canonical collection route and preserve problem class through filter state

**Rationale**: The repo already standardized canonical operations navigation on `/admin/operations` and `/admin/operations/{run}`. Dashboard, workspace, and notification entry points should carry tenant-safe problem-class filter or tab state into that route instead of opening new collection pages.

**Alternatives considered**:
- Add tenant-specific operations collection routes: rejected because the repo already treats canonical operations as workspace-context routes with tenant-safe filtering.
- Keep raw route links without filter continuity: rejected because that is the current trust-drift problem.
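
A minimal sketch of carrying problem-class state into the canonical route might look like the following. It is Python for illustration only: the query parameter names `tenant` and `problem` and the `operations_url` helper are hypothetical, and the sketch says nothing about how `OperationRunLinks` builds URLs or how tenant safety is enforced on the receiving page.

```python
from typing import Optional
from urllib.parse import urlencode

OPERATIONS_INDEX = "/admin/operations"  # canonical collection route from the text


def operations_url(problem_class: Optional[str] = None,
                   tenant_id: Optional[str] = None) -> str:
    """Build a link to the canonical collection route with optional filter state.

    Parameter names are assumptions; authorization of the tenant filter
    is out of scope here and must happen where the route is served.
    """
    params = {}
    if tenant_id is not None:
        params["tenant"] = tenant_id
    if problem_class is not None:
        params["problem"] = problem_class
    if not params:
        return OPERATIONS_INDEX
    return f"{OPERATIONS_INDEX}?{urlencode(params)}"
```

Every entry point then lands on the same page and differs only in filter state, so the operator sees the problem class they clicked on rather than an unfiltered list.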

## Decision: Keep `/system/ops/stuck` focused on active stale candidates and surface reconciled stale lineage elsewhere in system monitoring

**Rationale**: The `Stuck` page is already a clean read-only registry of queued/running runs that crossed the lifecycle threshold. Widening it to include completed reconciled runs would blur that page's purpose. The narrower fix is to keep `Stuck` active-only while ensuring `/system/ops/runs`, `/system/ops/failures`, and system detail explicitly reveal when a failed or completed run was auto-reconciled from a stale lifecycle state.

**Alternatives considered**:
- Add completed reconciled runs directly to `/system/ops/stuck`: rejected because it would collapse active-stuck and terminal-history semantics into one list.
- Leave reconciled stale lineage visible only on admin surfaces: rejected because Spec 178 requires the system truth chain to remain consistent too.
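
The separation between the active-only `Stuck` registry and reconciled lineage on other system pages can be sketched as two independent derivations. This is a hypothetical Python illustration: the `Run` shape and status strings are assumed, and treating the mere presence of a `reconciliation` key in `context` as the lineage marker is an assumption, since the actual structure of `context.reconciliation` is not described here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Run:
    id: int
    status: str        # assumed values: "queued", "running", "completed"
    likely_stale: bool  # stand-in for the existing likelyStale() logic
    context: Dict[str, Any] = field(default_factory=dict)


def stuck_candidates(runs: List[Run]) -> List[Run]:
    """Active-only registry: queued/running runs past the lifecycle threshold."""
    return [r for r in runs
            if r.status in ("queued", "running") and r.likely_stale]


def reconciled_from_stale(run: Run) -> bool:
    """Lineage marker for runs and failures lists (assumed: key presence is enough)."""
    return bool(run.context.get("reconciliation"))
```

Keeping the two derivations apart means a reconciled run never reappears on the stuck list, while its stale origin can still be shown wherever the terminal run is listed.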

## Decision: Promote stale/reconciled lifecycle truth through the existing canonical decision-zone seam

**Rationale**: Spec 164 already established the canonical run detail as a decision-first surface. The right fix is to strengthen that existing decision/current-state zone so it answers active vs reconciled vs terminal clearly. A new banner framework or a separate detail card would just reintroduce duplicated truth.

**Alternatives considered**:
- Add more banners above the current page: rejected because it would duplicate lifecycle emphasis rather than integrating it into the canonical decision hierarchy.
- Push the lifecycle answer into diagnostics only: rejected because the spec explicitly disallows that.

## Decision: Keep notification hardening on the existing `OperationRunCompleted` path

**Rationale**: Terminal notifications already route through `OperationUxPresenter::terminalDatabaseNotification()`. Extending that path to preserve stale/reconciled problem-class wording is the narrowest way to keep entry points aligned with canonical truth.

**Alternatives considered**:
- Introduce a new notification class for reconciled or stale-derived terminal runs: rejected because the spec explicitly avoids a notification redesign.
- Leave notifications unchanged and rely on the destination page only: rejected because an entry point must not read as calmer than the canonical truth it links to.