TenantAtlas/specs/178-ops-truth-alignment/research.md
2026-04-05 23:40:45 +02:00

Phase 0 Research: Operations Lifecycle Alignment & Cross-Surface Truth Consistency

Decision: Reuse the existing freshness and reconciliation model as the lifecycle base

Rationale: The repo already has OperationRunFreshnessState, OperationLifecyclePolicy, likelyStale() model scope logic, and context.reconciliation metadata. The trust problem is not missing lifecycle truth; it is that different surfaces summarize or present that truth differently. Reusing the current freshness/reconciliation model is the narrower option and keeps Spec 178 aligned with Spec 160 instead of creating a competing lifecycle layer.

Alternatives considered:

  • Create a new persisted problem-state field on operation_runs: rejected because the feature is explicitly scoped away from schema change and because the current operator problem is cross-surface drift, not missing stored truth.
  • Introduce a second enum family for problem class: rejected because the current freshness and outcome model already provides enough signal for a derived split.

Decision: Introduce a thin derived split between terminal follow-up and active stale/stuck attention

Rationale: The main mixing point today is OperationRun::dashboardNeedsFollowUp(), which ORs terminal failures and likelyStale() runs into one bucket. The narrowest correction is to derive two explicit attention classes from existing outcome and freshness truth, not to create a dashboard or monitoring framework.

Alternatives considered:

  • Keep "needs follow-up" as the only bucket and add more text: rejected because the spec requires operators to distinguish active stale from terminal problems without guesswork.
  • Build a new cross-domain attention taxonomy: rejected as disproportionate for a monitoring-only hardening slice.
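
The derived split can be illustrated with a minimal, language-neutral sketch. It is not the repo's implementation: `Run`, `AttentionClass`, `attention_class`, the status/outcome values, and the assumption that only a `failed` outcome counts as terminal follow-up are all hypothetical stand-ins for the existing outcome and freshness truth.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AttentionClass(Enum):
    # Hypothetical names for the two derived classes.
    TERMINAL_FOLLOW_UP = "terminal_follow_up"
    ACTIVE_STALE = "active_stale"


@dataclass(frozen=True)
class Run:
    status: str         # assumed values: "queued", "running", "completed"
    outcome: str        # assumed values: "pending", "succeeded", "failed"
    likely_stale: bool  # stands in for the existing likelyStale() truth


def attention_class(run: Run) -> Optional[AttentionClass]:
    """Derive the attention class from existing outcome and freshness truth."""
    is_active = run.status in ("queued", "running")
    if is_active and run.likely_stale:
        return AttentionClass.ACTIVE_STALE
    if run.status == "completed" and run.outcome == "failed":
        return AttentionClass.TERMINAL_FOLLOW_UP
    return None


def needs_follow_up(run: Run) -> bool:
    """The former single bucket, expressed as the union of both classes."""
    return attention_class(run) is not None
```

Nothing is persisted in this sketch: both classes are computed on read, which matches the decision to avoid a schema change, and the old combined bucket remains available as the union of the two.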

Decision: Extend existing model/query and presenter seams instead of introducing a new helper framework

Rationale: Existing seams already own most of the relevant logic: OperationRun scopes own query truth, OperationUxPresenter owns operator-facing guidance and notification wording, ActiveRuns already controls polling on several surfaces, and OperationRunLinks already owns canonical navigation. Extending those seams keeps the change local and consistent with repo bias.

Alternatives considered:

  • Add a new reusable classification service or taxonomy registry: rejected because the slice only needs a thin derived split and would risk introducing a semantic framework the constitution explicitly disfavors.
  • Push all logic into widget-local queries: rejected because that would preserve or worsen the existing drift.

Decision: Apply the repo's conditional polling pattern to BulkOperationProgress

Rationale: Dashboard and recent-operation widgets already poll only while active runs exist. BulkOperationProgress currently refreshes its run list but is not gated on active runs in the same way. Extending the existing 10-second active-run polling pattern to that component is the narrowest way to remove stale live-progress residue.

Alternatives considered:

  • Keep the overlay event-driven only: rejected because the spec explicitly identifies stale UI residue after canonical truth changes without a new enqueue event.
  • Add aggressive always-on polling: rejected because other repo surfaces already established a conditional active-run-only approach.
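
A conditional poll gate of this kind might look like the sketch below. The 10-second interval comes from the rationale above; the function name `poll_interval`, the status values, and the convention that `None` means "do not poll" are assumptions made for illustration.

```python
from typing import Iterable, Optional

# Assumed status values for runs that still count as active.
ACTIVE_STATUSES = ("queued", "running")
# The 10-second active-run polling interval named in the rationale.
POLL_INTERVAL = "10s"


def poll_interval(run_statuses: Iterable[str]) -> Optional[str]:
    """Return the poll interval while any run is active, else None (no polling)."""
    if any(status in ACTIVE_STATUSES for status in run_statuses):
        return POLL_INTERVAL
    return None
```

With this shape the component keeps polling only while it has something live to show, and stops on the first refresh that finds no active runs.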

Decision: Keep BulkOperationProgress as an active-only surface

Rationale: The overlay's job is to tell the operator about currently active work. Once a run becomes terminal or reconciled, the canonical follow-up story belongs on recent operations, attention surfaces, the monitoring hub, and detail pages. Removing terminal/reconciled runs from the overlay is narrower and preserves surface roles.

Alternatives considered:

  • Reclassify terminal or reconciled runs inside the overlay: rejected because it would turn the overlay into a second summary/attention widget with overlapping semantics.
  • Leave terminal/reconciled runs visible until manual refresh: rejected because it directly violates the spec's trust objective.
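
As a rough sketch of the active-only rule, the overlay's run list could be filtered as follows. The dictionary shape, the `overlay_runs` name, and the status values are hypothetical; only the idea of a `context.reconciliation` marker is taken from the existing model.

```python
from typing import Any, Dict, List


def overlay_runs(runs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only runs that are still active and carry no reconciliation marker."""
    visible = []
    for run in runs:
        is_active = run.get("status") in ("queued", "running")
        # Treat any non-empty context.reconciliation value as "reconciled".
        reconciled = bool((run.get("context") or {}).get("reconciliation"))
        if is_active and not reconciled:
            visible.append(run)
    return visible
```

Runs dropped by this filter are not lost; they simply become the responsibility of the surfaces that own follow-up.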

Decision: Keep /admin/operations as the sole canonical collection route and preserve problem class through filter state

Rationale: The repo already standardized canonical operations navigation on /admin/operations and /admin/operations/{run}. Dashboard, workspace, and notification entry points should carry tenant-safe problem-class filter or tab state into that route instead of opening new collection pages.

Alternatives considered:

  • Add tenant-specific operations collection routes: rejected because the repo already treats canonical operations as workspace-context routes with tenant-safe filtering.
  • Keep raw route links without filter continuity: rejected because that is the current trust-drift problem.
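
One way to picture filter continuity is a single link builder that always targets the canonical collection route and only varies the query string. The route `/admin/operations` is from the text above; the function name and the `tenant` and `problem` query parameter names are invented for this sketch and are not the repo's actual filter keys.

```python
from typing import Optional
from urllib.parse import urlencode

CANONICAL_COLLECTION = "/admin/operations"


def operations_url(problem_class: Optional[str] = None,
                   tenant_id: Optional[str] = None) -> str:
    """Build a link to the canonical collection, carrying filter state if given."""
    params = {}
    if tenant_id is not None:
        params["tenant"] = tenant_id      # hypothetical parameter name
    if problem_class is not None:
        params["problem"] = problem_class  # hypothetical parameter name
    if not params:
        return CANONICAL_COLLECTION
    return f"{CANONICAL_COLLECTION}?{urlencode(params)}"
```

For example, `operations_url("active_stale", "42")` yields `/admin/operations?tenant=42&problem=active_stale`, while a call with no arguments yields the bare canonical route.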

Decision: Keep /system/ops/stuck focused on active stale candidates and surface reconciled stale lineage elsewhere in system monitoring

Rationale: The Stuck page is already a clean read-only registry of queued/running runs that crossed the lifecycle threshold. Widening it to include completed reconciled runs would blur that page's purpose. The narrower fix is to keep Stuck active-only while ensuring /system/ops/runs, /system/ops/failures, and system detail explicitly reveal when a failed or completed run was auto-reconciled from a stale lifecycle state.

Alternatives considered:

  • Add completed reconciled runs directly to /system/ops/stuck: rejected because it would collapse active-stuck and terminal-history semantics into one list.
  • Leave reconciled stale lineage visible only on admin surfaces: rejected because Spec 178 requires the system truth chain to remain consistent too.
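
The division of labour between the Stuck page and the other system surfaces can be sketched as two small predicates. The dictionary shape, the function names, and the note wording are assumptions for illustration only.

```python
from typing import Any, Dict, Optional

# Assumed status values for runs that are still in flight.
ACTIVE_STATUSES = ("queued", "running")


def belongs_on_stuck_page(run: Dict[str, Any]) -> bool:
    """Stuck stays active-only: queued/running runs past the lifecycle threshold."""
    return run.get("status") in ACTIVE_STATUSES and bool(run.get("likely_stale"))


def reconciled_lineage_note(run: Dict[str, Any]) -> Optional[str]:
    """Note for runs/failures/detail views when a terminal run was auto-reconciled."""
    reconciliation = (run.get("context") or {}).get("reconciliation")
    if run.get("status") in ACTIVE_STATUSES or not reconciliation:
        return None
    return "Auto-reconciled from a stale lifecycle state"
```

A reconciled run therefore never appears on the Stuck page, but it still carries its stale lineage wherever terminal runs are listed.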

Decision: Promote stale/reconciled lifecycle truth through the existing canonical decision-zone seam

Rationale: Spec 164 already established the canonical run detail as a decision-first surface. The right fix is to strengthen that existing decision/current-state zone so it answers active vs reconciled vs terminal clearly. A new banner framework or a separate detail card would just reintroduce duplicated truth.

Alternatives considered:

  • Add more banners above the current page: rejected because it would duplicate lifecycle emphasis rather than integrating it into the canonical decision hierarchy.
  • Push the lifecycle answer into diagnostics only: rejected because the spec explicitly disallows that.

Decision: Keep notification hardening on the existing OperationRunCompleted path

Rationale: Terminal notifications already route through OperationUxPresenter::terminalDatabaseNotification(). Extending that path to preserve stale/reconciled problem-class wording is the narrowest way to keep entry points aligned with canonical truth.

Alternatives considered:

  • Introduce a new notification class for reconciled or stale-derived terminal runs: rejected because the spec explicitly avoids a notification redesign.
  • Leave notifications unchanged and rely on the destination page only: rejected because entry-point wording must not read as calmer than the truth it links to.