TenantAtlas/specs/233-stale-run-visibility/research.md
2026-04-23 17:06:09 +02:00

Research: Operation Run Active-State Visibility & Stale Escalation

Decision 1: Keep lifecycle freshness truth in the existing run model and reconciler

  • Decision: Use OperationRunFreshnessState, OperationRun::freshnessState(), OperationRun::problemClass(), and OperationLifecycleReconciler as the only lifecycle-truth inputs for this feature.
  • Rationale: The application already computes fresh_active, likely_stale, reconciled_failed, terminal_normal, and unknown from the run record plus OperationLifecyclePolicy. Canonical monitoring surfaces already rely on that truth, so adding a second stale heuristic would immediately recreate the drift this spec is trying to remove.
  • Alternatives considered:
    • Add new OperationRun.status values such as stale or late: rejected because the distinction is presentation- and triage-oriented, not a new persisted lifecycle state.
    • Add page-local thresholds per widget: rejected because it would create conflicting meaning across tenant, workspace, and canonical monitoring surfaces.
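
The single-source derivation described in Decision 1 can be sketched as follows. This is a minimal Python model, not the project's PHP code: the five state names come from the notes above, while the field names, the threshold values, and the classification rules are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class FreshnessState(Enum):
    # State names are taken from the research notes.
    FRESH_ACTIVE = "fresh_active"
    LIKELY_STALE = "likely_stale"
    RECONCILED_FAILED = "reconciled_failed"
    TERMINAL_NORMAL = "terminal_normal"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class LifecyclePolicy:
    # Hypothetical thresholds; the real OperationLifecyclePolicy may differ.
    queued_stale_after: timedelta = timedelta(minutes=10)
    running_stale_after: timedelta = timedelta(minutes=30)


@dataclass(frozen=True)
class Run:
    # Hypothetical, simplified stand-in for the run record.
    status: str
    last_activity_at: Optional[datetime]
    reconciled: bool = False  # assumed flag: a reconciler force-failed the run


def freshness_state(run: Run, policy: LifecyclePolicy, now: datetime) -> FreshnessState:
    """Derive freshness from the run record plus policy; nothing new is persisted."""
    if run.status in ("queued", "running"):
        if run.last_activity_at is None:
            return FreshnessState.UNKNOWN
        if run.status == "queued":
            limit = policy.queued_stale_after
        else:
            limit = policy.running_stale_after
        if now - run.last_activity_at > limit:
            return FreshnessState.LIKELY_STALE
        return FreshnessState.FRESH_ACTIVE
    if run.status == "completed":
        if run.reconciled:
            return FreshnessState.RECONCILED_FAILED
        return FreshnessState.TERMINAL_NORMAL
    return FreshnessState.UNKNOWN


now = datetime(2026, 4, 23, 12, 0, tzinfo=timezone.utc)
stale = Run(status="running", last_activity_at=now - timedelta(minutes=45))
print(freshness_state(stale, LifecyclePolicy(), now).value)  # likely_stale
```

Because every surface would call the same function with the same policy, a widget cannot disagree with the canonical list about whether a run is stale, which is the drift the rejected alternatives would reintroduce.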

Decision 2: Reuse the existing Ops UX presenter path before introducing a new helper

  • Decision: Prefer OperationUxPresenter::decisionZoneTruth(), lifecycleAttentionSummary(), surfaceGuidance(), and centralized badge rendering as the presentation backbone.
  • Rationale: The code already exposes a derived decision-zone payload and shared stale/reconciled copy. OperationRunStatusBadge already renders Likely stale when queued/running work carries freshness_state=likely_stale, and OperationUxPresenter already provides compact and diagnostic explanations off the same truth.
  • Alternatives considered:
    • New dedicated presenter family for active-state visibility: rejected unless the existing presenter path proves insufficient during implementation.
    • Widget-local copy branches: rejected because they would increase semantic spread and regression risk.
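
The centralized badge rule mentioned in Decision 2 might look roughly like the sketch below. It is an illustrative Python model rather than the actual OperationRunStatusBadge: only the "Likely stale" label and its trigger condition come from the notes, and the function name and fallback labels are hypothetical.

```python
def status_badge_label(status: str, freshness_state: str) -> str:
    """Return badge text for a run; every surface would call this one function."""
    # Rule from the notes: queued/running work that carries
    # freshness_state=likely_stale is labeled "Likely stale".
    if status in ("queued", "running") and freshness_state == "likely_stale":
        return "Likely stale"
    # Fallback labels are an assumption for illustration only.
    return status.replace("_", " ").capitalize()
```

Keeping the rule in one place means a widget has no copy branch of its own to drift out of sync, which is the regression risk named in the rejected alternative.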

Decision 3: Treat stale-active runs as still active for tenant progress visibility

  • Decision: Change tenant-local active-progress visibility to include freshness-elevated active runs rather than suppressing them via healthyActive().
  • Rationale: BulkOperationProgress and ActiveRuns::existForTenantId() previously used healthyActive(). As a result, stale queued/running work disappeared from the tenant progress overlay, and polling stopped once only stale runs remained. That was the clearest concrete contradiction with the canonical monitoring surfaces.
  • Alternatives considered:
    • Keep stale runs hidden in the progress overlay and rely on dashboard/list only: rejected because the spec explicitly covers tenant-local active-run cards and progress summaries.
    • Add a separate stale-only overlay: rejected because it would create a second active-work surface family instead of fixing the existing one.
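
The contradiction described in Decision 3 can be reproduced with a small model. The sketch below is Python, not the project's code, and it assumes that healthyActive() means "queued or running and fresh_active"; the record fields and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

ACTIVE_STATUSES = ("queued", "running")


@dataclass(frozen=True)
class Run:
    # Hypothetical, simplified run record.
    tenant_id: int
    status: str
    freshness_state: str


def healthy_active(runs: List[Run]) -> List[Run]:
    # Assumed meaning of the old scope: active AND still fresh.
    return [r for r in runs
            if r.status in ACTIVE_STATUSES and r.freshness_state == "fresh_active"]


def active_including_stale(runs: List[Run]) -> List[Run]:
    # Proposed scope: any queued/running run, regardless of freshness.
    return [r for r in runs if r.status in ACTIVE_STATUSES]


def should_poll(runs: List[Run], tenant_id: int,
                scope: Callable[[List[Run]], List[Run]]) -> bool:
    """The overlay keeps polling while the scope yields a run for the tenant."""
    return any(r.tenant_id == tenant_id for r in scope(runs))


only_stale = [Run(tenant_id=1, status="running", freshness_state="likely_stale")]
print(should_poll(only_stale, 1, healthy_active))          # False
print(should_poll(only_stale, 1, active_including_stale))  # True
```

With the old scope, the one run that most needs attention is exactly the one that stops the overlay from refreshing; widening the scope keeps it visible without adding a second overlay.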

Decision 4: Preserve current surface roles and drill-through flow

  • Decision: Keep the current route and surface model: tenant dashboard and tenant progress remain secondary context, /admin/operations remains the primary triage list, and /admin/operations/{run} remains diagnostic-first.
  • Rationale: Existing links already converge through OperationRunLinks, and current pages/widgets match the constitution's decision-first model. The gap is the honesty of compact active-state messaging, not missing routes.
  • Alternatives considered:
    • New operations hub or new tenant-local detail page: rejected as unnecessary workflow expansion.
    • New notification channel for stale active work: rejected because the spec explicitly excludes new notification behavior.
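
As a rough illustration of links converging through one helper, the sketch below models the two routes named in Decision 4. It is a Python stand-in for the idea behind OperationRunLinks, not its real interface: the function names and the integer run identifier are assumptions.

```python
def operations_index_url() -> str:
    """Primary triage list (route taken from the notes)."""
    return "/admin/operations"


def operation_run_url(run_id: int) -> str:
    """Diagnostic-first detail page for one run (route taken from the notes)."""
    # Assumes the {run} route parameter is a numeric id.
    return f"/admin/operations/{run_id}"
```

Tenant dashboard and progress surfaces would call these helpers instead of building URLs themselves, so no new route is needed to fix the messaging gap.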

Decision 5: Extend existing focused tests and invert stale-hidden assumptions where necessary

  • Decision: Update existing monitoring, Filament, and Ops UX tests rather than creating a new broad suite.
  • Rationale: The repository already has focused coverage for lifecycle presentation and tenant progress behavior. In particular, BulkOperationProgressDbOnlyTest and ProgressWidgetFiltersTest currently codify the stale-hidden behavior that this feature must deliberately replace.
  • Alternatives considered:
    • Add a brand-new browser suite: rejected because feature tests already cover the underlying business truth and UI copy.
    • Leave old progress-widget tests untouched and add parallel tests: rejected because the old assertions would preserve the wrong contract.

Decision 6: Keep “past expected lifecycle” and “likely stale” as density-specific labels over the same stale truth

  • Decision: Model compact “past expected lifecycle” phrasing and stronger “likely stale” diagnostic phrasing as different density outputs over the same likely_stale freshness truth rather than as separate persisted states.
  • Rationale: The spec allows the same meaning to be expressed at different densities. The current code already points in that direction: OperationUxPresenter::surfaceGuidance() says the run is “past its lifecycle window,” while OperationRunStatusBadge can label the same run Likely stale.
  • Alternatives considered:
    • Create two separate freshness states for “late” and “likely stale”: rejected because existing lifecycle truth has only one stale boundary and no additional behavioral consequence.
    • Collapse all stale-active copy to a single label everywhere: rejected because compact surfaces and canonical detail need different density without changing meaning.
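
The "same truth, different density" idea in Decision 6 can be sketched as a lookup keyed by surface density. This is an illustrative Python model: the two phrases are quoted from the notes, while the Density enum, the table, and the function name are hypothetical.

```python
from enum import Enum
from typing import Optional


class Density(Enum):
    COMPACT = "compact"        # e.g. tenant cards and progress summaries
    DIAGNOSTIC = "diagnostic"  # e.g. canonical list and run detail


# One persisted-free truth (likely_stale), two phrasings.
STALE_COPY = {
    Density.COMPACT: "Past expected lifecycle",
    Density.DIAGNOSTIC: "Likely stale",
}


def stale_label(freshness_state: str, density: Density) -> Optional[str]:
    """Return stale copy for the given density, or None when the run is not stale."""
    if freshness_state != "likely_stale":
        return None
    return STALE_COPY[density]
```

Both labels are selected by the same likely_stale check, so neither phrasing can appear for a run the other surface considers fresh, and no second freshness state is needed.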