TenantAtlas/specs/242-operational-controls/research.md
Research — Operational Controls

Date: 2026-04-26
Spec: spec.md

This document captures design decisions and supporting rationale for the first operational-controls slice. All decisions are grounded in current repository truth and the TenantPilot Constitution.

Decision 1 — Persist only active pause records, derive the enabled state, and let global pauses win

Decision: Store only explicit active control activations that pause a control. Do not persist enabled rows or a broader multi-state lifecycle. The effective enabled state is derived from the absence of an active matching activation, and a matching global pause wins over a narrower workspace pause in v1.

Rationale:

  • The operator problem is safe runtime pause control, not a new workflow state machine.
  • Constitution PERSIST-001 and STATE-001 favor the smallest persisted truth that changes behavior.
  • Deriving enabled avoids importing a second layer of default-state maintenance.
  • Global-first precedence is the safest bounded rule because a platform-wide incident pause must not be narrowed by a workspace-specific row in this first slice.

Evidence:

  • The current code gap is an env-gated yes/no maintenance switch in apps/platform/app/Filament/Resources/FindingResource/Pages/ListFindings.php.
  • The first slice only needs to answer one question at execution time: may this action start right now for this scope?
  • The first slice does not support workspace-specific allow overrides, so no narrower row should reopen a globally paused control.
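
The derivation described in Decision 1 can be sketched as a small pure function. This is an illustrative Python sketch, not the repository's PHP code: the `Activation` fields, the control keys used in the usage note, and the returned state labels are all hypothetical. It also assumes that a request with no workspace context is affected only by global pauses, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Activation:
    """One active pause row. Field names are hypothetical."""
    control_key: str
    workspace_id: Optional[int]  # None means a global pause


def effective_state(
    control_key: str,
    workspace_id: Optional[int],
    active: Iterable[Activation],
) -> str:
    """Return 'paused_global', 'paused_workspace', or 'enabled'."""
    matching = [a for a in active if a.control_key == control_key]
    # Global pauses are checked first, so a workspace row can never narrow them.
    if any(a.workspace_id is None for a in matching):
        return "paused_global"
    if workspace_id is not None and any(
        a.workspace_id == workspace_id for a in matching
    ):
        return "paused_workspace"
    # No matching active pause: "enabled" is derived, never stored.
    return "enabled"
```

With one global pause on a hypothetical `restore.execute` key and one workspace pause on `findings.backfill` for workspace 7, workspace 7 sees both controls as paused, while workspace 8 sees only the global pause.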

Alternatives considered:

  • Persist both enabled and paused rows.
    • Rejected: unnecessary state duplication; absence of an active pause already means enabled.
  • Add a larger status family such as draft, scheduled, paused, forced, emergency.
    • Rejected: too broad for current-release truth.

Decision 2 — Use one platform-operated activation table instead of env flags or workspace settings

Decision: Introduce one platform-operated operational_control_activations table that can represent either a global pause or a workspace-scoped pause. Do not split truth across env flags, platform config, and workspace settings.

Rationale:

  • The spec requires one auditable control truth across system and tenant surfaces.
  • Existing workspace settings infrastructure is workspace-only and cannot represent one global platform-wide safety state cleanly.
  • Env flags are product truth that operators cannot see in-product, and changing them requires deploy-time coordination.
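
To illustrate how one table can hold both scopes, here is a minimal sketch using Python's built-in `sqlite3` module rather than the repository's actual migration. Only the table name `operational_control_activations` comes from this document. The columns (`control_key`, `workspace_id`, `reason`) are assumptions, and how a pause ends (row deletion or an end timestamp) is deliberately left out.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical columns: a NULL workspace_id marks a global pause.
    conn.execute(
        """
        CREATE TABLE operational_control_activations (
            id INTEGER PRIMARY KEY,
            control_key TEXT NOT NULL,
            workspace_id INTEGER NULL,
            reason TEXT NOT NULL
        )
        """
    )


def active_pauses(conn, control_key, workspace_id):
    """Return (id, workspace_id, reason) rows that apply to this scope."""
    return conn.execute(
        """
        SELECT id, workspace_id, reason
        FROM operational_control_activations
        WHERE control_key = ?
          AND (workspace_id IS NULL OR workspace_id = ?)
        ORDER BY workspace_id IS NOT NULL, id  -- global rows first
        """,
        (control_key, workspace_id),
    ).fetchall()
```

Because both scopes live in one table, a single query returns every row that can affect a given workspace, which is what makes a single effective-state evaluator possible.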

Evidence:

  • Existing workspace settings writer only manages workspace-scoped settings in apps/platform/app/Services/Settings/SettingsWriter.php.
  • The current env gate lives in apps/platform/config/tenantpilot.php and is consumed directly in ListFindings.

Alternatives considered:

  • Reuse workspace settings for workspace overrides and keep a global env flag.
    • Rejected: split truth, inconsistent audit semantics, and no single effective-state evaluator.
  • Use env flags only.
    • Rejected: not operator-visible or auditable in-product.

Decision 3 — Evaluate controls at the start seam, not only in UI visibility

Decision: Integrate control evaluation at the concrete start seams that already own execution decisions: FindingsLifecycleBackfillRunbookService::start() for all findings lifecycle backfill callers, and queued restore execution before OperationRun or provider dispatch begins.

Rationale:

  • UI-only hiding would fail the safety requirement because direct requests or stale page state could still start execution.
  • The repo already has clear start seams where action or service logic decides whether a run begins.
  • This keeps blocked-state truth server-side and shared.
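
The difference between UI hiding and a start-seam check can be shown with a small sketch. This is illustrative Python, not the repository's `FindingsLifecycleBackfillRunbookService`: the `RunbookService` class, the `StartResult` shape, the `control_paused` reason string, and the control keys in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StartResult:
    started: bool
    reason: str = ""


@dataclass
class RunbookService:
    """Hypothetical stand-in for a service that owns a start seam."""
    is_paused: Callable[[str], bool]  # shared control evaluator
    runs: List[str] = field(default_factory=list)

    def start(self, control_key: str, run_name: str) -> StartResult:
        # The check lives inside start(), so every caller (page action,
        # console command, deploy runbook) passes through the same gate.
        if self.is_paused(control_key):
            return StartResult(started=False, reason="control_paused")
        # Side effects happen only after the gate has allowed the start.
        self.runs.append(run_name)
        return StartResult(started=True)
```

A caller that bypasses the page, for example `RunbookService(is_paused=lambda k: k == "findings.backfill").start("findings.backfill", "run-1")`, still receives a blocked result and no run is recorded.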

Evidence:

  • Findings lifecycle backfill starts in apps/platform/app/Services/Runbooks/FindingsLifecycleBackfillRunbookService.php and is called from the system runbooks page, tenant findings page, tenantpilot:findings:backfill-lifecycle, and tenantpilot:run-deploy-runbooks.
  • Restore execution starts in apps/platform/app/Filament/Resources/RestoreRunResource.php and already routes provider-backed starts through apps/platform/app/Services/Providers/ProviderOperationStartGate.php.

Alternatives considered:

  • Hide or disable actions in UI only.
    • Rejected: violates the server-side enforcement requirement.

Decision 4 — Add one system ops controls page instead of surface-local toggles

Decision: Manage the first-slice controls from one dedicated system ops page under /system/ops/controls. Do not add per-page toggles or bury control changes inside each affected surface. The page shows effective-state summaries by default and exposes change history through on-demand audit links instead of creating a second history surface.

Rationale:

  • Operators need one place to make the runtime-safety decision itself.
  • Constitution DECIDE-001 and the spec's decision-role table require a primary decision surface for control management.
  • A shared control center prevents drift between runbooks, findings, and restore surfaces.

Evidence:

  • The repo already groups ops surfaces under apps/platform/app/Filament/System/Pages/Ops/.
  • Existing runbooks and run viewers are already system-plane ops surfaces, so a sibling controls page fits the current information architecture.

Alternatives considered:

  • Add a toggle to the runbooks page only.
    • Rejected: restore execution is not owned by that page and the control decision would stay fragmented.

Decision 5 — Break-glass does not bypass operational controls in v1

Decision: Break-glass sessions do not automatically bypass active operational controls in the first slice.

Rationale:

  • Operational controls are introduced as runtime-safety truth, not as optional UI friction.
  • An implicit bypass would make incident behavior ambiguous and weaken auditability.
  • The first slice stays safer by forcing an explicit resume action before execution.
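
The rule can be sketched as follows, in illustrative Python rather than the repository's PHP. The `Controls` class and its method names are hypothetical. The `break_glass` parameter exists only to make explicit that it has no effect on the outcome.

```python
class Controls:
    """Hypothetical in-memory model of pause and resume."""

    def __init__(self) -> None:
        self._paused = set()

    def pause(self, control_key: str) -> None:
        self._paused.add(control_key)

    def resume(self, control_key: str) -> None:
        self._paused.discard(control_key)

    def may_start(self, control_key: str, *, break_glass: bool = False) -> bool:
        # break_glass is intentionally ignored: only an explicit resume
        # reopens a paused control in this sketch.
        return control_key not in self._paused
```

After `pause("restore.execute")`, `may_start("restore.execute", break_glass=True)` stays false until `resume("restore.execute")` is called.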

Evidence:

  • The system runbook page already has break-glass-aware reason requirements via BreakGlassSession, but operational controls are a distinct safety layer.

Alternatives considered:

  • Let break-glass ignore controls.
    • Rejected: too risky for v1 and not required by current operator pain.

Decision 6 — Reuse existing audit and start-result helpers, but keep global audits platform-scoped

Decision: Record workspace-targeted control changes, and blocked-execution evidence that has concrete workspace or tenant context, through WorkspaceAuditLogger with AuditActionId. Record global control changes and blocked system-plane all-tenant attempts through AuditRecorder directly, so they remain platform-plane events without false workspace ownership, and include requested-scope metadata on those platform-plane blocked attempts. Keep blocked/allowed execution messaging on the existing operation/provider start-result helpers.

Rationale:

  • Constitution XCUT-001 requires reuse of existing shared interaction paths.
  • The repo already has shared primitives for queued toasts, dedupe messaging, and audit summaries.
  • This avoids a second language for blocked execution.
  • WorkspaceAuditLogger requires a Workspace, while AuditRecorder already supports null workspace and null tenant for truthful system-plane events.
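
The routing rule in Decision 6 can be sketched as a single function. This is illustrative Python and does not reflect the real signatures of `WorkspaceAuditLogger` or `AuditRecorder`: the `AuditEvent` shape, the `plane` label, the action strings, and the `requested_scope` metadata key are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class AuditEvent:
    action: str
    workspace_id: Optional[int]
    tenant_id: Optional[int]
    metadata: Dict[str, Any]
    plane: str  # "workspace" or "platform"


def build_audit_event(
    action: str,
    workspace_id: Optional[int],
    tenant_id: Optional[int] = None,
    requested_scope: Optional[str] = None,
) -> AuditEvent:
    if workspace_id is not None:
        # Concrete workspace context: the event is owned by that workspace.
        return AuditEvent(action, workspace_id, tenant_id, {}, "workspace")
    # Global change or all-tenant attempt: no workspace or tenant is
    # attached, and the requested scope is carried as metadata instead.
    metadata = {"requested_scope": requested_scope} if requested_scope else {}
    return AuditEvent(action, None, None, metadata, "platform")
```

A blocked all-tenant attempt therefore produces an event with no workspace or tenant, while the scope the operator asked for remains visible in the metadata.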

Evidence:

  • Audit logging lives in apps/platform/app/Services/Audit/WorkspaceAuditLogger.php.
  • Global system-plane audit support lives in apps/platform/app/Services/Audit/AuditRecorder.php.
  • Canonical audit IDs live in apps/platform/app/Support/Audit/AuditActionId.php.
  • Provider-backed start messaging already routes through ProviderOperationStartResultPresenter and OperationUxPresenter.

Alternatives considered:

  • Emit page-local notifications and free-form audit action strings.
    • Rejected: immediate drift risk and weaker reviewability.

Decision 7 — Proof stays in Unit + Feature lanes only

Decision: Keep proof in focused unit and feature tests. Do not introduce browser tests or heavy-governance coverage for this first slice.

Rationale:

  • The business truth is effective-state evaluation, audit recording, and blocked/no-side-effect execution.
  • Browser coverage would mostly duplicate existing Filament modal behavior.
  • Constitution TEST-GOV-001 requires the narrowest proving lane mix.

Evidence:

  • Existing system runbooks and restore features already have focused feature coverage patterns in the repo.
  • The new logic is server-side and deterministic.

Alternatives considered:

  • Add browser smoke for pause/resume flows.
    • Rejected: not needed to prove the core runtime-safety semantics of this slice.