# Research — Operational Controls

**Date**: 2026-04-26
**Spec**: [spec.md](spec.md)

This document captures design decisions and supporting rationale for the first operational-controls slice. All decisions are grounded in current repository truth and the TenantPilot Constitution.

## Decision 1 — Persist only active pause records, derive the enabled state, and let global pauses win

**Decision**: Store only explicit active control activations that pause a control. Do not persist `enabled` rows or a broader multi-state lifecycle. The effective `enabled` state is derived from the absence of an active matching activation, and a matching global pause wins over a narrower workspace pause in v1.

**Rationale**:
- The operator problem is safe runtime pause control, not a new workflow state machine.
- Constitution `PERSIST-001` and `STATE-001` favor the smallest persisted truth that changes behavior.
- Deriving `enabled` avoids introducing a second layer of default-state maintenance.
- Global-first precedence is the safest bounded rule because a platform-wide incident pause must not be narrowed by a workspace-specific row in this first slice.

**Evidence**:
- The current code gap is an env-gated yes/no maintenance switch in `apps/platform/app/Filament/Resources/FindingResource/Pages/ListFindings.php`.
- The first slice only needs to answer one question at execution time: may this action start right now for this scope?
- The first slice does not support workspace-specific allow overrides, so no narrower row should reopen a globally paused control.
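The precedence rule can be illustrated with a minimal, language-neutral sketch. It is written in Python for brevity and is not the repository's PHP implementation; the `Activation` shape, its field names, the control keys, and the returned state labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Activation:
    """One active pause row. Field names are hypothetical."""
    control_key: str
    workspace_id: Optional[int]  # None marks a global pause


def effective_state(control_key: str, workspace_id: int,
                    activations: Sequence[Activation]) -> str:
    """Derive the effective state for one control in one workspace.

    There is no persisted 'enabled' row: enabled is simply the absence
    of any matching active activation.
    """
    matching = [a for a in activations if a.control_key == control_key]
    # A global pause wins; a narrower workspace row cannot reopen it.
    if any(a.workspace_id is None for a in matching):
        return "paused_global"
    if any(a.workspace_id == workspace_id for a in matching):
        return "paused_workspace"
    return "enabled"
```

Under these assumptions, a workspace with no matching row resolves to `enabled` without anything being stored for it, and adding a global row for the same control changes the answer for every workspace at once.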

**Alternatives considered**:
- Persist both `enabled` and `paused` rows.
  - Rejected: unnecessary state duplication; absence of an active pause already means enabled.
- Add a larger status family such as draft, scheduled, paused, forced, emergency.
  - Rejected: too broad for current-release truth.

## Decision 2 — Use one platform-operated activation table instead of env flags or workspace settings

**Decision**: Introduce one platform-operated `operational_control_activations` table that can represent either a global pause or a workspace-scoped pause. Do not split truth across env flags, platform config, and workspace settings.

**Rationale**:
- The spec requires one auditable control truth across system and tenant surfaces.
- Existing workspace settings infrastructure is workspace-only and cannot represent one global platform-wide safety state cleanly.
- Env flags are invisible product truth and require deploy-time coordination.

**Evidence**:
- Existing workspace settings writer only manages workspace-scoped settings in `apps/platform/app/Services/Settings/SettingsWriter.php`.
- The current env gate lives in `apps/platform/config/tenantpilot.php` and is consumed directly in `ListFindings`.
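As a rough sketch of how one table could hold both scopes, the following uses Python's standard `sqlite3` module with an in-memory database. Only the table name comes from this document; every column name, the control key format, and the choice to delete the row on resume are assumptions made for illustration.

```python
import sqlite3
from typing import Optional


def make_store() -> sqlite3.Connection:
    """Create an in-memory table; column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE operational_control_activations (
            id INTEGER PRIMARY KEY,
            control_key TEXT NOT NULL,
            workspace_id INTEGER,  -- NULL means a global pause
            reason TEXT NOT NULL
        )
        """
    )
    return conn


def pause(conn, control_key: str, workspace_id: Optional[int], reason: str) -> None:
    conn.execute(
        "INSERT INTO operational_control_activations "
        "(control_key, workspace_id, reason) VALUES (?, ?, ?)",
        (control_key, workspace_id, reason),
    )


def resume(conn, control_key: str, workspace_id: Optional[int]) -> None:
    # Assumption: resuming removes the active row. SQLite's IS operator
    # matches NULL as well as ordinary values.
    conn.execute(
        "DELETE FROM operational_control_activations "
        "WHERE control_key = ? AND workspace_id IS ?",
        (control_key, workspace_id),
    )


def is_paused(conn, control_key: str, workspace_id: int) -> bool:
    """True if a global row or a row for this workspace exists."""
    row = conn.execute(
        "SELECT 1 FROM operational_control_activations "
        "WHERE control_key = ? AND (workspace_id IS NULL OR workspace_id = ?) "
        "LIMIT 1",
        (control_key, workspace_id),
    ).fetchone()
    return row is not None
```

A single query answers the effective-state question for both scopes, which is the property that a split across env flags and workspace settings would lose.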

**Alternatives considered**:
- Reuse workspace settings for workspace overrides and keep a global env flag.
  - Rejected: split truth, inconsistent audit semantics, and no single effective-state evaluator.
- Use env flags only.
  - Rejected: not operator-visible or auditable in-product.

## Decision 3 — Evaluate controls at the start seam, not only in UI visibility

**Decision**: Integrate control evaluation at the concrete start seams that already own execution decisions: `FindingsLifecycleBackfillRunbookService::start()` for all findings lifecycle backfill callers, and queued restore execution before `OperationRun` or provider dispatch begins.

**Rationale**:
- UI-only hiding would fail the safety requirement because direct requests or stale page state could still start execution.
- The repo already has clear start seams where action or service logic decides whether a run begins.
- This keeps blocked-state truth server-side and shared.
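A minimal sketch of a start-seam guard, again in Python rather than the repository's PHP: the check runs inside the service's start method, before any side effect, so every caller gets the same answer regardless of what the UI displayed. The class name, result shape, reason strings, and control key are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StartResult:
    started: bool
    reason: str


@dataclass
class BackfillService:
    """Stand-in for a service that owns the start decision."""
    is_paused: Callable[[str, int], bool]  # injected control evaluator
    dispatched: List[int] = field(default_factory=list)

    def start(self, workspace_id: int) -> StartResult:
        # Evaluate the control first, so a blocked start leaves no trace
        # in the dispatch list (no run record, no queued job).
        if self.is_paused("findings.lifecycle_backfill", workspace_id):
            return StartResult(False, "blocked_by_operational_control")
        self.dispatched.append(workspace_id)
        return StartResult(True, "started")
```

Because the guard lives in the service, a page action, a console command, or a stale browser tab all pass through the same check.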

**Evidence**:
- Findings lifecycle backfill starts in `apps/platform/app/Services/Runbooks/FindingsLifecycleBackfillRunbookService.php` and is called from the system runbooks page, tenant findings page, `tenantpilot:findings:backfill-lifecycle`, and `tenantpilot:run-deploy-runbooks`.
- Restore execution starts in `apps/platform/app/Filament/Resources/RestoreRunResource.php` and already routes provider-backed starts through `apps/platform/app/Services/Providers/ProviderOperationStartGate.php`.

**Alternatives considered**:
- Hide or disable actions in UI only.
  - Rejected: violates the server-side enforcement requirement.

## Decision 4 — Add one system ops controls page instead of surface-local toggles

**Decision**: Manage the first-slice controls from one dedicated system ops page under `/system/ops/controls`. Do not add per-page toggles or bury control changes inside each affected surface. The page shows effective-state summaries by default and exposes change history through on-demand audit links instead of creating a second history surface.

**Rationale**:
- Operators need one place to make the runtime-safety decision itself.
- Constitution `DECIDE-001` and the spec’s decision-role table require a primary decision surface for control management.
- A shared control center prevents drift between runbooks, findings, and restore surfaces.

**Evidence**:
- The repo already groups ops surfaces under `apps/platform/app/Filament/System/Pages/Ops/`.
- Existing runbooks and run viewers are already system-plane ops surfaces, so a sibling controls page fits the current information architecture.

**Alternatives considered**:
- Add a toggle to the runbooks page only.
  - Rejected: restore execution is not owned by that page and the control decision would stay fragmented.

## Decision 5 — Break-glass does not bypass operational controls in v1

**Decision**: Break-glass sessions do not automatically bypass active operational controls in the first slice.

**Rationale**:
- Operational controls are introduced as runtime-safety truth, not as optional UI friction.
- An implicit bypass would make incident behavior ambiguous and weaken auditability.
- The first slice stays safer by forcing an explicit resume action before execution.
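The no-bypass rule amounts to the start check never consulting break-glass state. A small hypothetical sketch (function name, parameters, and control key are illustrative only):

```python
from typing import Set


def may_start(paused_controls: Set[str], control_key: str,
              break_glass_active: bool) -> bool:
    """Return True only when the control is not paused.

    break_glass_active is accepted to make the rule explicit, but it is
    intentionally ignored: an operator must resume the control first.
    """
    return control_key not in paused_controls


paused = {"restore.execute"}
blocked_even_with_break_glass = may_start(paused, "restore.execute", True)

# The explicit resume step is what reopens execution.
paused.discard("restore.execute")
allowed_after_resume = may_start(paused, "restore.execute", True)
```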

**Evidence**:
- The system runbook page already has break-glass-aware reason requirements via `BreakGlassSession`, but operational controls are a distinct safety layer.

**Alternatives considered**:
- Let break-glass ignore controls.
  - Rejected: too risky for v1 and not required by current operator pain.

## Decision 6 — Reuse existing audit and start-result helpers, but keep global audits platform-scoped

**Decision**: Record workspace-targeted control changes, and blocked-execution evidence that has concrete workspace or tenant context, through `WorkspaceAuditLogger` with `AuditActionId`. Record global control changes and blocked system-plane all-tenant attempts through `AuditRecorder` directly, so they remain platform-plane events without false workspace ownership, and include requested-scope metadata on those blocked attempts. Keep blocked/allowed execution messaging on the existing operation and provider start-result helpers.

**Rationale**:
- Constitution `XCUT-001` requires reuse of existing shared interaction paths.
- The repo already has shared primitives for queued toasts, dedupe messaging, and audit summaries.
- This avoids a second language for blocked execution.
- `WorkspaceAuditLogger` requires a `Workspace`, while `AuditRecorder` already supports null workspace and null tenant for truthful system-plane events.
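The routing rule can be sketched as a single branch on whether the event has a concrete workspace. This Python sketch only names the two audit services as string labels; the `AuditEvent` shape, the action strings, and the `requested_scope` metadata key are hypothetical, and real action IDs would come from `AuditActionId`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class AuditEvent:
    action: str
    workspace_id: Optional[int]
    metadata: Dict[str, str] = field(default_factory=dict)


def route_audit(action: str, workspace_id: Optional[int],
                requested_scope: Optional[str] = None) -> Tuple[str, AuditEvent]:
    """Pick the audit path for a control change or a blocked attempt."""
    if workspace_id is not None:
        # Concrete workspace context: use the workspace-scoped logger.
        return "WorkspaceAuditLogger", AuditEvent(action, workspace_id)
    # Platform-plane event: no workspace is attached, but the scope the
    # caller asked for is preserved as metadata when it is known.
    metadata = {"requested_scope": requested_scope} if requested_scope else {}
    return "AuditRecorder", AuditEvent(action, None, metadata)
```

Keeping the branch in one place avoids attaching an arbitrary workspace to a global event just to satisfy a logger's signature.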

**Evidence**:
- Audit logging lives in `apps/platform/app/Services/Audit/WorkspaceAuditLogger.php`.
- Global system-plane audit support lives in `apps/platform/app/Services/Audit/AuditRecorder.php`.
- Canonical audit IDs live in `apps/platform/app/Support/Audit/AuditActionId.php`.
- Provider-backed start messaging already routes through `ProviderOperationStartResultPresenter` and `OperationUxPresenter`.

**Alternatives considered**:
- Emit page-local notifications and free-form audit action strings.
  - Rejected: immediate drift risk and weaker reviewability.

## Decision 7 — Proof stays in Unit + Feature lanes only

**Decision**: Keep proof in focused unit and feature tests. Do not introduce browser tests or heavy-governance coverage for this first slice.

**Rationale**:
- The business truth is effective-state evaluation, audit recording, and blocked execution that leaves no side effects.
- Browser coverage would mostly duplicate existing Filament modal behavior.
- Constitution `TEST-GOV-001` requires the narrowest proving lane mix.

**Evidence**:
- Existing system runbooks and restore features already have focused feature coverage patterns in the repo.
- The new logic is server-side and deterministic.

**Alternatives considered**:
- Add browser smoke for pause/resume flows.
  - Rejected: not needed to prove the core runtime-safety semantics of this slice.