# Research: Operational Controls

**Date**: 2026-04-26

**Spec**: [spec.md](spec.md)

This document captures design decisions and supporting rationale for the first operational-controls slice. All decisions are grounded in current repository truth and the TenantPilot Constitution.

## Decision 1: Persist only active pause records, derive the enabled state, and let global pauses win

**Decision**: Store only explicit active control activations that pause a control. Do not persist `enabled` rows or a broader multi-state lifecycle. The effective `enabled` state is derived from the absence of an active matching activation, and a matching global pause wins over a narrower workspace pause in v1.

**Rationale**:

- The operator problem is safe runtime pause control, not a new workflow state machine.
- Constitution `PERSIST-001` and `STATE-001` favor the smallest persisted truth that changes behavior.
- Deriving `enabled` avoids importing a second layer of default-state maintenance.
- Global-first precedence is the safest bounded rule because a platform-wide incident pause must not be narrowed by a workspace-specific row in this first slice.

**Evidence**:

- The current code gap is an env-gated yes/no maintenance switch in `apps/platform/app/Filament/Resources/FindingResource/Pages/ListFindings.php`.
- The first slice only needs to answer one question at execution time: may this action start right now for this scope?
- The first slice does not support workspace-specific allow overrides, so no narrower row should reopen a globally paused control.

**Alternatives considered**:

- Persist both `enabled` and `paused` rows.
  - Rejected: unnecessary state duplication; absence of an active pause already means enabled.
- Add a larger status family such as draft, scheduled, paused, forced, emergency.
  - Rejected: too broad for current-release truth.

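
The derivation rule in Decision 1 can be modeled in a few lines. The following is a minimal, language-neutral sketch in Python, not repository code: the `Activation` shape, the `EffectiveState` names, and the control keys are all hypothetical. It assumes the input contains only active pause rows, and that a request without a workspace (for example a system-plane call) is only affected by global pauses, which is an assumption the spec would need to confirm.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class EffectiveState(Enum):
    ENABLED = "enabled"
    PAUSED_GLOBAL = "paused_global"
    PAUSED_WORKSPACE = "paused_workspace"


@dataclass(frozen=True)
class Activation:
    """One active pause row (hypothetical shape).

    workspace_id=None marks a global pause; otherwise the pause is
    scoped to that workspace.
    """
    control_key: str
    workspace_id: Optional[int] = None


def effective_state(
    active_rows: Iterable[Activation],
    control_key: str,
    workspace_id: Optional[int],
) -> EffectiveState:
    """Derive the effective state from active pause rows only."""
    matching = [row for row in active_rows if row.control_key == control_key]

    # A matching global pause always wins over any narrower row.
    if any(row.workspace_id is None for row in matching):
        return EffectiveState.PAUSED_GLOBAL

    # Assumption: a workspace pause applies only to requests for that workspace.
    if workspace_id is not None and any(
        row.workspace_id == workspace_id for row in matching
    ):
        return EffectiveState.PAUSED_WORKSPACE

    # No matching active pause: enabled is derived, never stored.
    return EffectiveState.ENABLED
```

Because `ENABLED` is the fall-through result, resuming a control in this model amounts to ending the pause row rather than writing a second kind of row.
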
## Decision 2: Use one platform-operated activation table instead of env flags or workspace settings

**Decision**: Introduce one platform-operated `operational_control_activations` table that can represent either a global pause or a workspace-scoped pause. Do not split truth across env flags, platform config, and workspace settings.

**Rationale**:

- The spec requires one auditable control truth across system and tenant surfaces.
- Existing workspace settings infrastructure is workspace-only and cannot represent one global platform-wide safety state cleanly.
- Env flags are invisible product truth and require deploy-time coordination.

**Evidence**:

- Existing workspace settings writer only manages workspace-scoped settings in `apps/platform/app/Services/Settings/SettingsWriter.php`.
- The current env gate lives in `apps/platform/config/tenantpilot.php` and is consumed directly in `ListFindings`.

**Alternatives considered**:

- Reuse workspace settings for workspace overrides and keep a global env flag.
  - Rejected: split truth, inconsistent audit semantics, and no single effective-state evaluator.
- Use env flags only.
  - Rejected: not operator-visible or auditable in-product.

## Decision 3: Evaluate controls at the start seam, not only in UI visibility

**Decision**: Integrate control evaluation at the concrete start seams that already own execution decisions: `FindingsLifecycleBackfillRunbookService::start()` for all findings lifecycle backfill callers, and queued restore execution before `OperationRun` or provider dispatch begins.

**Rationale**:

- UI-only hiding would fail the safety requirement because direct requests or stale page state could still start execution.
- The repo already has clear start seams where action or service logic decides whether a run begins.
- This keeps blocked-state truth server-side and shared.

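
To illustrate why the check belongs at the start seam, here is a minimal Python sketch of a service whose `start()` consults the control before any side effect. It is a model only: `BackfillStarter`, `StartResult`, the `control_paused` reason, and the control key are hypothetical and do not describe the actual `FindingsLifecycleBackfillRunbookService` API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class StartResult:
    started: bool
    reason: Optional[str] = None


class BackfillStarter:
    """Hypothetical stand-in for a start seam shared by every caller."""

    CONTROL_KEY = "findings.lifecycle_backfill"  # hypothetical control key

    def __init__(self, is_paused: Callable[[str, Optional[int]], bool]) -> None:
        self._is_paused = is_paused
        # Records dispatched work so the absence of side effects is observable.
        self.dispatched: List[Optional[int]] = []

    def start(self, workspace_id: Optional[int]) -> StartResult:
        # The control is evaluated first, server-side, for every caller
        # (UI action, console command, or deploy runbook alike).
        if self._is_paused(self.CONTROL_KEY, workspace_id):
            return StartResult(started=False, reason="control_paused")

        # Only reached when the control is effectively enabled.
        self.dispatched.append(workspace_id)
        return StartResult(started=True)
```

A caller that bypasses the UI still goes through `start()`, so a stale page or a direct request receives the same blocked result and nothing is dispatched:

```python
paused = {("findings.lifecycle_backfill", 7)}
starter = BackfillStarter(lambda key, ws: (key, ws) in paused)

blocked = starter.start(7)   # blocked, nothing dispatched
allowed = starter.start(8)   # allowed, one dispatch recorded
```
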
**Evidence**:

- Findings lifecycle backfill starts in `apps/platform/app/Services/Runbooks/FindingsLifecycleBackfillRunbookService.php` and is called from the system runbooks page, tenant findings page, `tenantpilot:findings:backfill-lifecycle`, and `tenantpilot:run-deploy-runbooks`.
- Restore execution starts in `apps/platform/app/Filament/Resources/RestoreRunResource.php` and already routes provider-backed starts through `apps/platform/app/Services/Providers/ProviderOperationStartGate.php`.

**Alternatives considered**:

- Hide or disable actions in UI only.
  - Rejected: violates the server-side enforcement requirement.

## Decision 4: Add one system ops controls page instead of surface-local toggles

**Decision**: Manage the first-slice controls from one dedicated system ops page under `/system/ops/controls`. Do not add per-page toggles or bury control changes inside each affected surface. The page shows effective-state summaries by default and exposes change history through on-demand audit links instead of creating a second history surface.

**Rationale**:

- Operators need one place to make the runtime-safety decision itself.
- Constitution `DECIDE-001` and the spec’s decision-role table require a primary decision surface for control management.
- A shared control center prevents drift between runbooks, findings, and restore surfaces.

**Evidence**:

- The repo already groups ops surfaces under `apps/platform/app/Filament/System/Pages/Ops/`.
- Existing runbooks and run viewers are already system-plane ops surfaces, so a sibling controls page fits the current information architecture.

**Alternatives considered**:

- Add a toggle to the runbooks page only.
  - Rejected: restore execution is not owned by that page and the control decision would stay fragmented.

## Decision 5: Break-glass does not bypass operational controls in v1

**Decision**: Break-glass sessions do not automatically bypass active operational controls in the first slice.

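
As a small illustration of this rule, the sketch below models a start check that receives the break-glass flag but deliberately ignores it. It is written in Python as a neutral model; `StartContext` and `may_start` are hypothetical names and do not correspond to `BreakGlassSession` or any other repository class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartContext:
    """Hypothetical inputs available at a start seam."""
    control_paused: bool
    break_glass_active: bool = False


def may_start(ctx: StartContext) -> bool:
    # break_glass_active is intentionally not consulted: in this model a
    # paused control blocks execution until it is explicitly resumed.
    return not ctx.control_paused
```

Under this model, `may_start(StartContext(control_paused=True, break_glass_active=True))` is still `False`; the only way to unblock execution is to resume the control.
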
**Rationale**:

- Operational controls are introduced as runtime-safety truth, not as optional UI friction.
- An implicit bypass would make incident behavior ambiguous and weaken auditability.
- The first slice stays safer by forcing an explicit resume action before execution.

**Evidence**:

- The system runbook page already has break-glass-aware reason requirements via `BreakGlassSession`, but operational controls are a distinct safety layer.

**Alternatives considered**:

- Let break-glass ignore controls.
  - Rejected: too risky for v1 and not required by current operator pain.

## Decision 6: Reuse existing audit and start-result helpers, but keep global audits platform-scoped

**Decision**: Keep workspace-targeted changes and blocked execution evidence with concrete workspace or tenant context on `WorkspaceAuditLogger` plus `AuditActionId`, but record global control changes and blocked system-plane all-tenant attempts through `AuditRecorder` directly so they stay platform-plane events without false workspace ownership. Include requested-scope metadata on those platform-plane blocked attempts. Keep blocked/allowed execution messaging on the existing operation/provider start-result helpers.

**Rationale**:

- Constitution `XCUT-001` requires reuse of existing shared interaction paths.
- The repo already has shared primitives for queued toasts, dedupe messaging, and audit summaries.
- This avoids a second language for blocked execution.
- `WorkspaceAuditLogger` requires a `Workspace`, while `AuditRecorder` already supports null workspace and null tenant for truthful system-plane events.

**Evidence**:

- Audit logging lives in `apps/platform/app/Services/Audit/WorkspaceAuditLogger.php`.
- Global system-plane audit support lives in `apps/platform/app/Services/Audit/AuditRecorder.php`.
- Canonical audit IDs live in `apps/platform/app/Support/Audit/AuditActionId.php`.
- Provider-backed start messaging already routes through `ProviderOperationStartResultPresenter` and `OperationUxPresenter`.

**Alternatives considered**:

- Emit page-local notifications and free-form audit action strings.
  - Rejected: immediate drift risk and weaker reviewability.

## Decision 7: Proof stays in Unit + Feature lanes only

**Decision**: Keep proof in focused unit and feature tests. Do not introduce browser tests or heavy-governance coverage for this first slice.

**Rationale**:

- The business truth is effective-state evaluation, audit recording, and blocked/no-side-effect execution.
- Browser coverage would mostly duplicate existing Filament modal behavior.
- Constitution `TEST-GOV-001` requires the narrowest proving lane mix.

**Evidence**:

- Existing system runbooks and restore features already have focused feature coverage patterns in the repo.
- The new logic is server-side and deterministic.

**Alternatives considered**:

- Add browser smoke for pause/resume flows.
  - Rejected: not needed to prove the core runtime-safety semantics of this slice.