Research — Operational Controls
Date: 2026-04-26
Spec: spec.md
This document captures design decisions and supporting rationale for the first operational-controls slice. All decisions are grounded in current repository truth and the TenantPilot Constitution.
Decision 1 — Persist only active pause records, derive the enabled state, and let global pauses win
Decision: Store only explicit active control activations that pause a control. Do not persist enabled rows or a broader multi-state lifecycle. The effective enabled state is derived from the absence of an active matching activation, and a matching global pause wins over a narrower workspace pause in v1.
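The derivation rule can be sketched in plain PHP. This is a minimal, framework-free illustration, not the repository's evaluator: `ActivePause`, `resolveBlockingPause()`, `isEnabled()`, and the control key `restore.execute` are hypothetical names, and the handling of requests without workspace scope is an assumption.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for one active pause record.
// A null workspaceId means the pause is global.
final class ActivePause
{
    public function __construct(
        public readonly string $controlKey,
        public readonly ?int $workspaceId,
    ) {
    }
}

/**
 * Returns the pause that blocks the control for the given scope,
 * or null when no active pause matches (the control is enabled).
 *
 * @param list<ActivePause> $activePauses
 */
function resolveBlockingPause(array $activePauses, string $controlKey, ?int $workspaceId): ?ActivePause
{
    $workspaceMatch = null;

    foreach ($activePauses as $pause) {
        if ($pause->controlKey !== $controlKey) {
            continue;
        }

        // A global pause wins immediately; a narrower row cannot reopen it.
        if ($pause->workspaceId === null) {
            return $pause;
        }

        // Assumption: a request without workspace scope is only matched
        // by a global pause, never by a workspace-scoped one.
        if ($workspaceId !== null && $pause->workspaceId === $workspaceId) {
            $workspaceMatch = $pause;
        }
    }

    return $workspaceMatch;
}

/** @param list<ActivePause> $activePauses */
function isEnabled(array $activePauses, string $controlKey, ?int $workspaceId): bool
{
    // "Enabled" is never stored; it is the absence of a matching pause.
    return resolveBlockingPause($activePauses, $controlKey, $workspaceId) === null;
}
```

With a single pause row for workspace 7, `isEnabled()` returns true for workspace 8 and false for workspace 7. Adding a global row for the same control makes it return false for every workspace.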
Rationale:
- The operator problem is safe runtime pause control, not a new workflow state machine.
- Constitution PERSIST-001 and STATE-001 favor the smallest persisted truth that changes behavior.
- Deriving enabled avoids importing a second layer of default-state maintenance.
- Global-first precedence is the safest bounded rule because a platform-wide incident pause must not be narrowed by a workspace-specific row in this first slice.
Evidence:
- The current code gap is an env-gated yes/no maintenance switch in apps/platform/app/Filament/Resources/FindingResource/Pages/ListFindings.php.
- The first slice only needs to answer one question at execution time: may this action start right now for this scope?
- The first slice does not support workspace-specific allow overrides, so no narrower row should reopen a globally paused control.
Alternatives considered:
- Persist both enabled and paused rows.
- Rejected: unnecessary state duplication; absence of an active pause already means enabled.
- Add a larger status family such as draft, scheduled, paused, forced, emergency.
- Rejected: too broad for current-release truth.
Decision 2 — Use one platform-operated activation table instead of env flags or workspace settings
Decision: Introduce one platform-operated operational_control_activations table that can represent either a global pause or a workspace-scoped pause. Do not split truth across env flags, platform config, and workspace settings.
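One way to picture a single table that holds both scopes is a row with a nullable workspace reference. The sketch below uses an in-memory stand-in rather than a real migration. Every column name, the one-row-per-scope rule, and the choice to drop a row on resume are assumptions made for illustration; the actual operational_control_activations schema may differ (for example, it might mark rows as ended instead of deleting them).

```php
<?php

declare(strict_types=1);

// Hypothetical row shape; column names are illustrative assumptions.
final class ActivationRow
{
    public function __construct(
        public readonly string $controlKey,
        public readonly ?int $workspaceId, // null = global pause
        public readonly string $reason,
        public readonly int $activatedByPlatformUserId,
        public readonly DateTimeImmutable $activatedAt,
    ) {
    }

    public function isGlobal(): bool
    {
        return $this->workspaceId === null;
    }
}

// In-memory stand-in for the single activation table.
final class ActivationStore
{
    /** @var array<string, ActivationRow> */
    private array $rows = [];

    public function pause(ActivationRow $row): void
    {
        // Assumption: one active row per (control, scope);
        // pausing again replaces the row instead of duplicating it.
        $this->rows[$this->scopeKey($row->controlKey, $row->workspaceId)] = $row;
    }

    public function resume(string $controlKey, ?int $workspaceId): void
    {
        // There is no "enabled" row to write; this sketch simply drops the pause.
        unset($this->rows[$this->scopeKey($controlKey, $workspaceId)]);
    }

    /** @return list<ActivationRow> */
    public function activeFor(string $controlKey): array
    {
        return array_values(array_filter(
            $this->rows,
            static fn (ActivationRow $row): bool => $row->controlKey === $controlKey,
        ));
    }

    private function scopeKey(string $controlKey, ?int $workspaceId): string
    {
        return $controlKey . '|' . ($workspaceId ?? 'global');
    }
}
```

Because global and workspace pauses share one row shape, a single query per control is enough to evaluate the effective state, which is the property the env-flag and workspace-settings alternatives cannot offer.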
Rationale:
- The spec requires one auditable control truth across system and tenant surfaces.
- Existing workspace settings infrastructure is workspace-only and cannot represent one global platform-wide safety state cleanly.
- Env flags are invisible product truth and require deploy-time coordination.
Evidence:
- Existing workspace settings writer only manages workspace-scoped settings in apps/platform/app/Services/Settings/SettingsWriter.php.
- The current env gate lives in apps/platform/config/tenantpilot.php and is consumed directly in ListFindings.
Alternatives considered:
- Reuse workspace settings for workspace overrides and keep a global env flag.
- Rejected: split truth, inconsistent audit semantics, and no single effective-state evaluator.
- Use env flags only.
- Rejected: not operator-visible or auditable in-product.
Decision 3 — Evaluate controls at the start seam, not only in UI visibility
Decision: Integrate control evaluation at the concrete start seams that already own execution decisions: FindingsLifecycleBackfillRunbookService::start() for all findings lifecycle backfill callers, and queued restore execution before OperationRun or provider dispatch begins.
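The start-seam idea can be sketched as a service whose start() method consults the control before any side effect. `BackfillStartService`, `StartResult`, the injected closure, and the control key are hypothetical stand-ins; the real FindingsLifecycleBackfillRunbookService::start() signature is not shown in this document and will differ.

```php
<?php

declare(strict_types=1);

// Hypothetical result object for a start attempt.
final class StartResult
{
    private function __construct(
        public readonly bool $started,
        public readonly ?string $blockedReason,
    ) {
    }

    public static function started(): self
    {
        return new self(true, null);
    }

    public static function blocked(string $reason): self
    {
        return new self(false, $reason);
    }
}

// Hypothetical stand-in for a service that owns the start decision.
final class BackfillStartService
{
    /** @var list<int|null> Scopes for which a run was actually created. */
    public array $createdRuns = [];

    /** @param Closure(string, ?int): bool $isPaused answers "is this control paused for this scope?" */
    public function __construct(private readonly Closure $isPaused)
    {
    }

    public function start(?int $workspaceId): StartResult
    {
        // The check lives inside start(), so UI actions, console commands,
        // and deploy runbooks all pass through the same decision.
        if (($this->isPaused)('findings.lifecycle_backfill', $workspaceId)) {
            return StartResult::blocked('Control is paused for this scope.');
        }

        // Side effects (run record, job dispatch) happen only after the gate passes.
        $this->createdRuns[] = $workspaceId;

        return StartResult::started();
    }
}
```

A caller that bypasses the UI, such as a console command or a stale page submitting a direct request, still reaches start() and therefore still gets the blocked result without a run being created.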
Rationale:
- UI-only hiding would fail the safety requirement because direct requests or stale page state could still start execution.
- The repo already has clear start seams where action or service logic decides whether a run begins.
- This keeps blocked-state truth server-side and shared.
Evidence:
- Findings lifecycle backfill starts in apps/platform/app/Services/Runbooks/FindingsLifecycleBackfillRunbookService.php and is called from the system runbooks page, tenant findings page, tenantpilot:findings:backfill-lifecycle, and tenantpilot:run-deploy-runbooks.
- Restore execution starts in apps/platform/app/Filament/Resources/RestoreRunResource.php and already routes provider-backed starts through apps/platform/app/Services/Providers/ProviderOperationStartGate.php.
Alternatives considered:
- Hide or disable actions in UI only.
- Rejected: violates the server-side enforcement requirement.
Decision 4 — Add one system ops controls page instead of surface-local toggles
Decision: Manage the first-slice controls from one dedicated system ops page under /system/ops/controls. Do not add per-page toggles or bury control changes inside each affected surface. The page shows effective-state summaries by default and exposes change history through on-demand audit links instead of creating a second history surface.
Rationale:
- Operators need one place to make the runtime-safety decision itself.
- Constitution DECIDE-001 and the spec’s decision-role table require a primary decision surface for control management.
- A shared control center prevents drift between runbooks, findings, and restore surfaces.
Evidence:
- The repo already groups ops surfaces under apps/platform/app/Filament/System/Pages/Ops/.
- Existing runbooks and run viewers are already system-plane ops surfaces, so a sibling controls page fits the current information architecture.
Alternatives considered:
- Add a toggle to the runbooks page only.
- Rejected: restore execution is not owned by that page and the control decision would stay fragmented.
Decision 5 — Break-glass does not bypass operational controls in v1
Decision: Break-glass sessions do not automatically bypass active operational controls in the first slice.
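A minimal sketch of this rule, assuming a hypothetical `StartContext` that carries the break-glass flag and a hypothetical `PausedControls` evaluator (neither is the repository's BreakGlassSession API): the flag travels with the request but is never consulted when deciding whether execution may start.

```php
<?php

declare(strict_types=1);

// Hypothetical request context; the break-glass flag is carried along
// (for example for audit metadata) but does not influence the decision.
final class StartContext
{
    public function __construct(
        public readonly string $controlKey,
        public readonly bool $breakGlassActive,
    ) {
    }
}

// Hypothetical evaluator holding the currently paused control keys.
final class PausedControls
{
    /** @var array<string, true> */
    private array $paused = [];

    public function pause(string $controlKey): void
    {
        $this->paused[$controlKey] = true;
    }

    public function resume(string $controlKey): void
    {
        unset($this->paused[$controlKey]);
    }

    public function mayStart(StartContext $context): bool
    {
        // Deliberately no "|| $context->breakGlassActive": break-glass is not a bypass.
        return !isset($this->paused[$context->controlKey]);
    }
}
```

In this sketch the only way to unblock a paused control is an explicit resume() call, which mirrors the decision that an operator must resume the control before execution.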
Rationale:
- Operational controls are introduced as runtime-safety truth, not as optional UI friction.
- An implicit bypass would make incident behavior ambiguous and weaken auditability.
- The first slice stays safer by forcing an explicit resume action before execution.
Evidence:
- The system runbook page already has break-glass-aware reason requirements via BreakGlassSession, but operational controls are a distinct safety layer.
Alternatives considered:
- Let break-glass ignore controls.
- Rejected: too risky for v1 and not required by current operator pain.
Decision 6 — Reuse existing audit and start-result helpers, but keep global audits platform-scoped
Decision: Keep workspace-targeted changes and blocked execution evidence with concrete workspace or tenant context on WorkspaceAuditLogger plus AuditActionId, but record global control changes and blocked system-plane all-tenant attempts through AuditRecorder directly so they stay platform-plane events without false workspace ownership. Include requested-scope metadata on those platform-plane blocked attempts. Keep blocked/allowed execution messaging on the existing operation/provider start-result helpers.
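The routing split can be sketched with two interchangeable sinks. `AuditSink`, `InMemorySink`, `ControlAuditRouter`, the action ID string, and the metadata keys are hypothetical; the real WorkspaceAuditLogger, AuditRecorder, and AuditActionId signatures are not shown in this document and are not reproduced here.

```php
<?php

declare(strict_types=1);

// Hypothetical common shape for both audit paths.
interface AuditSink
{
    /** @param array<string, mixed> $metadata */
    public function record(string $actionId, ?int $workspaceId, array $metadata): void;
}

// In-memory sink so the routing can be observed without a database.
final class InMemorySink implements AuditSink
{
    /** @var list<array{action: string, workspace: ?int, metadata: array<string, mixed>}> */
    public array $events = [];

    public function record(string $actionId, ?int $workspaceId, array $metadata): void
    {
        $this->events[] = ['action' => $actionId, 'workspace' => $workspaceId, 'metadata' => $metadata];
    }
}

final class ControlAuditRouter
{
    // Illustrative action ID, not a value from AuditActionId.
    private const BLOCKED = 'operational_control.execution_blocked';

    public function __construct(
        private readonly AuditSink $workspaceSink, // plays the role of WorkspaceAuditLogger
        private readonly AuditSink $platformSink,  // plays the role of AuditRecorder
    ) {
    }

    public function blockedAttempt(string $controlKey, ?int $workspaceId, string $requestedScope): void
    {
        if ($workspaceId !== null) {
            // Concrete workspace context: the event belongs to that workspace.
            $this->workspaceSink->record(self::BLOCKED, $workspaceId, ['control' => $controlKey]);

            return;
        }

        // No workspace: keep the event platform-scoped and attach the requested
        // scope instead of inventing a workspace owner.
        $this->platformSink->record(self::BLOCKED, null, [
            'control' => $controlKey,
            'requested_scope' => $requestedScope,
        ]);
    }
}
```

Routing on the presence of a workspace keeps a blocked all-tenant attempt out of any single workspace's audit trail while still recording which scope was requested.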
Rationale:
- Constitution XCUT-001 requires reuse of existing shared interaction paths.
- The repo already has shared primitives for queued toasts, dedupe messaging, and audit summaries.
- This avoids a second language for blocked execution.
- WorkspaceAuditLogger requires a Workspace, while AuditRecorder already supports null workspace and null tenant for truthful system-plane events.
Evidence:
- Audit logging lives in apps/platform/app/Services/Audit/WorkspaceAuditLogger.php.
- Global system-plane audit support lives in apps/platform/app/Services/Audit/AuditRecorder.php.
- Canonical audit IDs live in apps/platform/app/Support/Audit/AuditActionId.php.
- Provider-backed start messaging already routes through ProviderOperationStartResultPresenter and OperationUxPresenter.
Alternatives considered:
- Emit page-local notifications and free-form audit action strings.
- Rejected: immediate drift risk and weaker reviewability.
Decision 7 — Proof stays in Unit + Feature lanes only
Decision: Keep proof in focused unit and feature tests. Do not introduce browser tests or heavy-governance coverage for this first slice.
Rationale:
- The business truth is effective-state evaluation, audit recording, and blocked/no-side-effect execution.
- Browser coverage would mostly duplicate existing Filament modal behavior.
- Constitution TEST-GOV-001 requires the narrowest proving lane mix.
Evidence:
- Existing system runbooks and restore features already have focused feature coverage patterns in the repo.
- The new logic is server-side and deterministic.
Alternatives considered:
- Add browser smoke for pause/resume flows.
- Rejected: not needed to prove the core runtime-safety semantics of this slice.