
# Research — Operational Controls
**Date**: 2026-04-26

**Spec**: [spec.md](spec.md)

This document captures design decisions and supporting rationale for the first operational-controls slice. All decisions are grounded in current repository truth and the TenantPilot Constitution.
## Decision 1 — Persist only active pause records, derive the enabled state, and let global pauses win
**Decision**: Store only explicit active control activations that pause a control. Do not persist `enabled` rows or a broader multi-state lifecycle. The effective `enabled` state is derived from the absence of an active matching activation, and a matching global pause wins over a narrower workspace pause in v1.
**Rationale**:
- The operator problem is safe runtime pause control, not a new workflow state machine.
- Constitution `PERSIST-001` and `STATE-001` favor the smallest persisted truth that changes behavior.
- Deriving `enabled` avoids introducing a second layer of default-state maintenance.
- Global-first precedence is the safest bounded rule because a platform-wide incident pause must not be narrowed by a workspace-specific row in this first slice.
**Evidence**:
- The current code gap is an env-gated yes/no maintenance switch in `apps/platform/app/Filament/Resources/FindingResource/Pages/ListFindings.php`.
- The first slice only needs to answer one question at execution time: may this action start right now for this scope?
- The first slice does not support workspace-specific allow overrides, so no narrower row should reopen a globally paused control.
**Alternatives considered**:
- Persist both `enabled` and `paused` rows.
  - Rejected: unnecessary state duplication; absence of an active pause already means enabled.
- Add a larger status family such as draft, scheduled, paused, forced, emergency.
  - Rejected: too broad for current-release truth.
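
The precedence rule in this decision can be illustrated with a minimal, language-neutral sketch. It is written in Python for brevity and is not the repository's PHP implementation; the `Activation` shape, the `effective_state` function, and the control keys `restore.execute` and `findings.backfill` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Activation:
    """One active pause row (hypothetical shape). workspace_id None means global."""
    control_key: str
    workspace_id: Optional[int] = None


def effective_state(activations, control_key, workspace_id):
    """Return ("paused", scope) or ("enabled", None) for one control and workspace."""
    matching = [a for a in activations if a.control_key == control_key]
    # Global pauses are checked first, so a narrower row can never reopen them.
    if any(a.workspace_id is None for a in matching):
        return ("paused", "global")
    if any(a.workspace_id == workspace_id for a in matching):
        return ("paused", "workspace")
    # No active matching activation: "enabled" is derived, never stored.
    return ("enabled", None)


rows = [
    Activation("restore.execute"),                    # global pause
    Activation("findings.backfill", workspace_id=7),  # workspace-scoped pause
]
print(effective_state(rows, "restore.execute", 7))
print(effective_state(rows, "findings.backfill", 7))
print(effective_state(rows, "findings.backfill", 8))
```

Under these assumptions, workspace 8 sees `findings.backfill` as enabled purely because no matching row exists, which is the "derive, do not persist" property the decision relies on.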
## Decision 2 — Use one platform-operated activation table instead of env flags or workspace settings
**Decision**: Introduce one platform-operated `operational_control_activations` table that can represent either a global pause or a workspace-scoped pause. Do not split truth across env flags, platform config, and workspace settings.
**Rationale**:
- The spec requires one auditable control truth across system and tenant surfaces.
- Existing workspace settings infrastructure is workspace-only and cannot represent one global platform-wide safety state cleanly.
- Env flags are invisible product truth and require deploy-time coordination.
**Evidence**:
- Existing workspace settings writer only manages workspace-scoped settings in `apps/platform/app/Services/Settings/SettingsWriter.php`.
- The current env gate lives in `apps/platform/config/tenantpilot.php` and is consumed directly in `ListFindings`.
**Alternatives considered**:
- Reuse workspace settings for workspace overrides and keep a global env flag.
  - Rejected: split truth, inconsistent audit semantics, and no single effective-state evaluator.
- Use env flags only.
  - Rejected: not operator-visible or auditable in-product.
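
To make the single-table idea concrete, here is a rough sketch using Python's built-in `sqlite3` module. Only the table name `operational_control_activations` comes from this document. The column names, the convention that a null `workspace_id` marks a global pause, the control keys, and the `active_pauses` helper are assumptions for illustration and may not match the real migration. How a resume is modeled is left out of the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE operational_control_activations (
        id INTEGER PRIMARY KEY,
        control_key TEXT NOT NULL,
        workspace_id INTEGER,          -- NULL marks a global pause (assumed convention)
        reason TEXT NOT NULL,
        activated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)
conn.executemany(
    "INSERT INTO operational_control_activations "
    "(control_key, workspace_id, reason) VALUES (?, ?, ?)",
    [
        ("restore.execute", None, "platform incident"),
        ("findings.backfill", 7, "workspace maintenance"),
    ],
)


def active_pauses(control_key, workspace_id):
    """Return the scopes of rows that pause this control for this workspace."""
    rows = conn.execute(
        "SELECT workspace_id FROM operational_control_activations "
        "WHERE control_key = ? AND (workspace_id IS NULL OR workspace_id = ?) "
        "ORDER BY workspace_id IS NOT NULL",  # global rows sort first
        (control_key, workspace_id),
    ).fetchall()
    return ["global" if ws is None else "workspace" for (ws,) in rows]


print(active_pauses("restore.execute", 7))
print(active_pauses("findings.backfill", 8))
```

One query against one table answers both the global and the workspace question, which is the property that env flags plus workspace settings could not provide.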
## Decision 3 — Evaluate controls at the start seam, not only in UI visibility
**Decision**: Integrate control evaluation at the concrete start seams that already own execution decisions: `FindingsLifecycleBackfillRunbookService::start()` for all findings lifecycle backfill callers, and queued restore execution before `OperationRun` or provider dispatch begins.
**Rationale**:
- UI-only hiding would fail the safety requirement because direct requests or stale page state could still start execution.
- The repo already has clear start seams where action or service logic decides whether a run begins.
- This keeps blocked-state truth server-side and shared.
**Evidence**:
- Findings lifecycle backfill starts in `apps/platform/app/Services/Runbooks/FindingsLifecycleBackfillRunbookService.php` and is called from the system runbooks page, tenant findings page, `tenantpilot:findings:backfill-lifecycle`, and `tenantpilot:run-deploy-runbooks`.
- Restore execution starts in `apps/platform/app/Filament/Resources/RestoreRunResource.php` and already routes provider-backed starts through `apps/platform/app/Services/Providers/ProviderOperationStartGate.php`.
**Alternatives considered**:
- Hide or disable actions in UI only.
  - Rejected: violates the server-side enforcement requirement.
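
The start-seam idea can be sketched as follows. This is an illustrative Python stand-in, not the actual `FindingsLifecycleBackfillRunbookService`; the `BackfillService` class, the `ControlPaused` exception, the `is_paused` callable, and the control key are hypothetical.

```python
class ControlPaused(Exception):
    """Raised when a paused control blocks a start attempt (hypothetical)."""


class BackfillService:
    """Stand-in for a service whose start() owns the execution decision."""

    def __init__(self, is_paused):
        # is_paused: callable(control_key, workspace_id) -> bool
        self._is_paused = is_paused
        self.started_runs = []

    def start(self, workspace_id):
        # The check lives inside start(), so pages, commands, and direct
        # requests all pass through it regardless of what the UI showed.
        if self._is_paused("findings.backfill", workspace_id):
            raise ControlPaused("findings.backfill is paused")
        run = {"workspace_id": workspace_id, "status": "queued"}
        self.started_runs.append(run)
        return run


paused = {("findings.backfill", 7)}
service = BackfillService(lambda key, ws: (key, ws) in paused)

service.start(workspace_id=8)       # allowed: no matching pause
try:
    service.start(workspace_id=7)   # blocked before any run is created
except ControlPaused as exc:
    blocked = str(exc)
```

Because the blocked call raises before a run record is appended, a stale page or a direct request cannot produce side effects while the control is paused.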
## Decision 4 — Add one system ops controls page instead of surface-local toggles
**Decision**: Manage the first-slice controls from one dedicated system ops page under `/system/ops/controls`. Do not add per-page toggles or bury control changes inside each affected surface. The page shows effective-state summaries by default and exposes change history through on-demand audit links instead of creating a second history surface.
**Rationale**:
- Operators need one place to make the runtime-safety decision itself.
- Constitution `DECIDE-001` and the spec's decision-role table require a primary decision surface for control management.
- A shared control center prevents drift between runbooks, findings, and restore surfaces.
**Evidence**:
- The repo already groups ops surfaces under `apps/platform/app/Filament/System/Pages/Ops/`.
- Existing runbooks and run viewers are already system-plane ops surfaces, so a sibling controls page fits the current information architecture.
**Alternatives considered**:
- Add a toggle to the runbooks page only.
  - Rejected: restore execution is not owned by that page and the control decision would stay fragmented.
## Decision 5 — Break-glass does not bypass operational controls in v1
**Decision**: Break-glass sessions do not automatically bypass active operational controls in the first slice.
**Rationale**:
- Operational controls are introduced as runtime-safety truth, not as optional UI friction.
- An implicit bypass would make incident behavior ambiguous and weaken auditability.
- The first slice stays safer by forcing an explicit resume action before execution.
**Evidence**:
- The system runbook page already has break-glass-aware reason requirements via `BreakGlassSession`, but operational controls are a distinct safety layer.
**Alternatives considered**:
- Let break-glass ignore controls.
  - Rejected: too risky for v1 and not required by current operator pain.
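
As a small illustration of this rule, the sketch below accepts a break-glass flag but never lets it influence the outcome; only an explicit resume reopens the control. The names `may_start`, `resume`, `paused_controls`, and the control key are hypothetical, and the code is a Python stand-in rather than the repository's implementation.

```python
paused_controls = {"restore.execute"}


def may_start(control_key, break_glass_active=False):
    """Return True when execution may start.

    break_glass_active is accepted only to make the v1 rule explicit:
    it is deliberately ignored.
    """
    return control_key not in paused_controls


def resume(control_key):
    """Explicit operator action that lifts the pause."""
    paused_controls.discard(control_key)


blocked_with_break_glass = may_start("restore.execute", break_glass_active=True)
resume("restore.execute")
allowed_after_resume = may_start("restore.execute")
print(blocked_with_break_glass, allowed_after_resume)
```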
## Decision 6 — Reuse existing audit and start-result helpers, but keep global audits platform-scoped
**Decision**: Record workspace-targeted changes, and blocked-execution evidence that has concrete workspace or tenant context, through `WorkspaceAuditLogger` with `AuditActionId`. Record global control changes and blocked system-plane all-tenant attempts directly through `AuditRecorder`, so they remain platform-plane events without false workspace ownership, and include requested-scope metadata on those platform-plane blocked attempts. Keep blocked/allowed execution messaging on the existing operation/provider start-result helpers.
**Rationale**:
- Constitution `XCUT-001` requires reuse of existing shared interaction paths.
- The repo already has shared primitives for queued toasts, dedupe messaging, and audit summaries.
- This avoids a second language for blocked execution.
- `WorkspaceAuditLogger` requires a `Workspace`, while `AuditRecorder` already supports null workspace and null tenant for truthful system-plane events.
**Evidence**:
- Audit logging lives in `apps/platform/app/Services/Audit/WorkspaceAuditLogger.php`.
- Global system-plane audit support lives in `apps/platform/app/Services/Audit/AuditRecorder.php`.
- Canonical audit IDs live in `apps/platform/app/Support/Audit/AuditActionId.php`.
- Provider-backed start messaging already routes through `ProviderOperationStartResultPresenter` and `OperationUxPresenter`.
**Alternatives considered**:
- Emit page-local notifications and free-form audit action strings.
  - Rejected: immediate drift risk and weaker reviewability.
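
The routing split described in this decision might look roughly like the sketch below. It is a Python illustration only: the `AuditAction` members, the `AuditEvent` shape, the `route_audit` function, and the channel labels are hypothetical, and it assumes that tenant context always arrives together with a workspace.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class AuditAction(Enum):
    """Stand-in for canonical audit IDs (member names are hypothetical)."""
    CONTROL_PAUSED = "operational_control.paused"
    EXECUTION_BLOCKED = "operational_control.execution_blocked"


@dataclass(frozen=True)
class AuditEvent:
    action: AuditAction
    workspace_id: Optional[int] = None
    tenant_id: Optional[int] = None
    metadata: dict = field(default_factory=dict)


def route_audit(action, workspace_id=None, tenant_id=None, requested_scope=None):
    """Return (channel, event) for one control change or blocked attempt."""
    if workspace_id is not None:
        # Concrete workspace context: the event is workspace-owned.
        return ("workspace_audit_logger", AuditEvent(action, workspace_id, tenant_id))
    # Global change or system-plane attempt: no workspace is attached, and
    # the scope that was requested is carried as metadata instead.
    metadata = {"requested_scope": requested_scope} if requested_scope else {}
    return ("audit_recorder", AuditEvent(action, metadata=metadata))


channel_ws, event_ws = route_audit(AuditAction.CONTROL_PAUSED, workspace_id=7)
channel_global, event_global = route_audit(
    AuditAction.EXECUTION_BLOCKED, requested_scope="all_tenants"
)
print(channel_ws, channel_global, event_global.metadata)
```

Using an enumeration for actions mirrors the intent of rejecting free-form audit strings: every event names a known, reviewable action.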
## Decision 7 — Proof stays in Unit + Feature lanes only
**Decision**: Keep proof in focused unit and feature tests. Do not introduce browser tests or heavy-governance coverage for this first slice.
**Rationale**:
- The business truth is effective-state evaluation, audit recording, and blocked/no-side-effect execution.
- Browser coverage would mostly duplicate existing Filament modal behavior.
- Constitution `TEST-GOV-001` requires the narrowest proving lane mix.
**Evidence**:
- Existing system runbooks and restore features already have focused feature coverage patterns in the repo.
- The new logic is server-side and deterministic.
**Alternatives considered**:
- Add browser smoke for pause/resume flows.
  - Rejected: not needed to prove the core runtime-safety semantics of this slice.