# Research — Operational Controls

**Date**: 2026-04-26
**Spec**: [spec.md](spec.md)

This document captures design decisions and supporting rationale for the first operational-controls slice. All decisions are grounded in current repository truth and the TenantPilot Constitution.

## Decision 1 — Persist only active pause records, derive the enabled state, and let global pauses win

**Decision**: Store only explicit active control activations that pause a control. Do not persist `enabled` rows or a broader multi-state lifecycle. The effective `enabled` state is derived from the absence of an active matching activation, and a matching global pause wins over a narrower workspace pause in v1.

**Rationale**:
- The operator problem is safe runtime pause control, not a new workflow state machine.
- Constitution `PERSIST-001` and `STATE-001` favor the smallest persisted truth that changes behavior.
- Deriving `enabled` avoids importing a second layer of default-state maintenance.
- Global-first precedence is the safest bounded rule because a platform-wide incident pause must not be narrowed by a workspace-specific row in this first slice.

**Evidence**:
- The current code gap is an env-gated yes/no maintenance switch in `apps/platform/app/Filament/Resources/FindingResource/Pages/ListFindings.php`.
- The first slice only needs to answer one question at execution time: may this action start right now for this scope?
- The first slice does not support workspace-specific allow overrides, so no narrower row should reopen a globally paused control.

**Alternatives considered**:
- Persist both `enabled` and `paused` rows.
  - Rejected: unnecessary state duplication; absence of an active pause already means enabled.
- Add a larger status family such as draft, scheduled, paused, forced, emergency.
  - Rejected: too broad for current-release truth.

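The derivation rule above can be sketched in a few lines. This is a language-neutral illustration in Python, not the repository's PHP implementation; `Activation`, `effective_state`, the field names, and the control keys are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Activation:
    """One persisted active pause. Field names are assumptions, not the real schema."""
    control_key: str
    workspace_id: Optional[int]  # None marks a global (platform-wide) pause


def effective_state(
    control_key: str,
    workspace_id: Optional[int],
    activations: Iterable[Activation],
) -> str:
    """Derive the state from active pause rows only; 'enabled' is never stored."""
    matching = [a for a in activations if a.control_key == control_key]
    # A global pause wins over any narrower workspace row.
    if any(a.workspace_id is None for a in matching):
        return "paused:global"
    if workspace_id is not None and any(
        a.workspace_id == workspace_id for a in matching
    ):
        return "paused:workspace"
    # No matching active pause means the control is enabled.
    return "enabled"
```

Because only pauses are stored, an empty activation list means every control is enabled, and no default rows need to be seeded or maintained.
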
## Decision 2 — Use one platform-operated activation table instead of env flags or workspace settings

**Decision**: Introduce one platform-operated `operational_control_activations` table that can represent either a global pause or a workspace-scoped pause. Do not split truth across env flags, platform config, and workspace settings.

**Rationale**:
- The spec requires one auditable control truth across system and tenant surfaces.
- Existing workspace settings infrastructure is workspace-only and cannot represent one global platform-wide safety state cleanly.
- Env flags are invisible product truth and require deploy-time coordination.

**Evidence**:
- Existing workspace settings writer only manages workspace-scoped settings in `apps/platform/app/Services/Settings/SettingsWriter.php`.
- The current env gate lives in `apps/platform/config/tenantpilot.php` and is consumed directly in `ListFindings`.

**Alternatives considered**:
- Reuse workspace settings for workspace overrides and keep a global env flag.
  - Rejected: split truth, inconsistent audit semantics, and no single effective-state evaluator.
- Use env flags only.
  - Rejected: not operator-visible or auditable in-product.

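To illustrate how a single table can hold both scopes, here is a minimal sketch using Python's built-in `sqlite3` module. Only the table name `operational_control_activations` comes from this document; every column, the `pause` and `matching_pauses` helpers, and the convention that a `NULL` workspace means "global" are assumptions made for the example, not the actual migration.

```python
import sqlite3

# Hypothetical columns; only the table name is taken from the decision above.
SCHEMA = """
CREATE TABLE operational_control_activations (
    id           INTEGER PRIMARY KEY,
    control_key  TEXT NOT NULL,
    workspace_id INTEGER,            -- NULL is assumed to mean a global pause
    reason       TEXT NOT NULL,
    paused_by    TEXT NOT NULL,
    created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def pause(conn, control_key, reason, paused_by, workspace_id=None):
    """Persist one active pause; omit workspace_id for a global pause."""
    conn.execute(
        "INSERT INTO operational_control_activations "
        "(control_key, workspace_id, reason, paused_by) VALUES (?, ?, ?, ?)",
        (control_key, workspace_id, reason, paused_by),
    )


def matching_pauses(conn, control_key, workspace_id):
    """Return global and workspace pauses for one scope, global rows first."""
    return conn.execute(
        "SELECT workspace_id, reason FROM operational_control_activations "
        "WHERE control_key = ? AND (workspace_id IS NULL OR workspace_id = ?) "
        "ORDER BY workspace_id IS NOT NULL",
        (control_key, workspace_id),
    ).fetchall()
```

One query answers the question for any scope, which is the property that env flags plus workspace settings could not offer without a second lookup path.
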
## Decision 3 — Evaluate controls at the start seam, not only in UI visibility

**Decision**: Integrate control evaluation at the concrete start seams that already own execution decisions: `FindingsLifecycleBackfillRunbookService::start()` for all findings lifecycle backfill callers, and queued restore execution before `OperationRun` or provider dispatch begins.

**Rationale**:
- UI-only hiding would fail the safety requirement because direct requests or stale page state could still start execution.
- The repo already has clear start seams where action or service logic decides whether a run begins.
- This keeps blocked-state truth server-side and shared.

**Evidence**:
- Findings lifecycle backfill starts in `apps/platform/app/Services/Runbooks/FindingsLifecycleBackfillRunbookService.php` and is called from the system runbooks page, tenant findings page, `tenantpilot:findings:backfill-lifecycle`, and `tenantpilot:run-deploy-runbooks`.
- Restore execution starts in `apps/platform/app/Filament/Resources/RestoreRunResource.php` and already routes provider-backed starts through `apps/platform/app/Services/Providers/ProviderOperationStartGate.php`.

**Alternatives considered**:
- Hide or disable actions in UI only.
  - Rejected: violates the server-side enforcement requirement.

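The start-seam idea can be shown with a small sketch: every caller goes through one `start()` method, so the pause check cannot be skipped by a stale page or a direct command. This is Python for illustration only; `BackfillRunbookService`, `ControlPaused`, `ui_action`, and `cli_command` are hypothetical names and do not mirror the real PHP service signature.

```python
class ControlPaused(Exception):
    """Raised when a start is refused because the control is paused."""


class BackfillRunbookService:
    """Stand-in for a service that owns the decision to begin a run."""

    def __init__(self, is_paused):
        # is_paused: callable(workspace_id) -> bool, e.g. an effective-state evaluator.
        self._is_paused = is_paused
        self.started_runs = []

    def start(self, workspace_id, initiator):
        # The check lives at the seam, before any side effect happens.
        if self._is_paused(workspace_id):
            raise ControlPaused(f"start blocked for workspace {workspace_id}")
        self.started_runs.append((workspace_id, initiator))
        return len(self.started_runs)


def ui_action(service, workspace_id):
    """A page action: it may hide its button, but enforcement is in start()."""
    return service.start(workspace_id, "ui")


def cli_command(service, workspace_id):
    """A console command: no UI involved, same seam, same check."""
    return service.start(workspace_id, "cli")
```

Both callers are blocked identically while the control is paused, and no run is recorded for the refused attempts.
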
## Decision 4 — Add one system ops controls page instead of surface-local toggles

**Decision**: Manage the first-slice controls from one dedicated system ops page under `/system/ops/controls`. Do not add per-page toggles or bury control changes inside each affected surface. The page shows effective-state summaries by default and exposes change history through on-demand audit links instead of creating a second history surface.

**Rationale**:
- Operators need one place to make the runtime-safety decision itself.
- Constitution `DECIDE-001` and the spec’s decision-role table require a primary decision surface for control management.
- A shared control center prevents drift between runbooks, findings, and restore surfaces.

**Evidence**:
- The repo already groups ops surfaces under `apps/platform/app/Filament/System/Pages/Ops/`.
- Existing runbooks and run viewers are already system-plane ops surfaces, so a sibling controls page fits the current information architecture.

**Alternatives considered**:
- Add a toggle to the runbooks page only.
  - Rejected: restore execution is not owned by that page and the control decision would stay fragmented.

## Decision 5 — Break-glass does not bypass operational controls in v1

**Decision**: Break-glass sessions do not automatically bypass active operational controls in the first slice.

**Rationale**:
- Operational controls are introduced as runtime-safety truth, not as optional UI friction.
- An implicit bypass would make incident behavior ambiguous and weaken auditability.
- The first slice stays safer by forcing an explicit resume action before execution.

**Evidence**:
- The system runbook page already has break-glass-aware reason requirements via `BreakGlassSession`, but operational controls are a distinct safety layer.

**Alternatives considered**:
- Let break-glass ignore controls.
  - Rejected: too risky for v1 and not required by current operator pain.

## Decision 6 — Reuse existing audit and start-result helpers, but keep global audits platform-scoped

**Decision**: Keep workspace-targeted changes and blocked execution evidence with concrete workspace or tenant context on `WorkspaceAuditLogger` plus `AuditActionId`, but record global control changes and blocked system-plane all-tenant attempts through `AuditRecorder` directly so they stay platform-plane events without false workspace ownership. Include requested-scope metadata on those platform-plane blocked attempts. Keep blocked/allowed execution messaging on the existing operation/provider start-result helpers.

**Rationale**:
- Constitution `XCUT-001` requires reuse of existing shared interaction paths.
- The repo already has shared primitives for queued toasts, dedupe messaging, and audit summaries.
- This avoids a second language for blocked execution.
- `WorkspaceAuditLogger` requires a `Workspace`, while `AuditRecorder` already supports null workspace and null tenant for truthful system-plane events.

**Evidence**:
- Audit logging lives in `apps/platform/app/Services/Audit/WorkspaceAuditLogger.php`.
- Global system-plane audit support lives in `apps/platform/app/Services/Audit/AuditRecorder.php`.
- Canonical audit IDs live in `apps/platform/app/Support/Audit/AuditActionId.php`.
- Provider-backed start messaging already routes through `ProviderOperationStartResultPresenter` and `OperationUxPresenter`.

**Alternatives considered**:
- Emit page-local notifications and free-form audit action strings.
  - Rejected: immediate drift risk and weaker reviewability.

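The routing rule in this decision can be sketched as a single branch on whether a concrete workspace exists. The Python below is illustrative only: the two lists stand in for `WorkspaceAuditLogger` and `AuditRecorder`, whose real PHP APIs are not shown in this document, and `record_control_event`, the event fields, and the action strings are hypothetical.

```python
def record_control_event(
    action_id,
    workspace_id,
    workspace_log,
    platform_log,
    requested_scope=None,
):
    """Route an audit event to the workspace or platform plane.

    workspace_log stands in for a workspace-bound audit logger and
    platform_log for a recorder that accepts null workspace and tenant.
    Returns the plane that received the event.
    """
    if workspace_id is not None:
        # Concrete workspace context: the event is owned by that workspace.
        workspace_log.append({"action": action_id, "workspace_id": workspace_id})
        return "workspace"
    # Global change or all-tenant attempt: no false workspace ownership,
    # but keep what the caller asked for as metadata.
    platform_log.append(
        {
            "action": action_id,
            "workspace_id": None,
            "tenant_id": None,
            "requested_scope": requested_scope,
        }
    )
    return "platform"
```

Keeping the branch in one helper means a global pause can never be logged under an arbitrary workspace just to satisfy a logger's required argument.
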
## Decision 7 — Proof stays in Unit + Feature lanes only

**Decision**: Keep proof in focused unit and feature tests. Do not introduce browser tests or heavy-governance coverage for this first slice.

**Rationale**:
- The business truth is effective-state evaluation, audit recording, and blocked/no-side-effect execution.
- Browser coverage would mostly duplicate existing Filament modal behavior.
- Constitution `TEST-GOV-001` requires the narrowest proving lane mix.

**Evidence**:
- Existing system runbooks and restore features already have focused feature coverage patterns in the repo.
- The new logic is server-side and deterministic.

**Alternatives considered**:
- Add browser smoke for pause/resume flows.
  - Rejected: not needed to prove the core runtime-safety semantics of this slice.