Remove Findings lifecycle backfill operational surface (controls slice) (#280)

Removes the Findings lifecycle backfill from the Operational Controls UI and OperationalControlCatalog.

This patch is a safe, controls-only change; runbooks, jobs and other runtime artifacts are NOT removed yet. Follow-up work will delete the runbook service/scope, jobs, commands, and update tests.

Files changed:
- apps/platform/app/Filament/System/Pages/Ops/Controls.php
- apps/platform/app/Support/OperationalControls/OperationalControlCatalog.php
- apps/platform/tests/Feature/System/OpsControls/OperationalControlManagementTest.php
- apps/platform/tests/Unit/Support/OperationalControls/OperationalControlCatalogTest.php
- apps/platform/tests/Unit/Support/OperationalControls/OperationalControlScopeResolutionTest.php

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #280

2026-04-26 15:43:47 +00:00

Research — Operational Controls

Date: 2026-04-26
Spec: spec.md

This document captures design decisions and supporting rationale for the first operational-controls slice. All decisions are grounded in current repository truth and the TenantPilot Constitution.

Decision 1 — Persist only active pause records, derive the enabled state, and let global pauses win

Decision: Store only explicit active control activations that pause a control. Do not persist enabled rows or a broader multi-state lifecycle. The effective enabled state is derived from the absence of an active matching activation, and a matching global pause wins over a narrower workspace pause in v1.

Rationale:

- The operator problem is safe runtime pause control, not a new workflow state machine.
- Constitution PERSIST-001 and STATE-001 favor the smallest persisted truth that changes behavior.
- Deriving enabled avoids importing a second layer of default-state maintenance.
- Global-first precedence is the safest bounded rule because a platform-wide incident pause must not be narrowed by a workspace-specific row in this first slice.

Evidence:

- The current code gap is an env-gated yes/no maintenance switch in apps/platform/app/Filament/Resources/FindingResource/Pages/ListFindings.php.
- The first slice only needs to answer one question at execution time: may this action start right now for this scope?
- The first slice does not support workspace-specific allow overrides, so no narrower row should reopen a globally paused control.

Alternatives considered:

- Persist both enabled and paused rows.
  - Rejected: unnecessary state duplication; absence of an active pause already means enabled.
- Add a larger status family such as draft, scheduled, paused, forced, emergency.
  - Rejected: too broad for current-release truth.
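
The precedence rule in Decision 1 can be sketched as a small pure function. This is an illustration in Python rather than the repository's PHP, and it makes several assumptions that the source does not state: the `Activation` and `Decision` types, the `evaluate` name, the use of `workspace_id=None` to mark a global pause, and the `restore.execute` control key are all hypothetical. It also assumes that a request without a workspace context is blocked only by a global pause.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Activation:
    """One active pause row (hypothetical shape). workspace_id=None means global."""
    control_key: str
    workspace_id: Optional[int] = None
    reason: str = ""


@dataclass(frozen=True)
class Decision:
    allowed: bool
    blocked_by: Optional[Activation] = None


def evaluate(control_key: str, workspace_id: Optional[int],
             active: Sequence[Activation]) -> Decision:
    """Derive the effective state from active pause rows only."""
    matching = [a for a in active if a.control_key == control_key]
    # Global pauses are checked first, so no narrower row can reopen the control.
    for activation in matching:
        if activation.workspace_id is None:
            return Decision(False, activation)
    # Otherwise only a pause for this exact workspace blocks the start.
    if workspace_id is not None:
        for activation in matching:
            if activation.workspace_id == workspace_id:
                return Decision(False, activation)
    # No matching row: "enabled" is derived and never stored.
    return Decision(True)


rows = [Activation("restore.execute", workspace_id=7, reason="tenant incident")]
print(evaluate("restore.execute", 7, rows).allowed)  # False
print(evaluate("restore.execute", 8, rows).allowed)  # True
```

Because enabled is simply the absence of a matching row, there is no default-state record to seed or keep in sync.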

Decision 2 — Use one platform-operated activation table instead of env flags or workspace settings

Decision: Introduce one platform-operated operational_control_activations table that can represent either a global pause or a workspace-scoped pause. Do not split truth across env flags, platform config, and workspace settings.

Rationale:

- The spec requires one auditable control truth across system and tenant surfaces.
- Existing workspace settings infrastructure is workspace-only and cannot represent one global platform-wide safety state cleanly.
- Env flags are invisible product truth and require deploy-time coordination.

Evidence:

- Existing workspace settings writer only manages workspace-scoped settings in apps/platform/app/Services/Settings/SettingsWriter.php.
- The current env gate lives in apps/platform/config/tenantpilot.php and is consumed directly in ListFindings.

Alternatives considered:

- Reuse workspace settings for workspace overrides and keep a global env flag.
  - Rejected: split truth, inconsistent audit semantics, and no single effective-state evaluator.
- Use env flags only.
  - Rejected: not operator-visible or auditable in-product.
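
To make the single-table idea in Decision 2 concrete, here is a minimal sketch using Python's built-in `sqlite3` module instead of the repository's actual migration. Only the table name `operational_control_activations` comes from this document. The column names, the NULL-means-global encoding, the delete-on-resume behavior, and the `open_store`, `pause`, `resume`, and `active_pauses` helpers are assumptions made for illustration.

```python
import sqlite3
from typing import List, Optional, Tuple


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with one table for global and workspace pauses."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE operational_control_activations (
            id INTEGER PRIMARY KEY,
            control_key TEXT NOT NULL,
            workspace_id INTEGER,      -- NULL marks a global pause (assumed encoding)
            reason TEXT NOT NULL,
            activated_by TEXT NOT NULL
        )
        """
    )
    return conn


def pause(conn: sqlite3.Connection, control_key: str,
          workspace_id: Optional[int], reason: str, actor: str) -> None:
    conn.execute(
        "INSERT INTO operational_control_activations "
        "(control_key, workspace_id, reason, activated_by) VALUES (?, ?, ?, ?)",
        (control_key, workspace_id, reason, actor),
    )


def resume(conn: sqlite3.Connection, control_key: str,
           workspace_id: Optional[int]) -> None:
    # Assumed: resuming deletes the row. `IS ?` matches NULL as well as ids.
    conn.execute(
        "DELETE FROM operational_control_activations "
        "WHERE control_key = ? AND workspace_id IS ?",
        (control_key, workspace_id),
    )


def active_pauses(conn: sqlite3.Connection, control_key: str,
                  workspace_id: Optional[int]) -> List[Tuple[Optional[int], str]]:
    """Return pauses that apply to this scope, global rows first."""
    return conn.execute(
        "SELECT workspace_id, reason FROM operational_control_activations "
        "WHERE control_key = ? AND (workspace_id IS NULL OR workspace_id = ?) "
        "ORDER BY (workspace_id IS NOT NULL), id",
        (control_key, workspace_id),
    ).fetchall()
```

One query answers the question for both planes, which is what the split env-flag plus workspace-settings alternative could not offer.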

Decision 3 — Evaluate controls at the start seam, not only in UI visibility

Decision: Integrate control evaluation at the concrete start seams that already own execution decisions: FindingsLifecycleBackfillRunbookService::start() for all findings lifecycle backfill callers, and queued restore execution before OperationRun or provider dispatch begins.

Rationale:

- UI-only hiding would fail the safety requirement because direct requests or stale page state could still start execution.
- The repo already has clear start seams where action or service logic decides whether a run begins.
- This keeps blocked-state truth server-side and shared.

Evidence:

- Findings lifecycle backfill starts in apps/platform/app/Services/Runbooks/FindingsLifecycleBackfillRunbookService.php and is called from the system runbooks page, tenant findings page, tenantpilot:findings:backfill-lifecycle, and tenantpilot:run-deploy-runbooks.
- Restore execution starts in apps/platform/app/Filament/Resources/RestoreRunResource.php and already routes provider-backed starts through apps/platform/app/Services/Providers/ProviderOperationStartGate.php.

Alternatives considered:

- Hide or disable actions in UI only.
  - Rejected: violates the server-side enforcement requirement.
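
The start-seam argument in Decision 3 can be illustrated with a small Python sketch. It is not the repository's PHP service: `BackfillService`, `ControlPaused`, `page_action`, `cli_command`, the injected `is_paused` callable, and the `findings.lifecycle_backfill` key are hypothetical stand-ins. The point it shows is that when the check lives inside `start()`, every caller is covered and a blocked attempt leaves no side effects.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


class ControlPaused(Exception):
    """Raised when an active pause blocks a start attempt."""


@dataclass
class BackfillService:
    # Injected evaluator: (control_key, workspace_id) -> True when paused.
    is_paused: Callable[[str, Optional[int]], bool]
    started: List[Dict[str, object]] = field(default_factory=list)

    def start(self, workspace_id: Optional[int], initiator: str) -> Dict[str, object]:
        # Enforcement sits here, before any run record or dispatch exists.
        if self.is_paused("findings.lifecycle_backfill", workspace_id):
            raise ControlPaused("findings.lifecycle_backfill is paused")
        run = {"workspace_id": workspace_id, "initiator": initiator}
        self.started.append(run)
        return run


def page_action(service: BackfillService, workspace_id: int) -> str:
    """UI caller: may hide its button, but still relies on start() to enforce."""
    try:
        service.start(workspace_id, initiator="page")
        return "queued"
    except ControlPaused:
        return "blocked"


def cli_command(service: BackfillService) -> str:
    """CLI caller: has no UI state at all, yet goes through the same seam."""
    try:
        service.start(None, initiator="cli")
        return "queued"
    except ControlPaused:
        return "blocked"
```

A stale page or a direct command invocation reaches the same `start()` call, so neither can bypass the pause.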

Decision 4 — Add one system ops controls page instead of surface-local toggles

Decision: Manage the first-slice controls from one dedicated system ops page under /system/ops/controls. Do not add per-page toggles or bury control changes inside each affected surface. The page shows effective-state summaries by default and exposes change history through on-demand audit links instead of creating a second history surface.

Rationale:

- Operators need one place to make the runtime-safety decision itself.
- Constitution DECIDE-001 and the spec’s decision-role table require a primary decision surface for control management.
- A shared control center prevents drift between runbooks, findings, and restore surfaces.

Evidence:

- The repo already groups ops surfaces under apps/platform/app/Filament/System/Pages/Ops/.
- Existing runbooks and run viewers are already system-plane ops surfaces, so a sibling controls page fits the current information architecture.

Alternatives considered:

- Add a toggle to the runbooks page only.
  - Rejected: restore execution is not owned by that page and the control decision would stay fragmented.

Decision 5 — Break-glass does not bypass operational controls in v1

Decision: Break-glass sessions do not automatically bypass active operational controls in the first slice.

Rationale:

- Operational controls are introduced as runtime-safety truth, not as optional UI friction.
- An implicit bypass would make incident behavior ambiguous and weaken auditability.
- The first slice stays safer by forcing an explicit resume action before execution.

Evidence:

- The system runbook page already has break-glass-aware reason requirements via BreakGlassSession, but operational controls are a distinct safety layer.

Alternatives considered:

- Let break-glass ignore controls.
  - Rejected: too risky for v1 and not required by current operator pain.

Decision 6 — Reuse existing audit and start-result helpers, but keep global audits platform-scoped

Decision: Record workspace-targeted changes, and blocked-execution evidence that has concrete workspace or tenant context, through WorkspaceAuditLogger plus AuditActionId. Record global control changes and blocked system-plane all-tenant attempts through AuditRecorder directly, so they remain platform-plane events without false workspace ownership, and include requested-scope metadata on those platform-plane blocked attempts. Keep blocked/allowed execution messaging on the existing operation/provider start-result helpers.

Rationale:

- Constitution XCUT-001 requires reuse of existing shared interaction paths.
- The repo already has shared primitives for queued toasts, dedupe messaging, and audit summaries.
- This avoids a second language for blocked execution.
- WorkspaceAuditLogger requires a Workspace, while AuditRecorder already supports null workspace and null tenant for truthful system-plane events.

Evidence:

- Audit logging lives in apps/platform/app/Services/Audit/WorkspaceAuditLogger.php.
- Global system-plane audit support lives in apps/platform/app/Services/Audit/AuditRecorder.php.
- Canonical audit IDs live in apps/platform/app/Support/Audit/AuditActionId.php.
- Provider-backed start messaging already routes through ProviderOperationStartResultPresenter and OperationUxPresenter.

Alternatives considered:

- Emit page-local notifications and free-form audit action strings.
  - Rejected: immediate drift risk and weaker reviewability.
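
The routing rule in Decision 6 can be sketched as follows. This Python illustration does not reproduce the real WorkspaceAuditLogger or AuditRecorder signatures, which this document does not show. `AuditSink`, its `record` method, the `ActionId` members, and the `all_tenants` scope label are hypothetical; the two lists only stand in for the workspace and platform audit paths.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class ActionId(Enum):
    # Hypothetical members standing in for canonical AuditActionId values.
    CONTROL_PAUSED = "operational_control.paused"
    EXECUTION_BLOCKED = "operational_control.execution_blocked"


@dataclass
class AuditSink:
    """Stand-in for both audit paths; records which one an event took."""
    workspace_events: List[Dict[str, object]] = field(default_factory=list)
    platform_events: List[Dict[str, object]] = field(default_factory=list)

    def record(self, action: ActionId, workspace_id: Optional[int] = None,
               tenant_id: Optional[int] = None,
               requested_scope: Optional[str] = None) -> str:
        event: Dict[str, object] = {
            "action": action.value,
            "workspace_id": workspace_id,
            "tenant_id": tenant_id,
        }
        if workspace_id is not None:
            # Concrete workspace context: use the workspace-scoped path.
            self.workspace_events.append(event)
            return "workspace"
        # No workspace: keep the event platform-scoped instead of inventing an owner.
        if requested_scope is not None:
            event["metadata"] = {"requested_scope": requested_scope}
        self.platform_events.append(event)
        return "platform"


sink = AuditSink()
print(sink.record(ActionId.CONTROL_PAUSED, workspace_id=7))  # workspace
print(sink.record(ActionId.EXECUTION_BLOCKED, requested_scope="all_tenants"))  # platform
```

Using an enum rather than free-form strings mirrors the rejected alternative above: action names stay reviewable in one place.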

Decision 7 — Proof stays in Unit + Feature lanes only

Decision: Keep proof in focused unit and feature tests. Do not introduce browser tests or heavy-governance coverage for this first slice.

Rationale:

- The business truth is effective-state evaluation, audit recording, and blocked/no-side-effect execution.
- Browser coverage would mostly duplicate existing Filament modal behavior.
- Constitution TEST-GOV-001 requires the narrowest proving lane mix.

Evidence:

- Existing system runbooks and restore features already have focused feature coverage patterns in the repo.
- The new logic is server-side and deterministic.

Alternatives considered:

- Add browser smoke for pause/resume flows.
  - Rejected: not needed to prove the core runtime-safety semantics of this slice.
