# Feature Specification: Operational Controls

**Feature Branch**: `242-operational-controls`
**Created**: 2026-04-26
**Status**: Draft
**Input**: User description: "Operational Controls & Feature Flags: create a narrow first slice that replaces ad-hoc environment-gated risky admin maintenance actions with a central audited operational control path. Reuse the existing system panel, platform capabilities, audit logging, and server-side action/service enforcement to let operators pause or disable selected high-risk features with explicit disabled-state messaging that is distinct from authorization failure, without turning this into a generic experimentation or entitlement platform."

## Spec Candidate Check *(mandatory - SPEC-GATE-001)*

- **Problem**: TenantPilot already has risky actions that can only be paused through local environment flags, deploy-time changes, or ad-hoc code decisions instead of one product-owned operational control contract.
- **Today's failure**: During an incident or rollout concern, operators cannot centrally pause in-scope risky flows such as findings lifecycle backfill or restore execution with consistent UX, auditable ownership, and server-side enforcement. The current `allow_admin_maintenance_actions` environment gate makes one tenant admin action appear or disappear outside the product, while similar runbook and provider-backed actions have no shared pause contract.
- **User-visible improvement**: Platform operators can pause selected high-risk actions from a system-plane control surface, and affected admin/operator surfaces show explicit paused-state messaging instead of disappearing silently, looking unauthorized, or requiring a deployment to change runtime behavior.
- **Smallest enterprise-capable version**: Introduce one bounded operational-control contract for two concrete first-slice controls - `findings.lifecycle.backfill` and `restore.execute` - with global and workspace-targeted activation, one system-plane management surface with on-demand audit history, and server-side enforcement on the existing runbook, findings-maintenance, and restore-execution start paths.
- **Explicit non-goals**: No A/B testing, no customer-managed feature flags, no generic remote-config platform, no entitlement/billing replacement, no tenant-scoped self-service flags, no broad maintenance-mode replacement for the whole app, and no speculative control catalog for every future feature.
- **Permanent complexity imported**: One operational-control catalog, one persisted control-activation record family with explicit scope and reason, one evaluation service at action/service boundaries, a small amount of shared paused-state copy/presentation, audit action IDs, and focused unit/feature/guard coverage.
- **Why now**: The repo already exposes the control gap in live code through `config('tenantpilot.allow_admin_maintenance_actions')`, while live pilots and founder-operated support increase the need for safe runtime pause controls before more onboarding, support, AI, and customer-facing workflows land.
- **Why not local**: A local config flag or page-specific guard cannot safely cover both system-plane runbooks and tenant-plane provider-backed execution, cannot produce one auditable truth, and teaches parallel runtime-control semantics across surfaces.
- **Approval class**: Core Enterprise
- **Red flags triggered**: New meta-infrastructure, foundation-sounding scope. Defense: the slice is intentionally limited to two real existing high-risk controls, one management surface, and one shared evaluator instead of a universal experimentation or entitlement platform.
- **Score**: Value: 2 | Urgency: 2 | Scope: 2 | Complexity: 1 | Product proximity: 2 | Reuse: 2 | **Total: 11/12**
- **Decision**: approve

## Spec Scope Fields *(mandatory)*

- **Scope**: platform, workspace, tenant
- **Primary Routes**:
  - New system-plane operational controls surface under `/system/ops/controls`
  - Existing system runbook launcher at `/system/ops/runbooks`
  - Existing tenant findings register at `/admin/t/{tenant}/findings`
  - Existing restore execution start flow in the restore-run create surface under `/admin/t/{tenant}/restore-runs/create`; existing restore record-view actions remain unchanged and out of scope for this slice
- **Data Ownership**:
  - Control definitions remain platform-owned catalog truth in code, limited to the first-slice keys `findings.lifecycle.backfill` and `restore.execute`
  - Control activations are platform-operated runtime-safety records; workspace-targeted activations reference a workspace, while global activations apply to all workspaces without embedding tenant-owned data
  - No tenant-owned control records are introduced in this slice; tenant/admin surfaces consume effective control decisions only
  - Audit history stays on existing `AuditLog` truth with stable action IDs for control activation, update, removal, and blocked execution. Audit scope follows the real context: global control changes are platform-plane audit events with no false workspace or tenant owner; workspace-targeted changes and blocked starts with concrete workspace/tenant context retain truthful workspace/tenant audit scope; blocked system-plane attempts without a concrete workspace/tenant resolve to platform-plane audit events with requested-scope metadata
- **RBAC**:
  - Management happens only in the platform `/system` plane and requires `PlatformCapabilities::ACCESS_SYSTEM_PANEL` plus a dedicated operational-controls management capability
  - Existing tenant/admin capabilities remain authoritative for the underlying in-scope actions (`findings.lifecycle.backfill`, `restore.execute`)
  - Non-members or non-entitled users still receive 404 on tenant/workspace boundaries; members lacking the underlying capability still receive 403 and continue to follow the existing surface-specific capability-denied UX with no paused-state helper text; entitled users blocked only by an active operational control receive explicit paused-state feedback that is distinct from authorization failure

For canonical-view specs, the spec MUST define:

- **Default filter behavior when tenant-context is active**: N/A - this slice does not add a new canonical collection route in `/admin`; it affects existing tenant and system execution surfaces
- **Explicit entitlement checks preventing cross-tenant leakage**: Control evaluation never weakens existing tenant/workspace membership checks. Tenant-plane surfaces resolve tenant entitlement first, then evaluate the effective control state only for already-entitled users.
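
The audit-ownership rule listed under Data Ownership can be read as a small decision function. The following is a minimal illustrative sketch in Python, not the Laravel implementation. `AuditScope`, `resolve_audit_scope`, and all field names are hypothetical and only mirror the three cases described above. It assumes a tenant context never appears without its workspace.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AuditScope:
    """Hypothetical shape of the ownership recorded on an audit entry."""
    plane: str                          # "platform" or "workspace"
    workspace_id: Optional[int] = None
    tenant_id: Optional[int] = None
    metadata: dict = field(default_factory=dict)


def resolve_audit_scope(workspace_id=None, tenant_id=None, requested_scope=None):
    """Pick truthful audit ownership for a control change or blocked start."""
    if workspace_id is not None:
        # Concrete workspace (and optionally tenant) context is kept as-is.
        return AuditScope("workspace", workspace_id, tenant_id)
    # Global changes and system-plane attempts without a concrete workspace
    # are platform-plane events; no workspace or tenant owner is invented.
    metadata = {"requested_scope": requested_scope} if requested_scope else {}
    return AuditScope("platform", metadata=metadata)
```

For example, a blocked all-tenant runbook attempt would call `resolve_audit_scope(requested_scope="all_tenants")` and land on the platform plane with the requested scope preserved as metadata only.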

## Cross-Cutting / Shared Pattern Reuse *(mandatory when the feature touches notifications, status messaging, action links, header actions, dashboard signals/cards, alerts, navigation entry points, evidence/report viewers, or any other existing shared operator interaction family; otherwise write `N/A - no shared interaction family touched`)*

- **Cross-cutting feature?**: yes
- **Interaction class(es)**: header actions, runbook launch actions, provider-backed start gating, status messaging, audit prose
- **Systems touched**: system runbooks, tenant findings maintenance action, `FindingsLifecycleBackfillRunbookService::start()` plus its CLI and deploy-hook callers, restore execution start path, existing audit logging, operation start UX, existing capability enforcement helpers
- **Existing pattern(s) to extend**: `UiEnforcement`, `ProviderOperationStartGate`, `ProviderOperationStartResultPresenter`, `OperationUxPresenter`, `AuditRecorder`, `WorkspaceAuditLogger`, and existing system/tenant operation-link helpers
- **Shared contract / presenter / builder / renderer to reuse**: one new operational-control evaluator is allowed, but it must sit beside existing capability and provider-start gates instead of creating new page-local flag logic. Existing audit and start-result presenters remain authoritative for labels, reasons, action/result messaging, and truthful system-plane versus workspace-plane audit ownership.
- **Why the existing shared path is sufficient or insufficient**: existing shared paths already solve authorization, audit recording, and start-result UX. They are insufficient because none of them currently carry one central runtime-safety decision that can pause an action consistently across tenant and system surfaces.
- **Allowed deviation and why**: none. The first slice must remove the ad-hoc environment flag for in-scope maintenance actions rather than adding another exception path.
- **Consistency impact**: control labels, paused-state wording, reason display, audit action IDs, and allow/block semantics must match across the controls page, runbooks page, findings list header action, restore execution start flow, and any related notifications.
- **Review focus**: reviewers must block new direct `config(...)` or env-based runtime gates for in-scope operational controls and verify that findings lifecycle backfill routes through `FindingsLifecycleBackfillRunbookService::start()` plus the shared evaluator for UI, CLI, and deploy-hook callers, while restore continues through its existing start seam and presenters.

## OperationRun UX Impact *(mandatory when the feature creates, queues, deduplicates, resumes, blocks, completes, or deep-links to an `OperationRun`; otherwise write `N/A - no OperationRun start or link semantics touched`)*

- **Touches OperationRun start/completion/link UX?**: yes
- **Shared OperationRun UX contract/layer reused**: `OperationRunService`, `OperationUxPresenter`, `OpsUxBrowserEvents`, `ProviderOperationStartResultPresenter`, `OperationRunLinks`, and `SystemOperationRunLinks`
- **Delegated start/completion UX behaviors**: when an action is allowed, queued toast, `View run` link, run-enqueued browser event, dedupe-or-already-queued messaging, and tenant/workspace-safe URL resolution remain delegated to the existing shared paths. When a control blocks execution, the surface reuses the shared start-result or notification path for one explicit paused-state message and does not invent a second blocked-run dialect.
- **Local surface-owned behavior that remains**: initiation inputs, confirmation text, and scope selection remain local to the runbook page, findings list page, or restore workflow. Local code does not decide operational-control truth.
- **Queued DB-notification policy**: unchanged. This slice does not introduce new queued DB notifications for paused or allowed starts.
- **Terminal notification path**: unchanged central lifecycle mechanism for runs that do start.
- **Exception required?**: none

## Provider Boundary / Platform Core Check *(mandatory when the feature changes shared provider/platform seams, identity scope, governed-subject taxonomy, compare strategy selection, provider connection descriptors, or operator vocabulary that may leak provider-specific semantics into platform-core truth; otherwise write `N/A - no shared provider/platform boundary touched`)*

- **Shared provider/platform boundary touched?**: yes
- **Boundary classification**: mixed
- **Seams affected**: platform-core operational-control vocabulary, restore execution provider-start boundary, shared operator messaging for blocked execution
- **Neutral platform terms preserved or introduced**: operational control, effective state, paused, scope, reason, expiry, owner, override
- **Provider-specific semantics retained and why**: `restore.execute` remains a provider-owned operation and keeps its current Microsoft-only execution path and dry-run safeguards. The control system only governs whether that path may start; it does not rename or generalize restore semantics.
- **Why this does not deepen provider coupling accidentally**: the control catalog is platform-owned and names operation keys that already exist. Provider-specific behavior stays inside the existing restore-start path and provider registry.
- **Follow-up path**: none for the first slice; broader catalog growth remains a follow-up decision, not an implied obligation of this spec

## UI / Surface Guardrail Impact *(mandatory when operator-facing surfaces are changed; otherwise write `N/A`)*

| Surface / Change | Operator-facing surface change? | Native vs Custom | Shared-Family Relevance | State Layers Touched | Exception Needed? | Low-Impact / `N/A` Note |
|---|---|---|---|---|---|---|
| System ops controls surface | yes | Native Filament + shared primitives | status messaging, audit-backed actions, control state summary | page, card/action state, modal | no | New system-plane control center for a bounded first-slice catalog |
| System runbooks launcher | yes | Native Filament + shared runbook/start UX | run start messaging, confirmation flow, blocked-state messaging | page, action, preflight state | no | Existing page gains operational-control awareness only |
| Tenant findings list header action | yes | Native Filament + existing action-surface primitives | header actions, run start messaging | table, header action, modal | no | Existing maintenance action loses env-flag gating and becomes control-aware |
| Restore run create/start workflow | yes | Native Filament resource + shared provider start gate | provider-backed start result, disabled-state copy | form/wizard, create action, start-result state | no | Existing risky tenant workflow gains central pause semantics without new tenant-side control UI |

## Decision-First Surface Role *(mandatory when operator-facing surfaces are changed)*

| Surface | Decision Role | Human-in-the-loop Moment | Immediately Visible for First Decision | On-Demand Detail / Evidence | Why This Is Primary or Why Not | Workflow Alignment | Attention-load Reduction |
|---|---|---|---|---|---|---|---|
| System ops controls surface | Primary Decision Surface | Decide whether one risky feature should stay available, be paused, or be scoped down during an incident or rollout | control name, effective scope, paused/enabled state, reason, owner, expiry | change history, affected actions, audit links | Primary because this is the system-plane place where operators make the runtime-safety decision itself | Follows incident-control and rollout workflow, not feature storage structure | Replaces deploy-time or env-level toggling with one visible operational decision point |
| System runbooks launcher | Secondary Context Surface | Decide whether to preflight or start a runbook once the control state is already known | current control state, preflight, confirmation requirements, next safe action | existing run detail after start, control reason history | Secondary because the main decision here is execution of a specific runbook, not control management | Keeps runbook workflow intact while surfacing control truth inline | Avoids surprise 403s or silent disappearance when the runbook is paused |
| Tenant findings list header action | Secondary Context Surface | Decide whether to start tenant findings backfill | header action availability, paused-state message, tenant scope | run detail only after allowed start | Secondary because the list remains the primary findings workflow; runtime control is supporting context | Preserves list-first findings work while exposing truthful blocked state | Removes hidden env-driven behavior drift on one tenant surface |
| Restore run create/start workflow | Secondary Context Surface | Decide whether a restore may proceed now | effective control state, restore-specific next action, existing safety messaging | preview, diff, and run detail when allowed | Secondary because restore creation remains the main operator decision and control state is a gating constraint | Keeps restore workflow focused while making pause state explicit before execution | Prevents risky restore attempts from failing late or ambiguously |

## UI/UX Surface Classification *(mandatory when operator-facing surfaces are changed)*

| Surface | Action Surface Class | Surface Type | Likely Next Operator Action | Primary Inspect/Open Model | Row Click | Secondary Actions Placement | Destructive Actions Placement | Canonical Collection Route | Canonical Detail Route | Scope Signals | Canonical Noun | Critical Truth Visible by Default | Exception Type / Justification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| System ops controls surface | Utility / System | Operational safety control center | Pause, resume, or scope a control | Explicit card action or modal from the page itself | forbidden | Secondary details live in card reveals or modal details only | State-changing control actions are confirmation-protected and stay on the control card/modal | `/system/ops/controls` | `/system/ops/controls` | Global (all workspaces) or workspace-targeted scope | Operational controls / Operational control | Effective state, reason, scope, and expiry | none |
| System runbooks launcher | Monitoring / Queue / Workbench | System runbook launcher | Preflight or start a runbook | In-page action modal then `View run` after success | forbidden | Related navigation stays secondary in toast actions or page summaries | Dangerous execution remains in the existing `Run...` action with confirmation | `/system/ops/runbooks` | `/system/ops/runbooks` | Current control state and selected tenant/all-tenant scope | Runbooks / Runbook | Whether execution is allowed right now and why | none |
| Tenant findings list header action | List / Table / Bulk | List-first resource | Open findings or start lifecycle backfill | Existing table inspection remains primary; header action is explicit secondary execution | required | Secondary execution stays in the header only | Backfill remains confirmation-protected in the header action | `/admin/t/{tenant}/findings` | existing findings detail route | Tenant scope and effective control state for entitled users | Findings / Finding | Findings list truth plus explicit maintenance availability | none |
| Restore run create/start workflow | Wizard / Flow | Create and launch flow | Continue restore setup or stop because execution is paused | Existing create form/wizard remains primary | forbidden | Secondary navigation lives in helper links and post-start run links | Existing restore execution remains inside the create/start flow with its current safety steps | `/admin/t/{tenant}/restore-runs` | `/admin/t/{tenant}/restore-runs/{record}` | Tenant scope, preview/dry-run state, and effective control state | Restore runs / Restore run | Whether restore execution may proceed, with scope and reason | none |

## Operator Surface Contract *(mandatory when operator-facing surfaces are changed)*

| Surface | Primary Persona | Decision / Operator Action Supported | Surface Type | Primary Operator Question | Default-visible Information | Diagnostics-only Information | Status Dimensions Used | Mutation Scope | Primary Actions | Dangerous Actions |
|---|---|---|---|---|---|---|---|---|---|---|
| System ops controls surface | Platform operator / break-glass operator | Decide whether risky runtime actions remain enabled, paused globally, or paused for one workspace | Control center | Is this risky action allowed right now, for whom, and why? | control label, effective state, scope, reason, owner, expiry | audit history, exact affected surfaces, internal notes | runtime safety state, scope, expiry | TenantPilot only | Pause control, Resume control, Change scope | Pausing or resuming a control |
| System runbooks launcher | Platform operator | Decide whether to preflight/start the runbook or respect a pause control | Workbench | Can I run this runbook now? | current control state, runbook scope, preflight result, confirmation requirements | latest run detail and audit history after navigation | runtime safety state, execution readiness, preflight result | TenantPilot only when blocked; Microsoft tenant or tenant data changes only when allowed and executed | Preflight, Run | Run |
| Tenant findings list header action | Tenant manager / owner | Decide whether findings lifecycle backfill may start for the current tenant | List-first resource + secondary header action | Is lifecycle backfill available for this tenant right now? | explicit paused/enabled state for entitled users, tenant scope, action label | run detail only if execution is allowed and started | runtime safety state, execution readiness | TenantPilot only when blocked; tenant data mutation if execution is allowed | Backfill findings lifecycle | Backfill findings lifecycle |
| Restore run create/start workflow | Tenant manager / owner | Decide whether restore execution may proceed after existing safety checks | Guided creation flow | Can this restore execute now, or is the operation paused? | control state, restore scope, dry-run/preview state, next action | preview diff, post-start run detail, raw diagnostics after navigation | runtime safety state, lifecycle, restore safety/preflight state | TenantPilot only when blocked; Microsoft tenant when execution is allowed | Create restore run, Continue preview | Execute restore |

## Proportionality Review *(mandatory when structural complexity is introduced)*

- **New source of truth?**: yes
- **New persisted entity/table/artifact?**: yes
- **New abstraction?**: yes
- **New enum/state/reason family?**: yes, one bounded enabled/paused effective-state axis for the control contract
- **New cross-domain UI framework/taxonomy?**: no
- **Current operator problem**: Operators cannot safely pause already-existing risky actions without deploy-time flags, inconsistent UX, or page-local code branches.
- **Existing structure is insufficient because**: authorization and provider-start gates decide who may act, not whether the product should temporarily allow the action at all. The current env flag is invisible product truth and cannot cover system-plane plus tenant-plane paths consistently.
- **Narrowest correct implementation**: a code-owned two-control catalog plus persisted control activations, one evaluator, one management surface, and two concrete enforcement families (`findings.lifecycle.backfill`, `restore.execute`).
- **Ownership cost**: new runtime-safety records, audit action IDs, shared paused-state copy, evaluator tests, and guard coverage that blocks new ad-hoc runtime gates.
- **Alternative intentionally rejected**: keep using env flags, rely on full Laravel maintenance mode, or build a generic customer-facing feature-flag system. Env flags are too hidden, maintenance mode is too broad, and a generic flag platform is too large.
- **Release truth**: current-release truth
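
To make the "code-owned two-control catalog" concrete, here is a minimal sketch in Python (illustrative only, not the Laravel code). `ControlDefinition`, `CATALOG`, `definition_for`, and the label strings are hypothetical; only the two control keys come from this spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlDefinition:
    key: str
    label: str  # illustrative label; real copy is not defined in this spec


# The first-slice catalog is limited to exactly these two keys.
CATALOG = {
    "findings.lifecycle.backfill": ControlDefinition(
        "findings.lifecycle.backfill", "Findings lifecycle backfill"),
    "restore.execute": ControlDefinition(
        "restore.execute", "Restore execution"),
}


def definition_for(key: str) -> ControlDefinition:
    """Return the catalog entry, rejecting keys outside the bounded catalog."""
    try:
        return CATALOG[key]
    except KeyError:
        raise ValueError(f"Unknown operational control: {key}") from None
```

Rejecting unknown keys at lookup time is one way to keep the catalog from growing implicitly: a new control would require an explicit catalog change rather than a new string at a call site.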

### Compatibility posture

This feature assumes a pre-production environment.

Backward compatibility, legacy aliases, migration shims, historical fixtures, and compatibility-specific tests are out of scope unless explicitly required by this spec.

Canonical replacement is preferred over preservation.

## Testing / Lane / Runtime Impact *(mandatory for runtime behavior changes)*

- **Test purpose / classification**: Unit, Feature
- **Validation lane(s)**: fast-feedback, confidence
- **Why this classification and these lanes are sufficient**: the feature introduces one shared evaluator plus a small number of concrete UI/service enforcement points. Unit tests prove effective-state resolution, scope precedence, expiry, and block reasons. Feature tests prove system-plane management, tenant-plane and system-plane blocked execution, audit logging, and unchanged 404/403 semantics without browser-specific behavior.
- **New or expanded test families**: focused operational-controls unit coverage, system-page management tests, findings-maintenance gate tests, restore-execution gate tests, and one guard test blocking new ad-hoc config gates for in-scope controls
- **Fixture / helper cost impact**: moderate. Tests reuse existing platform users, workspaces, tenants, OperationRun, restore-run, and findings fixtures. No new browser harness, provider emulator, or heavy governance suite is required.
- **Heavy-family visibility / justification**: none
- **Special surface test profile**: standard-native-filament, monitoring-state-page
- **Standard-native relief or required special coverage**: ordinary Filament feature coverage is sufficient for the controls page and the affected admin/system surfaces, plus explicit server-side assertions that blocked actions create no run or provider execution side effect, all-tenant blocked runbook attempts audit truthfully, and later control activation does not rewrite already accepted runs.
- **Reviewer handoff**: confirm that the env-gated findings action is now evaluator-driven, restore execution is blocked before queue/provider start, entitled-but-paused users see explicit operational-control messaging, non-entitled users still get 404, members lacking the underlying capability still get 403, and audit entries record scope/reason/actor for control changes.
- **Budget / baseline / trend impact**: low-to-moderate increase in narrow unit and feature coverage only
- **Escalation needed**: none
- **Active feature PR close-out entry**: Guardrail
- **Planned validation commands**:
  - `export PATH="/bin:/usr/bin:/usr/local/bin:$PATH" && cd apps/platform && ./vendor/bin/sail artisan test --compact tests/Unit/Support/OperationalControls/OperationalControlCatalogTest.php tests/Unit/Support/OperationalControls/OperationalControlEvaluatorTest.php tests/Unit/Support/OperationalControls/OperationalControlScopeResolutionTest.php`
  - `export PATH="/bin:/usr/bin:/usr/local/bin:$PATH" && cd apps/platform && ./vendor/bin/sail artisan test --compact tests/Feature/System/OpsControls/OperationalControlManagementTest.php tests/Feature/System/OpsRunbooks/OperationalControlRunbookGateTest.php`
  - `export PATH="/bin:/usr/bin:/usr/local/bin:$PATH" && cd apps/platform && ./vendor/bin/sail artisan test --compact tests/Feature/Findings/OperationalControlFindingsBackfillGateTest.php tests/Feature/Restore/OperationalControlRestoreExecutionGateTest.php tests/Feature/OperationalControls/OperationalControlAuthorizationSemanticsTest.php`
  - `export PATH="/bin:/usr/bin:/usr/local/bin:$PATH" && cd apps/platform && ./vendor/bin/sail artisan test --compact tests/Feature/OperationalControls/NoAdHocOperationalControlBypassTest.php`
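
The guard coverage named above (blocking new ad-hoc config gates) could work roughly like a source scan. This is a hedged Python sketch of the idea only; the planned guard is a PHP test, and `find_ad_hoc_gates` plus the single pattern below are assumptions. A real guard would likely cover more patterns and exclude the retired config definition itself.

```python
import re
from pathlib import Path

# The retired gate named in this spec; a real guard may list more patterns.
FORBIDDEN = re.compile(r"allow_admin_maintenance_actions")


def find_ad_hoc_gates(root: str) -> list[str]:
    """Return 'path:line' hits for the retired gate in PHP files under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.php")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if FORBIDDEN.search(line):
                hits.append(f"{path}:{number}")
    return hits
```

A guard test would then assert that the returned list is empty for the application source tree, so any reintroduced env-style gate fails the build with a file and line reference.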

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Pause A Risky Action Centrally (Priority: P1)

As a platform operator, I can pause one risky control from the system plane so the affected runbook and tenant surfaces stop allowing that action without requiring a deployment.

**Why this priority**: This is the operator-visible core of the feature and the main incident-response value.

**Independent Test**: Activate the control for `findings.lifecycle.backfill`, open the system controls surface and the affected runbook/findings surfaces, and verify that the action is visibly paused with one explicit reason and no execution path.

**Acceptance Scenarios**:

1. **Given** a platform operator pauses `findings.lifecycle.backfill` globally, **When** an entitled operator opens `/system/ops/runbooks` or an entitled tenant user opens `/admin/t/{tenant}/findings`, **Then** the action remains visible in its normal place but is explicitly blocked with paused-state messaging rather than disappearing or looking unauthorized.
2. **Given** the same control is resumed, **When** the affected surfaces reload, **Then** the normal execution path returns without a deploy or config-file change.

---

### User Story 2 - Block Execution Server-Side Without Masquerading As Auth (Priority: P1)

As an entitled operator, I want an active control to stop execution before any queued run or provider-backed action starts, while still preserving normal 404 and 403 authorization semantics.

**Why this priority**: The feature fails if it only hides UI or turns operational controls into fake authorization failures.

**Independent Test**: Activate controls for `findings.lifecycle.backfill` and `restore.execute`, attempt the affected actions through their normal pages, and assert that no `OperationRun` or provider-backed execution starts while entitlement and capability semantics remain unchanged.

**Acceptance Scenarios**:

1. **Given** an entitled tenant user has the underlying capability but `restore.execute` is paused for their workspace, **When** they attempt to start restore execution, **Then** the system returns explicit operational-control feedback, creates no new execution run, and makes no outbound provider call.
2. **Given** a user lacks workspace or tenant entitlement, **When** they attempt the same affected action, **Then** the system still responds as not found instead of revealing control-state details.
3. **Given** a user is entitled to the scope but lacks the underlying capability, **When** they attempt the affected action, **Then** the system still returns 403 rather than blaming the operational control.
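
The three scenarios above imply a fixed check order: entitlement first, capability second, operational control last. A minimal Python sketch of that ordering follows; `StartDecision` and `decide_start` are hypothetical names used for illustration and do not describe the existing enforcement helpers.

```python
from enum import Enum


class StartDecision(Enum):
    NOT_FOUND = 404      # no workspace/tenant entitlement: reveal nothing
    FORBIDDEN = 403      # entitled, but lacks the underlying capability
    PAUSED = "paused"    # entitled and capable, blocked only by a control
    ALLOWED = "allowed"


def decide_start(is_entitled: bool, has_capability: bool,
                 control_paused: bool) -> StartDecision:
    """Check entitlement, then capability, and only then the control state."""
    if not is_entitled:
        return StartDecision.NOT_FOUND
    if not has_capability:
        return StartDecision.FORBIDDEN
    if control_paused:
        return StartDecision.PAUSED
    return StartDecision.ALLOWED
```

Because the control state is consulted last, a paused control can never change a 404 or 403 outcome, and users outside the scope learn nothing about whether a control exists.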

---

### User Story 3 - Scope A Pause To One Workspace (Priority: P2)

As a platform operator, I can pause a risky control for one workspace without affecting unrelated workspaces, so incidents or staged rollouts stay bounded.

**Why this priority**: Workspace scoping is the smallest enterprise-capable version beyond a purely global kill switch and makes the feature reusable for future rollout control.

**Independent Test**: Create two workspaces, activate a workspace-scoped pause for one of them, and confirm that blocked behavior applies only to the targeted workspace while the other workspace continues normally.

**Acceptance Scenarios**:

1. **Given** `restore.execute` is paused for Workspace A only, **When** entitled users in Workspace A and Workspace B attempt restore execution, **Then** Workspace A is blocked with explicit paused-state messaging and Workspace B continues normally.
2. **Given** a workspace-scoped pause expires or is removed, **When** the targeted workspace retries the action, **Then** the action becomes available again without changing any unrelated workspace state.

### Edge Cases

- A workspace-scoped activation and a global activation may both exist for the same control; v1 precedence is global-first, and a matching global pause always wins.
- A control may expire while an operator is on the page; stale page state must not start a blocked action after expiry or removal.
- Break-glass platform access does not bypass an operational control in v1; any bypass path must be explicitly authorized by a later slice.
- An action may already be queued before a control is activated; the control governs new starts only and must not silently rewrite historical runs.
- Tenant/admin users who are not entitled to the workspace or tenant must not learn that a control exists for the hidden scope.
- The first slice must retire the in-scope env flag rather than leaving both the env gate and the control evaluator active in parallel.
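
The precedence and expiry rules in these edge cases can be sketched as a pure function. This Python example is a minimal illustration under stated assumptions: `Activation` and `effective_pause` are hypothetical names, a `workspace_id` of `None` stands for a global activation, and an activation is treated as expired from the exact `expires_at` instant onward (the spec does not define boundary behavior).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Activation:
    control_key: str
    workspace_id: Optional[int]          # None means a global activation
    reason: str
    expires_at: Optional[datetime] = None

    def is_active(self, now: datetime) -> bool:
        return self.expires_at is None or now < self.expires_at


def effective_pause(activations, control_key, workspace_id, now):
    """Return the activation that pauses the control, or None if enabled."""
    live = [a for a in activations
            if a.control_key == control_key and a.is_active(now)]
    # v1 precedence: a matching global pause always wins.
    for a in live:
        if a.workspace_id is None:
            return a
    # Otherwise only an activation targeting this workspace applies.
    for a in live:
        if workspace_id is not None and a.workspace_id == workspace_id:
            return a
    return None
```

Returning the winning activation rather than a boolean keeps its reason and scope available for paused-state messaging and audit entries.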

## Requirements *(mandatory)*

**Constitution alignment (required):** This feature introduces no new Graph endpoint family, but it changes the start boundary for existing queued/provider-backed actions. For in-scope controls, the spec requires server-side enforcement before `findings.lifecycle.backfill` or `restore.execute` start, preserves existing confirmation/audit patterns, and keeps long-running work observable through the current `OperationRun` paths whenever execution is allowed.

**Constitution alignment (PROP-001 / ABSTR-001 / PERSIST-001 / STATE-001 / BLOAT-001):** This feature introduces a new runtime-safety truth because the current product already needs it now: operators must pause risky actions without deploy-time env changes. The shape stays narrow: a two-key catalog, persisted activations, and one evaluator. It does not become a generalized experimentation, entitlement, or customer flag platform.

**Constitution alignment (XCUT-001):** The slice is cross-cutting across header actions, runbook starts, provider-backed start gates, and audit messaging. It reuses `UiEnforcement`, `ProviderOperationStartGate`, existing OperationRun UX presenters, and `WorkspaceAuditLogger` rather than introducing local blocked-state dialects.

**Constitution alignment (PROV-001):** `restore.execute` remains provider-owned, while operational-control vocabulary remains platform-core. The spec keeps provider specifics inside the existing restore path and uses neutral control language for scope, reason, and effective state.

**Constitution alignment (TEST-GOV-001):** Proof stays in narrow unit and feature coverage. No browser or heavy-governance family is justified. Reviewer handoff must explicitly verify lane fit, unchanged 404/403 semantics, and no hidden provider-side effects on blocked paths.

**Constitution alignment (OPS-UX):** For starts that are still allowed, the default Ops-UX 3-surface contract remains unchanged. `OperationRun.status` and `OperationRun.outcome` transitions remain service-owned. Paused starts create no `OperationRun`, no queued DB notification, and no new summary-count semantics.

**Constitution alignment (OPS-UX-START-001):** The feature includes the `OperationRun UX Impact` section and reuses the shared start UX paths. Local surfaces remain responsible only for initiation inputs and page-local confirmation text. Blocked-state feedback is delivered through existing result/notification helpers instead of page-local composition.

**Constitution alignment (RBAC-UX):** This slice spans the platform `/system` plane for control management and the tenant/admin `/admin` plane for affected execution surfaces. Cross-plane access remains 404. Non-members or non-entitled users receive 404. Members lacking the underlying capability receive 403 and keep the existing surface-specific capability-denied UX rather than paused-state helper text. Entitled users blocked only by an active control receive explicit operational-control feedback distinct from authorization failure. All mutating management actions require a staged safety flow with scope-impact preview, server-side capability checks, and confirmation. Break-glass does not bypass an active control in v1.
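
The resolution order described in the RBAC-UX paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: `Outcome` and `resolve_start_outcome` are hypothetical names, and the three boolean inputs stand in for the existing membership, capability, and control checks.

```python
from enum import Enum


class Outcome(Enum):
    NOT_FOUND = "404"    # cross-plane, non-member, or non-entitled
    FORBIDDEN = "403"    # member lacking the underlying capability
    PAUSED = "paused"    # entitled and capable, blocked only by a control
    ALLOWED = "allowed"


def resolve_start_outcome(
    is_entitled: bool,
    has_capability: bool,
    control_paused: bool,
    break_glass: bool = False,
) -> Outcome:
    """Resolve what an attempted start returns, in RBAC-first order."""
    # Authorization is resolved first so hidden scopes never leak control state.
    if not is_entitled:
        return Outcome.NOT_FOUND
    if not has_capability:
        return Outcome.FORBIDDEN
    # break_glass is deliberately ignored: it does not bypass a control in v1.
    if control_paused:
        return Outcome.PAUSED
    return Outcome.ALLOWED
```

Checking the control last is what keeps paused-state feedback distinct from authorization failure: a user only ever sees the paused outcome after passing both the 404 and 403 checks.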

**Constitution alignment (OPS-EX-AUTH-001):** Not applicable.

**Constitution alignment (BADGE-001):** If paused/enabled state is rendered as a badge or status chip, it must use centralized badge rendering or one shared control-state presentation path, not page-local color decisions.

**Constitution alignment (UI-FIL-001):** The controls page, runbooks page, findings page, and restore flow remain native Filament surfaces using existing action, section, infolist, and notification primitives. No local status card framework or custom blocked-state component library is introduced.

**Constitution alignment (UI-NAMING-001):** Primary operator-facing labels use stable verbs and nouns: `Pause control`, `Resume control`, `Operational controls`, `Backfill findings lifecycle`, and `Restore execution`. `Restore execution` is the control and status label, while `Execute restore` is the gated action label. Route-entry labels such as `New restore run` and `Create restore run` refer only to the ungated draft/setup flow. The same vocabulary must be preserved across buttons, modal titles, notifications, and audit prose.

**Constitution alignment (DECIDE-001):** The new system controls surface is the only new primary decision surface. Runbooks, findings, and restore remain secondary execution contexts that surface control truth inline instead of becoming separate troubleshooting flows.

**Constitution alignment (UI-CONST-001 / UI-SURF-001 / ACTSURF-001 / UI-HARD-001 / UI-EX-001 / UI-REVIEW-001 / HDR-001):** The controls page acts as a bounded control center with explicit action buttons and no competing inspect model. Existing runbooks, findings, and restore surfaces preserve their current primary inspect/open paths and action hierarchies while gaining one truthful blocked-state branch.

**Constitution alignment (ACTSURF-001 - action hierarchy):** Control management actions remain separated from navigation and diagnostics. The controls page owns pause/resume management. Runbooks, findings, and restore keep execution actions local but do not own control truth.

**Constitution alignment (OPSURF-001):** Default-visible content stays operator-first: whether the action is allowed, for which scope, and why. Raw internal control records or configuration internals stay secondary. Each affected execution action must state its mutation scope before execution when allowed, and blocked paths must state that no tenant/provider mutation will occur.

**Constitution alignment (UI-SEM-001 / LAYER-001 / TEST-TRUTH-001):** The spec allows one new evaluator because existing direct domain-to-UI mapping cannot express runtime-safety state consistently across system and tenant surfaces. No second presenter taxonomy or explanation framework is added beyond the minimum blocked-state copy path.

**Constitution alignment (Filament Action Surfaces):** The Action Surface Contract remains satisfied on all touched Filament surfaces. Each affected surface keeps one primary inspect/open model, no redundant `View` action is added, no empty action groups are introduced, and state-changing control actions require confirmation.

**Constitution alignment (UX-001 - Layout & Information Architecture):** The new controls page uses native Filament sections/cards for control summaries and action modals. Existing runbooks, findings, and restore pages keep their established layout patterns. Any blocked-state summary remains within the current page structure and does not add ad-hoc full-page exception layouts.

### Functional Requirements

- **FR-001**: System MUST define a central operational-control catalog for the first-slice keys `findings.lifecycle.backfill` and `restore.execute`.
- **FR-002**: Platform operators MUST be able to activate, update, and remove a control for all workspaces or one specific workspace from the system plane, with a human-readable reason and optional expiry, through a staged safety flow that previews scope impact before confirmation.
- **FR-003**: System MUST enforce the effective control state server-side before any of the following begins: an in-scope findings lifecycle backfill start at `FindingsLifecycleBackfillRunbookService::start()`, an affected maintenance action, or a provider-backed restore execution.
- **FR-004**: System MUST show explicit paused-state feedback to entitled users on affected surfaces and MUST keep that feedback distinct from authorization failure.
- **FR-005**: System MUST preserve existing 404 vs 403 semantics for non-membership and missing capability checks even when a control is active, and capability-denied members MUST follow the existing surface-specific denial UX rather than operational-control helper text.
- **FR-006**: System MUST create no new `OperationRun`, no queued execution `RestoreRun`, no queued job, and no outbound provider execution when an in-scope action is blocked by an active control, and MUST NOT retroactively mutate already accepted or historical runs when a control is activated later.
- **FR-007**: System MUST audit every control activation, update, removal, and blocked execution decision with stable action IDs, actor, scope, reason, and timestamp; global control changes MUST be recorded as platform-plane audit events without assigning a false workspace or tenant owner, and blocked system-plane attempts without a concrete workspace or tenant MUST be recorded as platform-plane events with requested-scope metadata.
- **FR-008**: The findings-maintenance action currently gated by `config('tenantpilot.allow_admin_maintenance_actions')` MUST be migrated to the shared operational-control path and the local env gate retired for this in-scope behavior.
- **FR-009**: System MUST expose enough effective-state information on the controls page and affected execution surfaces to make the operator's next action clear without opening raw config or database detail.
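
FR-003, FR-006, and FR-007 together describe a start gate that checks the control before any side effect and audits the blocked decision. The sketch below illustrates that ordering only. `guarded_start`, `StartResult`, the injected callables, the `plane` values, and the action ID `operational_control.execution_blocked` are all hypothetical; the real paths are the existing runbook, findings-maintenance, and restore start services.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class StartResult:
    started: bool
    run_id: Optional[int] = None
    paused_reason: Optional[str] = None


def guarded_start(
    control_key: str,
    actor: str,
    workspace_id: Optional[int],
    pause_reason: Callable[[str, Optional[int]], Optional[str]],
    create_run: Callable[[], int],
    audit: Callable[[dict], None],
    now: Optional[datetime] = None,
) -> StartResult:
    """Start the action only when no control is active for the scope."""
    # pause_reason returns the active control's reason, or None when allowed.
    reason = pause_reason(control_key, workspace_id)
    if reason is None:
        # Allowed path: the existing start logic creates the run as before.
        return StartResult(started=True, run_id=create_run())
    # Blocked path: no run, no job, no provider call; only an audit event.
    audit({
        "action_id": "operational_control.execution_blocked",  # hypothetical ID
        "actor": actor,
        "control_key": control_key,
        # Attempts without a concrete workspace are platform-plane events.
        "plane": "platform" if workspace_id is None else "workspace",
        "requested_workspace_id": workspace_id,
        "reason": reason,
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
    })
    return StartResult(started=False, paused_reason=reason)
```

Passing `create_run` as a callable makes the FR-006 guarantee easy to test: a blocked start can assert that the callable was never invoked.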

## UI Action Matrix *(mandatory when Filament is changed)*

If this feature adds/modifies any Filament Resource / RelationManager / Page, fill out the matrix below.

For each surface, list the exact action labels, whether they are destructive (confirmation? typed confirmation?),
RBAC gating (capability + enforcement helper), whether the mutation writes an audit log, and any exemption or exception used.

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| System ops controls surface | app/Filament/System/Pages/Ops/Controls.php | `Pause control`, `Resume control`, `Edit scope` (scope-impact preview + confirmation required for state changes) | Same-page control cards or modals; no row-click model | none beyond card actions | none | none in v1 | same-page actions only | `Review impact`, `Save changes`, `Cancel` in staged modal forms | yes | Management is platform-plane only; global changes audit as platform-plane events without workspace/tenant ownership; system users missing `platform.ops.controls.manage` receive 403 before page content renders |
| System runbooks launcher | app/Filament/System/Pages/Ops/Runbooks.php | `Preflight`, `Run...` | Same-page action modal and `View run` toast action | none | none | none | none | `Run`, `Cancel` in modal | yes | Existing start UX retained; blocked execution decisions are always audited |
| Findings list page | app/Filament/Resources/FindingResource/Pages/ListFindings.php | `Backfill findings lifecycle` (confirmation required) | Existing findings inspection model unchanged | unchanged | unchanged | unchanged | unchanged | N/A | yes | In-scope change replaces env gating with control evaluation and blocked execution audit |
| Restore run resource | app/Filament/Resources/RestoreRunResource.php | `New restore run` | Existing clickable-row/resource inspection model unchanged | existing row actions unchanged | existing grouped maintenance actions unchanged | existing empty-state CTA unchanged | existing view header unchanged | `Create restore run`, `Cancel` plus existing safety steps | yes | In-scope change gates only the `Execute restore` step inside the create flow; draft/setup labels and existing row/view actions remain unchanged |

### Key Entities *(include if feature involves data)*

- **Operational Control Definition**: The bounded catalog entry that identifies one controllable risky action, its canonical key, operator label, supported scopes, and default behavior.
- **Operational Control Activation**: The runtime safety record that pauses a control for either all workspaces or one specific workspace, including reason, optional expiry, and an owner display that resolves to the last mutating actor (`updated_by` when present, otherwise `created_by`).
- **Operational Control Decision**: The derived evaluation result returned to affected surfaces and service boundaries, including effective state, matched scope, reason, and whether execution may proceed.
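
As a rough illustration of the three entities, the sketch below models them as plain records. Field names, the `default_paused` flag, the scope names, and the assumption that both keys support both scopes are hypothetical; only the two control keys, their operator labels, and the `updated_by`/`created_by` owner rule come from this spec.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ControlDefinition:
    key: str
    label: str
    supported_scopes: tuple
    default_paused: bool = False  # assumption: controls default to enabled


# Two-key first-slice catalog.
CATALOG = {
    d.key: d
    for d in (
        ControlDefinition(
            "findings.lifecycle.backfill",
            "Backfill findings lifecycle",
            ("global", "workspace"),
        ),
        ControlDefinition(
            "restore.execute",
            "Restore execution",
            ("global", "workspace"),
        ),
    )
}


@dataclass(frozen=True)
class ControlActivation:
    control_key: str
    workspace_id: Optional[int]  # None pauses the control for all workspaces
    reason: str
    created_by: str
    updated_by: Optional[str] = None
    expires_at: Optional[datetime] = None

    @property
    def owner_display(self) -> str:
        """Last mutating actor: updated_by when present, else created_by."""
        return self.updated_by if self.updated_by is not None else self.created_by


@dataclass(frozen=True)
class ControlDecision:
    effective_state: str          # "enabled" or "paused"
    matched_scope: Optional[str]  # "global", "workspace", or None
    reason: Optional[str]
    may_proceed: bool
```

Keeping the decision as a derived value rather than a stored record matches the entity description: only definitions and activations carry state, and the decision is recomputed at each surface and service boundary.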

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: In timed manual smoke, platform operators can pause or resume either first-slice control from the system plane in under 1 minute without editing environment variables, code, or database rows manually.
- **SC-002**: In blocked validation scenarios, 100% of attempted in-scope starts create no new execution run and no outbound provider-backed execution for the targeted scope.
- **SC-003**: In validation scenarios covering the affected surfaces, entitled users see explicit paused-state feedback on the first attempt in 100% of cases, while non-members and non-entitled users still receive 404 and members lacking the underlying capability still receive 403, as defined by RBAC rules.
- **SC-004**: Workspace-scoped activation affects only the targeted workspace in validation scenarios and leaves at least one non-targeted workspace unaffected for the same control.