TenantAtlas/specs/113-platform-ops-runbooks/spec.md
ahmido 200498fa8e feat(113): Platform Ops Runbooks — UX Polish (Filament-native, system theme, live scope) (#137)
## Summary

Implements and polishes the Platform Ops Runbooks feature (Spec 113) — the operator control plane for safe backfills and data repair from `/system`.

## Changes

### UX Polish (Phase 7 — US4)
- **Filament-native components**: Rewrote `runbooks.blade.php` and `view-run.blade.php` using `<x-filament::section>` instead of raw Tailwind div cards. Cards now render correctly with Filament's built-in borders, shadows and dark mode.
- **System panel theme**: Created `resources/css/filament/system/theme.css` and registered `->viteTheme()` on `SystemPanelProvider`. The system panel previously had no theme CSS registered — Tailwind utility classes weren't compiled for its views, causing the warning icon SVG to expand to full container size.
- **Live scope selector**: Added `->live()` to the scope `Radio` field so "Single tenant" immediately reveals the tenant search dropdown without requiring a Submit first.

### Core Feature (Phases 1–6, previously shipped)
- `/system/ops/runbooks` — runbook catalog, preflight, run with typed confirmation + reason
- `/system/ops/runs` — run history table with status/outcome badges
- `/system/ops/runs/{id}` — run detail view with summary counts, failures, collapsible context
- `FindingsLifecycleBackfillRunbookService` — preflight + execution logic
- `AllowedTenantUniverse` — scopes tenant picker to non-platform tenants only
- RBAC: `platform.ops.view`, `platform.runbooks.view`, `platform.runbooks.run`, `platform.runbooks.findings.lifecycle_backfill`
- Rate-limited `/system/login` (10/min per IP+username)
- Distinct session cookie for `/system` isolation

## Test Coverage
- 16 tests / 141 assertions — all passing
- Covers: page access, RBAC, preflight, run dispatch, scope selector, run detail, run list

## Checklist
- [x] Filament v5 / Livewire v4 compliant
- [x] Provider registered in `bootstrap/providers.php`
- [x] Destructive actions require confirmation (`->requiresConfirmation()`)
- [x] System panel theme registered (`viteTheme`)
- [x] Pint clean
- [x] Tests pass

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #137
2026-02-27 01:11:25 +00:00


# Feature Specification: Platform Ops Runbooks (Operator Control Plane) for Backfills & Data Repair
**Feature Branch**: `[113-platform-ops-runbooks]`
**Created**: 2026-02-26
**Status**: Draft
**Input**: Operator control plane runbooks for safe backfills and data repair; deploy-time automatic execution; operator re-run via `/system`; never exposed in customer UI.
## Clarifications
### Session 2026-02-26
- Q: `/system` Session Isolation Strategy (v1) → A: B — Use a distinct session cookie name/config for `/system`.
- Q: `OperationRun.type` for the findings lifecycle backfill runbook → A: Use `findings.lifecycle.backfill` (consistent with the operation catalog). Runbook trigger is exclusive to `/system`; any `/admin` trigger is removed / feature-flagged off.
- Q: v1 scope selector for running the runbook → A: All tenants (default) + Single tenant (picker).
- Q: Failure notification delivery (v1) → A: Deliver via existing alert destinations (Teams webhook / Email) when configured, and always notify the initiating platform operator in-app.
- Q: `/system/login` rate limiting policy (v1) → A: 10/min per IP + username (combined key).
- Q: Platform “allowed tenant universe” (v1) → A: All non-platform tenants present in the database (`tenants.external_id != 'platform'`). The System UI must not allow selecting or targeting the platform tenant.
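
The "allowed tenant universe" answer above can be illustrated with a minimal, language-neutral sketch in Python. The `Tenant` shape and both function names are assumptions made for illustration; only the `external_id != 'platform'` rule comes from the clarification.

```python
from dataclasses import dataclass

PLATFORM_EXTERNAL_ID = "platform"  # value taken from the clarification above


@dataclass(frozen=True)
class Tenant:
    # Hypothetical shape; the real model may carry more fields.
    id: int
    external_id: str
    name: str


def allowed_tenant_universe(tenants):
    """Return every tenant an operator may target: all non-platform tenants."""
    return [t for t in tenants if t.external_id != PLATFORM_EXTERNAL_ID]


def resolve_target(tenants, tenant_id):
    """Look up a single-tenant target; the platform tenant is never selectable."""
    for tenant in allowed_tenant_universe(tenants):
        if tenant.id == tenant_id:
            return tenant
    raise LookupError(f"tenant {tenant_id} is not in the allowed tenant universe")
```

Resolving the target through the same filter that feeds the picker keeps a hand-crafted request from reaching the platform tenant even if the UI never offers it.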
## Spec Scope Fields *(mandatory)*
- **Scope**: canonical-view (platform control plane)
- **Primary Routes**:
  - `/system/ops/runbooks` (runbook catalog + preflight + run)
  - `/system/ops/runs` (run history + run details)
  - `/admin/*` (explicitly remove any maintenance/backfill affordances)
- **Data Ownership**:
  - Tenant-owned customer data that may be modified by runbooks (e.g., “findings” lifecycle/workflow fields)
  - Platform-owned operational records (operation runs, audit events, operator notifications)
- **RBAC**:
  - Platform identity only (separate from tenant users)
  - Capabilities (v1 minimum): `platform.ops.view`, `platform.runbooks.view`, `platform.runbooks.run`
  - Optional granular capability for this runbook: `platform.runbooks.findings.lifecycle_backfill`
For canonical-view specs, the spec MUST define:
- **Default filter behavior when tenant-context is active**: the runbook defaults to **All tenants** scope; if a tenant is explicitly selected, all counts/changes MUST be limited to that tenant only.
- **Explicit entitlement checks preventing cross-tenant leakage**: a tenant-context user MUST NOT be able to access `/system/*` (deny-as-not-found). Platform operators MUST only be able to target tenants within the platform's allowed tenant universe.
## User Scenarios & Testing *(mandatory)*
### User Story 1 - Operator runs a runbook safely (Priority: P1)
As a platform operator, I can run a predefined “Rebuild Findings Lifecycle” runbook from `/system` with a clear preflight, explicit confirmation, and an audited, trackable run record.
**Why this priority**: This is the primary operator workflow that eliminates the need for SSH/manual scripts and reduces risk for customer-impacting data changes.
**Independent Test**: Fully testable by visiting `/system/ops/runbooks`, running preflight, starting a run, and verifying the run record + audit events exist.
**Acceptance Scenarios**:
1. **Given** an authorized platform operator, **When** they open `/system/ops/runbooks`, **Then** they see the runbook catalog including “Rebuild Findings Lifecycle” and an operator warning that actions may modify customer data.
2. **Given** preflight reports `affected_count > 0`, **When** the operator confirms the run, **Then** a new operation run is created and the UI links to “View run”.
3. **Given** preflight reports `affected_count = 0`, **When** the operator attempts to run, **Then** the run action is disabled with a clear “Nothing to do” explanation.
4. **Given** the operator chooses “All tenants”, **When** they confirm, **Then** typed confirmation is required (e.g., entering `BACKFILL`) and a reason is required.
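
Scenarios 2 to 4 describe a gate in front of the run action. A minimal sketch of that gate in Python follows, assuming a hypothetical `RunRequest` shape and scope labels; only the `BACKFILL` phrase, the "Nothing to do" wording, and the reason requirement come from the scenarios.

```python
from dataclasses import dataclass
from typing import Optional

CONFIRMATION_PHRASE = "BACKFILL"  # example phrase from scenario 4


@dataclass(frozen=True)
class RunRequest:
    scope: str                                # "all_tenants" or "single_tenant" (hypothetical labels)
    affected_count: int                       # taken from the preflight result
    typed_confirmation: Optional[str] = None
    reason_text: Optional[str] = None


def run_blockers(req: RunRequest) -> list:
    """Return the reasons the run may not start; an empty list means it may."""
    blockers = []
    if req.affected_count <= 0:
        blockers.append("Nothing to do")
    if req.scope == "all_tenants":
        # All-tenants runs need both the typed phrase and a reason.
        if req.typed_confirmation != CONFIRMATION_PHRASE:
            blockers.append(f"Type {CONFIRMATION_PHRASE} to confirm")
        if not (req.reason_text or "").strip():
            blockers.append("A reason is required")
    return blockers
```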
---
### User Story 2 - Customers never see maintenance actions (Priority: P1)
As a tenant (customer) user, I never see backfill/repair buttons and cannot access the operator control plane.
**Why this priority**: Exposing maintenance controls in customer UI is an enterprise anti-pattern and undermines product trust.
**Independent Test**: Fully testable by checking `/admin` UI surfaces and attempting direct navigation to `/system/*` as a tenant user.
**Acceptance Scenarios**:
1. **Given** a tenant user session, **When** the user requests `/system/ops/runbooks`, **Then** the response is **404** (deny-as-not-found).
2. **Given** production-like configuration, **When** a tenant user views relevant `/admin` screens, **Then** there is no maintenance/backfill/repair UI.
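
The deny-as-not-found rule in scenario 1, combined with the 403 case this spec defines for platform users who lack a capability, reduces to a small decision function. The sketch below is illustrative Python; the `Actor` shape and the plane labels are assumptions, and the handling of unauthenticated requests is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Actor:
    plane: str                                              # "platform" or "tenant" (hypothetical labels)
    capabilities: frozenset = field(default_factory=frozenset)


def system_route_status(actor: Actor, required_capability: str) -> int:
    """Pick the HTTP status for a request to a /system route."""
    if actor.plane != "platform":
        # Wrong plane: answer as if the route did not exist (anti-enumeration).
        return 404
    if required_capability not in actor.capabilities:
        # Known platform user without the capability: forbidden.
        return 403
    return 200
```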
---
### User Story 3 - Same logic for deploy-time and operator re-run (Priority: P2)
As a platform operator or a deploy pipeline, I can execute the same runbook logic, so that deploy-time backfills run automatically and manual re-runs remain available and safe.
**Why this priority**: A single execution path reduces drift between “what deploy does” and “what operators can re-run”, and improves reliability.
**Independent Test**: Fully testable by running the operation twice and verifying idempotency and consistent preflight/run results for the same scope.
**Acceptance Scenarios**:
1. **Given** the runbook was executed once successfully, **When** it is executed again with the same scope, **Then** the second run reports `updated_count = 0` (idempotent behavior).
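
The idempotency scenario can be sketched as follows. This is illustrative Python, not the project's implementation: the rule for what "needs backfill" means (a missing `status`) and the default value written are assumptions chosen only to make the example concrete.

```python
def needs_backfill(finding: dict) -> bool:
    # Hypothetical rule: a finding needs repair when its lifecycle status is missing.
    return finding.get("status") is None


def preflight(findings) -> dict:
    """Read-only count of the records a run would change."""
    return {"affected_count": sum(1 for f in findings if needs_backfill(f))}


def run_backfill(findings) -> dict:
    """Write only to records that still need it, so a re-run changes nothing."""
    updated = skipped = 0
    for finding in findings:
        if needs_backfill(finding):
            finding["status"] = "new"  # hypothetical default value
            updated += 1
        else:
            skipped += 1
    return {"updated_count": updated, "skipped_count": skipped}
```

Because preflight and run share the same predicate, a second run over the same scope sees nothing to repair and reports `updated_count = 0`.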
### Edge Cases
- Lock already held: another run is in-progress for the same scope (All tenants or the same tenant).
- Large dataset: preflight must remain fast enough for operator use; writes must be chunked to avoid long locks.
- Partial failure: some tenants/records fail while others succeed; run outcome and audit still record what happened.
- Missing reason: an All-tenants or break-glass run cannot start without a reason.
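
The "lock already held" case can be illustrated with a per-scope lock. The sketch below is an in-process Python example with hypothetical names; a real deployment would presumably need a lock shared across workers. It treats "All tenants" and each single tenant as independent scopes, since the spec does not say whether an all-tenants run should also block single-tenant runs.

```python
import threading
from contextlib import contextmanager


class ScopeLockHeld(Exception):
    """Raised when a run is already in progress for the requested scope."""


class ScopeLocks:
    def __init__(self):
        self._held = set()
        self._guard = threading.Lock()

    @contextmanager
    def acquire(self, key):
        # Fail fast instead of waiting: the operator should see the conflict.
        with self._guard:
            if key in self._held:
                raise ScopeLockHeld(key)
            self._held.add(key)
        try:
            yield
        finally:
            with self._guard:
                self._held.discard(key)


def scope_key(tenant_id=None):
    """Build the lock key for a scope (hypothetical key format)."""
    return "all_tenants" if tenant_id is None else f"tenant:{tenant_id}"
```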
## Requirements *(mandatory)*
### Constitution alignment notes
- **No customer-plane maintenance**: Any maintenance/backfill/repair affordance in `/admin` is explicitly out of scope for customer UX.
- **Run observability**: Customer-impacting writes MUST be executed as a tracked operation run with clear status/outcome and operator-facing surfaces.
- **Safety gates**: Preflight → explicit confirmation → audited execution is mandatory.
### Functional Requirements
- **FR-001 (Remove Customer Exposure)**: The system MUST NOT expose any backfill/repair controls in `/admin` in production-like environments. Any legacy `/admin` trigger for the findings lifecycle backfill MUST be removed or disabled (feature-flag off by default).
- **FR-002 (Runbook Catalog)**: The system MUST provide a `/system/ops/runbooks` catalog listing predefined runbooks and their descriptions.
- **FR-003 (Runbook: Rebuild Findings Lifecycle)**: The system MUST provide a runbook that supports:
  - Preflight (read-only) showing at least `affected_count`.
  - Run (write) that starts a tracked operation run and links to “View run”.
  - Scope selection: All tenants (default) and Single tenant (picker).
  - Safe confirmation: includes scope + preflight count + “modifies customer data” warning.
  - Typed confirmation for All-tenants scope (e.g., `BACKFILL`).
  - Run disabled when preflight indicates nothing to do.
- **FR-004 (Single Source of Truth)**: The system MUST implement the runbook logic once and reuse it across:
  - deploy-time execution (automation)
  - operator UI execution in `/system`

  The two paths MUST produce consistent results for the same scope.
- **FR-005 (Operation Run Tracking)**: Each run MUST create a run record including:
  - run type identifier: `findings.lifecycle.backfill`
  - scope (all tenants vs single tenant)
  - actor (platform user, including break-glass marker when applicable)
  - outcome/status transitions owned by the service layer
  - numeric summary counts using a centralized allow-list of keys
  - run context containing: `preflight.affected_count`, `updated_count`, `skipped_count`, `error_count`, and duration
- **FR-006 (Audit Events)**: The system MUST write audit events for start, completion, and failure. Audit writing MUST be fail-safe (audit failures do not crash the operation run).
- **FR-007 (Reasons for Sensitive Runs)**: All-tenants runs and break-glass runs MUST require a reason:
  - `reason_code`: one of `DATA_REPAIR`, `INCIDENT`, `SUPPORT`, `SECURITY`
  - `reason_text`: free text (max 500 characters)
- **FR-008 (Locking & Idempotency)**: The system MUST prevent concurrent runs for the same scope via locking and MUST be idempotent (a second execution does not re-write already-correct data).
- **FR-009 (Operator Notification on Failure)**: A failed run MUST notify operator targets with run type + scope + a link to “View run”. v1 delivery:
  - If alert destinations are configured, deliver via existing destinations (Teams webhook / Email).
  - Always notify the initiating platform operator in-app.

  Success notifications are optional and SHOULD be off by default.
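
The fail-safe audit requirement in FR-006 can be sketched as a wrapper that never lets an audit failure propagate into the run. The Python below is illustrative only; the function names, event labels, and outcome strings are assumptions, while the run type identifier comes from FR-005.

```python
import logging

logger = logging.getLogger("runbook.audit")  # hypothetical logger name


def write_audit_safely(writer, event, payload) -> bool:
    """Attempt an audit write; log a failure instead of raising."""
    try:
        writer(event, payload)
        return True
    except Exception:
        logger.exception("audit write failed for %s", event)
        return False


def execute_run(work, audit_writer, run_type="findings.lifecycle.backfill"):
    """Run `work` and audit start, completion, and failure without crashing on audit errors."""
    write_audit_safely(audit_writer, "started", {"type": run_type})
    try:
        counts = work()
    except Exception as exc:
        write_audit_safely(audit_writer, "failed", {"type": run_type, "error": str(exc)})
        return {"outcome": "failed", "error": str(exc)}
    write_audit_safely(audit_writer, "completed", {"type": run_type, **counts})
    return {"outcome": "succeeded", **counts}
```

With this shape, a broken audit store degrades to a logged error while the run outcome is still recorded from the work itself.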
### Security & Non-Functional Requirements
- **SR-001 (Control Plane Isolation)**: `/system` MUST be isolated to platform identity and MUST deny tenant-plane access as **404** (anti-enumeration).
- **SR-002 (404 vs 403 Semantics)**:
  - Non-platform users or wrong plane → **404**
  - Platform user lacking required capability → **403**
- **SR-003 (Login Throttling)**: The `/system/login` surface MUST be rate limited at **10/min per IP + username (combined key)** and failed login attempts MUST be audited.
- **SR-004 (Session Isolation Strategy)**: v1 MUST isolate control plane sessions from tenant sessions by using a distinct session cookie name/config for `/system` (same domain). A dedicated subdomain with separate cookie scope may be introduced later.
- **SR-005 (Break-glass Visibility & Enforcement)**: Break-glass mode MUST be visually obvious and MUST require a reason; break-glass usage MUST be recorded on the run and in audit.
- **NFR-001 (Performance & Safety)**:
  - Preflight MUST be read-only and cheap enough for interactive use.
  - Writes MUST be chunked and resilient to partial failures.
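
The chunking and partial-failure requirement in NFR-001 can be sketched as below. This is a minimal Python illustration under stated assumptions: `repair` is a hypothetical callback that returns whether it changed the record, the default chunk size is arbitrary, and the count keys mirror the run context keys listed in FR-005.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def apply_in_chunks(records, repair, chunk_size=500):
    """Apply `repair` batch by batch; one bad record never aborts the run."""
    counts = {"updated_count": 0, "skipped_count": 0, "error_count": 0}
    failures = []
    for batch in chunked(records, chunk_size):
        # A real implementation would presumably commit once per batch to keep locks short.
        for record in batch:
            try:
                changed = repair(record)
            except Exception as exc:
                counts["error_count"] += 1
                failures.append({"record": record, "error": str(exc)})
                continue
            counts["updated_count" if changed else "skipped_count"] += 1
    return counts, failures
```

Returning the failures alongside the counts lets the run detail view show what failed even when most records succeeded.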
## UI Action Matrix *(mandatory when Filament is changed)*
| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Runbooks | `/system/ops/runbooks` | `Preflight` (read-only), `Run…` (write, confirm) | N/A | `View run` (after start) | None | None | N/A | N/A | Yes | `Run…` requires confirmation; typed confirm + reason required for All tenants. |
| Operation Runs | `/system/ops/runs` | N/A | List links to run detail (“View run”) | `View` | None | None | N/A | N/A | Yes | Run detail includes scope, actor, counts, outcome/status. |
### Key Entities *(include if feature involves data)*
- **Runbook**: A predefined operator action with preflight and run behavior.
- **Operation Run**: A tracked execution record storing scope, actor, status/outcome, and summary counts.
- **Audit Event**: Immutable security/ops log entries for preflight/run lifecycle.
- **Operator Notification**: A delivery record/target for failure alerts.
- **Finding**: Tenant-owned record whose lifecycle/workflow fields may be backfilled.
### User Story 4 - Enterprise-grade UX polish for Ops surfaces (Priority: P2)
As a platform operator, I want the Ops surfaces to look and feel enterprise-grade, with proper visual hierarchy, alert banners, structured card layouts, badge indicators, and metadata, so that I can quickly assess system state.
**Why this priority**: Operator trust and efficiency depend on clear, scannable UI. Raw text and flat layouts slow down triage.
**Acceptance Scenarios**:
1. **Given** the operator opens `/system/ops/runbooks`, **Then** the operator warning is rendered as a styled alert banner with icon (not plain text).
2. **Given** runbooks are listed, **Then** each runbook is rendered as a structured card with title, description, scope badge, and "Last run" metadata when available.
3. **Given** preflight results are displayed, **Then** stat values use consistent stat-card styling with labels and prominent values.
4. **Given** the operator opens `/system/ops/runs/{id}`, **Then** status and outcome are rendered as colored badges (consistent with the existing BadgeRenderer), and scope is shown as a badge/tag.
5. **Given** the run detail page, **Then** summary counts are rendered as a labeled grid (not only raw JSON).
### Functional Requirements (UX Polish)
- **FR-010 (Operator Warning Banner)**: The operator warning on `/system/ops/runbooks` MUST be rendered as a visually distinct alert banner with an `exclamation-triangle` icon, amber/warning coloring, and clear heading — matching project alert patterns.
- **FR-011 (Runbook Card Layout)**: Each runbook MUST be rendered as a card with: title (semibold), description, scope badge (e.g., "All tenants"), and optional "Last run" timestamp + status badge when a previous run exists.
- **FR-012 (Preflight Stat Cards)**: Preflight result values (affected, total scanned, estimated tenants) MUST be rendered in visually prominent stat cards with labeled headers.
- **FR-013 (Run Detail Badges)**: Status and outcome on run detail pages MUST use the existing `BadgeRenderer` / `BadgeCatalog` system for colored badges with icons.
- **FR-014 (Run Detail Summary Grid)**: Summary counts on run detail MUST be rendered as a labeled key-value grid, not a raw JSON dump (JSON viewer remains available as a disclosure fallback).
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: In production-like environments, customers have **zero** UI affordances to trigger backfills/repairs in `/admin`.
- **SC-002**: A platform operator can start a runbook without SSH and reach "View run" in **≤ 3 user interactions** from `/system/ops/runbooks`.
- **SC-003**: 100% of run attempts result in an operation run record and start/completion/failure audit events (with failure still recorded even if notifications fail).
- **SC-004**: Re-running the same runbook on the same scope after completion results in `updated_count = 0` (idempotency).
- **SC-005**: Operator warning on runbooks page renders as a styled alert banner (not plain text).