TenantAtlas/specs/113-platform-ops-runbooks/spec.md
ahmido 200498fa8e feat(113): Platform Ops Runbooks — UX Polish (Filament-native, system theme, live scope) (#137)
## Summary

Implements and polishes the Platform Ops Runbooks feature (Spec 113) — the operator control plane for safe backfills and data repair from `/system`.

## Changes

### UX Polish (Phase 7 — US4)
- **Filament-native components**: Rewrote `runbooks.blade.php` and `view-run.blade.php` using `<x-filament::section>` instead of raw Tailwind div cards. Cards now render correctly with Filament's built-in borders, shadows and dark mode.
- **System panel theme**: Created `resources/css/filament/system/theme.css` and registered `->viteTheme()` on `SystemPanelProvider`. The system panel previously had no theme CSS registered — Tailwind utility classes weren't compiled for its views, causing the warning icon SVG to expand to full container size.
- **Live scope selector**: Added `->live()` to the scope `Radio` field so "Single tenant" immediately reveals the tenant search dropdown without requiring a Submit first.

### Core Feature (Phases 1–6, previously shipped)
- `/system/ops/runbooks` — runbook catalog, preflight, run with typed confirmation + reason
- `/system/ops/runs` — run history table with status/outcome badges
- `/system/ops/runs/{id}` — run detail view with summary counts, failures, collapsible context
- `FindingsLifecycleBackfillRunbookService` — preflight + execution logic
- AllowedTenantUniverse — scopes tenant picker to non-platform tenants only
- RBAC: `platform.ops.view`, `platform.runbooks.view`, `platform.runbooks.run`, `platform.runbooks.findings.lifecycle_backfill`
- Rate-limited `/system/login` (10/min per IP+username)
- Distinct session cookie for `/system` isolation

## Test Coverage
- 16 tests / 141 assertions — all passing
- Covers: page access, RBAC, preflight, run dispatch, scope selector, run detail, run list

## Checklist
- [x] Filament v5 / Livewire v4 compliant
- [x] Provider registered in `bootstrap/providers.php`
- [x] Destructive actions require confirmation (`->requiresConfirmation()`)
- [x] System panel theme registered (`viteTheme`)
- [x] Pint clean
- [x] Tests pass

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #137
2026-02-27 01:11:25 +00:00


Feature Specification: Platform Ops Runbooks (Operator Control Plane) for Backfills & Data Repair

Feature Branch: [113-platform-ops-runbooks]
Created: 2026-02-26
Status: Draft
Input: Operator control plane runbooks for safe backfills and data repair; deploy-time automatic execution; operator re-run via /system; never exposed in customer UI.

Clarifications

Session 2026-02-26

  • Q: /system Session Isolation Strategy (v1) → A: B — Use a distinct session cookie name/config for /system.
  • Q: OperationRun.type for the findings lifecycle backfill runbook → A: Use findings.lifecycle.backfill (consistent with the operation catalog). Runbook trigger is exclusive to /system; any /admin trigger is removed / feature-flagged off.
  • Q: v1 scope selector for running the runbook → A: All tenants (default) + Single tenant (picker).
  • Q: Failure notification delivery (v1) → A: Deliver via existing alert destinations (Teams webhook / Email) when configured, and always notify the initiating platform operator in-app.
  • Q: /system/login rate limiting policy (v1) → A: 10/min per IP + username (combined key).
  • Q: Platform “allowed tenant universe” (v1) → A: All non-platform tenants present in the database (tenants.external_id != 'platform'). The System UI must not allow selecting or targeting the platform tenant.
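The /system/login throttling answer above (10/min per IP + username, combined key) could be sketched as a fixed-window counter. This is a minimal illustration, not the project's implementation: the `LoginThrottle` class, its `allow` method, the injectable clock, and the lowercase normalization of the username are all assumptions made for the example.

```python
import time


class LoginThrottle:
    """Fixed-window limiter keyed on IP + username (hypothetical helper)."""

    def __init__(self, max_attempts=10, window_seconds=60, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows = {}  # key -> (window_start, attempts_in_window)

    @staticmethod
    def key(ip, username):
        # Combined key: the same username from another IP gets its own budget.
        # Normalizing the username is an assumption, not part of the spec.
        return f"{ip}|{username.strip().lower()}"

    def allow(self, ip, username):
        now = self.clock()
        k = self.key(ip, username)
        start, count = self._windows.get(k, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the previous window expired
        if count >= self.max_attempts:
            self._windows[k] = (start, count)
            return False
        self._windows[k] = (start, count + 1)
        return True
```

Keying on the pair rather than on the IP alone means one noisy username does not lock out other operators behind the same address.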

Spec Scope Fields (mandatory)

  • Scope: canonical-view (platform control plane)
  • Primary Routes:
    • /system/ops/runbooks (runbook catalog + preflight + run)
    • /system/ops/runs (run history + run details)
    • /admin/* (explicitly remove any maintenance/backfill affordances)
  • Data Ownership:
    • Tenant-owned customer data that may be modified by runbooks (e.g., “findings” lifecycle/workflow fields)
    • Platform-owned operational records (operation runs, audit events, operator notifications)
  • RBAC:
    • Platform identity only (separate from tenant users)
    • Capabilities (v1 minimum): platform.ops.view, platform.runbooks.view, platform.runbooks.run
    • Optional granular capability for this runbook: platform.runbooks.findings.lifecycle_backfill

For canonical-view specs, the spec MUST define:

  • Default filter behavior when tenant-context is active: the runbook defaults to All tenants scope; if a tenant is explicitly selected, all counts/changes MUST be limited to that tenant only.
  • Explicit entitlement checks preventing cross-tenant leakage: a tenant-context user MUST NOT be able to access /system/* (deny-as-not-found). Platform operators MUST only be able to target tenants within the platform's allowed tenant universe.
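The scope rules above, combined with the allowed tenant universe from the clarifications (tenants.external_id != 'platform'), can be illustrated with a small sketch. The `Tenant` dataclass and the `resolve_scope` helper are hypothetical names chosen for the example; only the filter condition and the All tenants default come from this spec.

```python
from dataclasses import dataclass

PLATFORM_EXTERNAL_ID = "platform"


@dataclass(frozen=True)
class Tenant:
    id: int
    external_id: str


def allowed_tenant_universe(tenants):
    """All non-platform tenants (external_id != 'platform')."""
    return [t for t in tenants if t.external_id != PLATFORM_EXTERNAL_ID]


def resolve_scope(tenants, tenant_id=None):
    """Return the tenants a run may touch.

    tenant_id=None is the default "All tenants" scope. Otherwise the result
    is limited to that single tenant, which must be inside the universe.
    """
    universe = allowed_tenant_universe(tenants)
    if tenant_id is None:
        return universe
    selected = [t for t in universe if t.id == tenant_id]
    if not selected:
        raise LookupError("tenant is outside the allowed tenant universe")
    return selected
```

Resolving a single tenant through the universe, rather than looking it up directly, keeps the platform tenant untargetable even if its id is submitted by hand.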

User Scenarios & Testing (mandatory)

User Story 1 - Operator runs a runbook safely (Priority: P1)

As a platform operator, I can run a predefined “Rebuild Findings Lifecycle” runbook from /system with a clear preflight, explicit confirmation, and an audited, trackable run record.

Why this priority: This is the primary operator workflow that eliminates the need for SSH/manual scripts and reduces risk for customer-impacting data changes.

Independent Test: Fully testable by visiting /system/ops/runbooks, running preflight, starting a run, and verifying the run record + audit events exist.

Acceptance Scenarios:

  1. Given an authorized platform operator, When they open /system/ops/runbooks, Then they see the runbook catalog including “Rebuild Findings Lifecycle” and an operator warning that actions may modify customer data.
  2. Given preflight reports affected_count > 0, When the operator confirms the run, Then a new operation run is created and the UI links to “View run”.
  3. Given preflight reports affected_count = 0, When the operator attempts to run, Then the run action is disabled with a clear “Nothing to do” explanation.
  4. Given the operator chooses “All tenants”, When they confirm, Then typed confirmation is required (e.g., entering BACKFILL) and a reason is required.
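Scenarios 2 to 4 amount to a gate that decides whether the run action may start. A minimal sketch follows, assuming scope is passed as the strings "all_tenants" or "single_tenant" and that the function name `run_blockers` is free to choose; the confirmation word, the reason codes, and the 500-character limit are taken from this spec. Break-glass handling is left out of the sketch.

```python
REASON_CODES = {"DATA_REPAIR", "INCIDENT", "SUPPORT", "SECURITY"}
CONFIRM_WORD = "BACKFILL"
MAX_REASON_LENGTH = 500


def run_blockers(affected_count, scope, typed_confirmation="",
                 reason_code=None, reason_text=""):
    """Return the reasons a run may not start; an empty list means it may."""
    blockers = []
    if affected_count <= 0:
        # Preflight found nothing to change, so the run action stays disabled.
        blockers.append("Nothing to do")
    if scope == "all_tenants":
        if typed_confirmation != CONFIRM_WORD:
            blockers.append("Typed confirmation required")
        if reason_code not in REASON_CODES:
            blockers.append("Reason required")
        elif len(reason_text) > MAX_REASON_LENGTH:
            blockers.append("Reason text too long")
    return blockers
```

Returning every blocker at once, instead of failing on the first, lets the UI explain all missing inputs in a single pass.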

User Story 2 - Customers never see maintenance actions (Priority: P1)

As a tenant (customer) user, I never see backfill/repair buttons and cannot access the operator control plane.

Why this priority: Exposing maintenance controls in customer UI is an enterprise anti-pattern and undermines product trust.

Independent Test: Fully testable by checking /admin UI surfaces and attempting direct navigation to /system/* as a tenant user.

Acceptance Scenarios:

  1. Given a tenant user session, When the user requests /system/ops/runbooks, Then the response is 404 (deny-as-not-found).
  2. Given production-like configuration, When a tenant user views relevant /admin screens, Then there is no maintenance/backfill/repair UI.

User Story 3 - Same logic for deploy-time and operator re-run (Priority: P2)

As a platform operator or a deploy pipeline, I can execute the same runbook logic consistently, so that deploy-time backfills are automatic and manual re-runs remain available and safe.

Why this priority: A single execution path reduces drift between “what deploy does” and “what operators can re-run”, and improves reliability.

Independent Test: Fully testable by running the operation twice and verifying idempotency and consistent preflight/run results for the same scope.

Acceptance Scenarios:

  1. Given the runbook was executed once successfully, When it is executed again with the same scope, Then the second run reports updated_count = 0 (idempotent behavior).
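The idempotency expectation can be shown with a toy backfill that writes only where the stored value differs from the derived one. The record shape (a dict with `resolved_at` and `status`) and the `derive_status` callback are invented for the illustration; the real lifecycle fields are not specified here.

```python
def backfill(findings, derive_status):
    """Write the derived status only where it differs; return updated_count."""
    updated_count = 0
    for finding in findings:
        expected = derive_status(finding)
        if finding.get("status") != expected:
            finding["status"] = expected  # the only write path
            updated_count += 1
    return updated_count


findings = [
    {"resolved_at": None},
    {"resolved_at": "2026-02-01", "status": "resolved"},
    {"resolved_at": "2026-02-02"},
]


def derive(finding):
    return "resolved" if finding["resolved_at"] else "open"


first = backfill(findings, derive)
second = backfill(findings, derive)
print(first, second)  # prints "2 0"
```

Because the comparison happens before every write, a second execution over already-correct data touches nothing and reports updated_count = 0.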

Edge Cases

  • Lock already held: another run is in-progress for the same scope (All tenants or the same tenant).
  • Large dataset: preflight must remain fast enough for operator use; writes must be chunked to avoid long locks.
  • Partial failure: some tenants/records fail while others succeed; run outcome and audit still record what happened.
  • Missing reason: an All-tenants or break-glass run cannot start without a reason.
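The first three edge cases can be sketched together: a per-scope lock that rejects a concurrent run, and a chunked write loop that records failures without aborting. This is an in-process illustration only; a real deployment would need a lock shared across workers (for example in the database or cache), and `scope_lock`, `run_chunked`, and the `commit` callback are hypothetical names.

```python
import threading
from contextlib import contextmanager

_held_scopes = set()
_guard = threading.Lock()


class ScopeLocked(Exception):
    """Raised when a run for the same scope is already in progress."""


@contextmanager
def scope_lock(scope_key):
    with _guard:
        if scope_key in _held_scopes:
            raise ScopeLocked(scope_key)
        _held_scopes.add(scope_key)
    try:
        yield
    finally:
        with _guard:
            _held_scopes.discard(scope_key)


def run_chunked(records, apply, chunk_size=500, commit=lambda: None):
    """Apply writes chunk by chunk; a failing record does not stop the run."""
    counts = {"updated_count": 0, "error_count": 0}
    for start in range(0, len(records), chunk_size):
        for record in records[start:start + chunk_size]:
            try:
                apply(record)
                counts["updated_count"] += 1
            except Exception:
                counts["error_count"] += 1  # record the failure, keep going
        commit()  # end of chunk: a place to release whatever the batch held
    return counts
```

Committing once per chunk bounds how long any write lock is held, and the counts returned at the end are what the run outcome and audit trail would report for a partial failure.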

Requirements (mandatory)

Constitution alignment notes

  • No customer-plane maintenance: Any maintenance/backfill/repair affordance in /admin is explicitly out of scope for customer UX.
  • Run observability: Customer-impacting writes MUST be executed as a tracked operation run with clear status/outcome and operator-facing surfaces.
  • Safety gates: Preflight → explicit confirmation → audited execution is mandatory.

Functional Requirements

  • FR-001 (Remove Customer Exposure): The system MUST NOT expose any backfill/repair controls in /admin in production-like environments. Any legacy /admin trigger for the findings lifecycle backfill MUST be removed or disabled (feature-flag off by default).
  • FR-002 (Runbook Catalog): The system MUST provide a /system/ops/runbooks catalog listing predefined runbooks and their descriptions.
  • FR-003 (Runbook: Rebuild Findings Lifecycle): The system MUST provide a runbook that supports:
    • Preflight (read-only) showing at least affected_count.
    • Run (write) that starts a tracked operation run and links to “View run”.
    • Scope selection: All tenants (default) and Single tenant (picker).
    • Safe confirmation: includes scope + preflight count + “modifies customer data” warning.
    • Typed confirmation for All-tenants scope (e.g., BACKFILL).
    • Run disabled when preflight indicates nothing to do.
  • FR-004 (Single Source of Truth): The system MUST implement the runbook logic once and reuse it across:
    • deploy-time execution (automation)
    • operator UI execution in /system
    The two paths MUST produce consistent results for the same scope.
  • FR-005 (Operation Run Tracking): Each run MUST create a run record including:
    • run type identifier: findings.lifecycle.backfill
    • scope (all tenants vs single tenant)
    • actor (platform user, including break-glass marker when applicable)
    • outcome/status transitions owned by the service layer
    • numeric summary counts using a centralized allow-list of keys
    • run context containing: preflight.affected_count, updated_count, skipped_count, error_count, and duration
  • FR-006 (Audit Events): The system MUST write audit events for start, completion, and failure. Audit writing MUST be fail-safe (audit failures do not crash the operation run).
  • FR-007 (Reasons for Sensitive Runs): All-tenants runs and break-glass runs MUST require a reason:
    • reason_code: one of DATA_REPAIR, INCIDENT, SUPPORT, SECURITY
    • reason_text: free text (max 500 characters)
  • FR-008 (Locking & Idempotency): The system MUST prevent concurrent runs for the same scope via locking and MUST be idempotent (a second execution does not re-write already-correct data).
  • FR-009 (Operator Notification on Failure): A failed run MUST notify operator targets with run type + scope + a link to “View run”. v1 delivery:
    • If alert destinations are configured, deliver via existing destinations (Teams webhook / Email).
    • Always notify the initiating platform operator in-app.
    Success notifications are optional and SHOULD be off by default.
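Two of the requirements above, the allow-list for summary counts (FR-005) and fail-safe audit writing (FR-006), lend themselves to a short sketch. The concrete key list below is an assumption based on the counts named in FR-005, and `summarize` and `audit_safely` are hypothetical helper names.

```python
import logging

# Assumed allow-list; the spec only says the list is centralized.
ALLOWED_SUMMARY_KEYS = ("affected_count", "updated_count", "skipped_count", "error_count")


def summarize(raw_counts):
    """Keep only allow-listed keys whose values are plain integers."""
    return {
        key: value
        for key, value in raw_counts.items()
        if key in ALLOWED_SUMMARY_KEYS
        and isinstance(value, int)
        and not isinstance(value, bool)
    }


def audit_safely(write_event, event, logger=logging.getLogger("audit")):
    """Write an audit event; log and swallow failures so the run continues."""
    try:
        write_event(event)
        return True
    except Exception:
        logger.exception("audit write failed for %s", event.get("action"))
        return False
```

The boolean return value lets the caller note on the run record that an audit write was lost, without turning an audit outage into a failed backfill.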

Security & Non-Functional Requirements

  • SR-001 (Control Plane Isolation): /system MUST be isolated to platform identity and MUST deny tenant-plane access as 404 (anti-enumeration).
  • SR-002 (404 vs 403 Semantics):
    • Non-platform users or wrong plane → 404
    • Platform user lacking required capability → 403
  • SR-003 (Login Throttling): The /system/login surface MUST be rate limited at 10/min per IP + username (combined key) and failed login attempts MUST be audited.
  • SR-004 (Session Isolation Strategy): v1 MUST isolate control plane sessions from tenant sessions by using a distinct session cookie name/config for /system (same domain). A dedicated subdomain with separate cookie scope may be introduced later.
  • SR-005 (Break-glass Visibility & Enforcement): Break-glass mode MUST be visually obvious and MUST require a reason; break-glass usage MUST be recorded on the run and in audit.
  • NFR-001 (Performance & Safety):
    • Preflight MUST be read-only and cheap enough for interactive use.
    • Writes MUST be chunked and resilient to partial failures.
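The 404 vs 403 semantics in SR-002 reduce to a two-step decision: plane first, capability second. A minimal sketch, assuming an authenticated user is represented as a dict with an `is_platform` flag and a `capabilities` list (both invented for the example); unauthenticated requests are outside the sketch.

```python
def system_access_status(user, required_capability):
    """HTTP status for a /system request under the 404-vs-403 rule."""
    if not user.get("is_platform"):
        # Wrong plane: answer as if the route did not exist (anti-enumeration).
        return 404
    if required_capability not in user.get("capabilities", ()):
        # Platform user, but missing the capability for this surface.
        return 403
    return 200
```

Checking the plane before the capability matters: a tenant user must receive 404 even when a capability string happens to match, so the existence of /system is never confirmed to the tenant plane.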

UI Action Matrix (mandatory when Filament is changed)

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Runbooks | /system/ops/runbooks | Preflight (read-only), Run… (write, confirm) | N/A | View run (after start) | None | None | N/A | N/A | Yes | Run… requires confirmation; typed confirm + reason required for All tenants. |
| Operation Runs | /system/ops/runs | N/A | List links to run detail (“View run”) | View | None | None | N/A | N/A | Yes | Run detail includes scope, actor, counts, outcome/status. |

Key Entities (include if feature involves data)

  • Runbook: A predefined operator action with preflight and run behavior.
  • Operation Run: A tracked execution record storing scope, actor, status/outcome, and summary counts.
  • Audit Event: An immutable security/ops log entry for the preflight/run lifecycle.
  • Operator Notification: A delivery record/target for failure alerts.
  • Finding: Tenant-owned record whose lifecycle/workflow fields may be backfilled.

User Story 4 - Enterprise-grade UX polish for Ops surfaces (Priority: P2)

As a platform operator, I want the Ops surfaces to look and feel enterprise-grade, with proper visual hierarchy, alert banners, structured card layouts, badge indicators, and metadata, so I can quickly assess system state.

Why this priority: Operator trust and efficiency depend on clear, scannable UI. Raw text and flat layouts slow down triage.

Acceptance Scenarios:

  1. Given the operator opens /system/ops/runbooks, Then the operator warning is rendered as a styled alert banner with icon (not plain text).
  2. Given runbooks are listed, Then each runbook is rendered as a structured card with title, description, scope badge, and "Last run" metadata when available.
  3. Given preflight results are displayed, Then stat values use consistent stat-card styling with labels and prominent values.
  4. Given the operator opens /system/ops/runs/{id}, Then status and outcome are rendered as colored badges (consistent with the existing BadgeRenderer), and scope is shown as a badge/tag.
  5. Given the run detail page, Then summary counts are rendered as a labeled grid (not only raw JSON).

Functional Requirements (UX Polish)

  • FR-010 (Operator Warning Banner): The operator warning on /system/ops/runbooks MUST be rendered as a visually distinct alert banner with an exclamation-triangle icon, amber/warning coloring, and clear heading — matching project alert patterns.
  • FR-011 (Runbook Card Layout): Each runbook MUST be rendered as a card with: title (semibold), description, scope badge (e.g., "All tenants"), and optional "Last run" timestamp + status badge when a previous run exists.
  • FR-012 (Preflight Stat Cards): Preflight result values (affected, total scanned, estimated tenants) MUST be rendered in visually prominent stat cards with labeled headers.
  • FR-013 (Run Detail Badges): Status and outcome on run detail pages MUST use the existing BadgeRenderer / BadgeCatalog system for colored badges with icons.
  • FR-014 (Run Detail Summary Grid): Summary counts on run detail MUST be rendered as a labeled key-value grid, not a raw JSON dump (JSON viewer remains available as a disclosure fallback).

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: In production-like environments, customers have zero UI affordances to trigger backfills/repairs in /admin.
  • SC-002: A platform operator can start a runbook without SSH and reach "View run" in ≤ 3 user interactions from /system/ops/runbooks.
  • SC-003: 100% of run attempts result in an operation run record and start/completion/failure audit events (with failure still recorded even if notifications fail).
  • SC-004: Re-running the same runbook on the same scope after completion results in updated_count = 0 (idempotency).
  • SC-005: Operator warning on runbooks page renders as a styled alert banner (not plain text).