
# Feature Specification: Platform Ops Runbooks (Operator Control Plane) for Backfills & Data Repair

**Feature Branch**: `113-platform-ops-runbooks`
**Created**: 2026-02-26
**Status**: Draft
**Input**: Operator control plane runbooks for safe backfills and data repair; deploy-time automatic execution; operator re-run via `/system`; never exposed in customer UI.

## Clarifications

### Session 2026-02-26

- Q: `/system` Session Isolation Strategy (v1) → A: Option B. Use a distinct session cookie name/config for `/system`.
- Q: `OperationRun.type` for the findings lifecycle backfill runbook → A: Use `findings.lifecycle.backfill` (consistent with the operation catalog). Runbook trigger is exclusive to `/system`; any `/admin` trigger is removed / feature-flagged off.
- Q: v1 scope selector for running the runbook → A: All tenants (default) + Single tenant (picker).
- Q: Failure notification delivery (v1) → A: Deliver via existing alert destinations (Teams webhook / Email) when configured, and always notify the initiating platform operator in-app.
- Q: `/system/login` rate limiting policy (v1) → A: 10/min per IP + username (combined key).
- Q: Platform “allowed tenant universe” (v1) → A: All non-platform tenants present in the database (`tenants.external_id != 'platform'`). The System UI must not allow selecting or targeting the platform tenant.

## Spec Scope Fields *(mandatory)*

- **Scope**: canonical-view (platform control plane)
- **Primary Routes**:
  - `/system/ops/runbooks` (runbook catalog + preflight + run)
  - `/system/ops/runs` (run history + run details)
  - `/admin/*` (explicitly remove any maintenance/backfill affordances)
- **Data Ownership**:
  - Tenant-owned customer data that may be modified by runbooks (e.g., “findings” lifecycle/workflow fields)
  - Platform-owned operational records (operation runs, audit events, operator notifications)
- **RBAC**:
  - Platform identity only (separate from tenant users)
  - Capabilities (v1 minimum): `platform.ops.view`, `platform.runbooks.view`, `platform.runbooks.run`
  - Optional granular capability for this runbook: `platform.runbooks.findings.lifecycle_backfill`

For canonical-view specs, the spec MUST define:

- **Default filter behavior when tenant-context is active**: the runbook defaults to **All tenants** scope; if a tenant is explicitly selected, all counts/changes MUST be limited to that tenant only.
- **Explicit entitlement checks preventing cross-tenant leakage**: a tenant-context user MUST NOT be able to access `/system/*` (deny-as-not-found). Platform operators MUST only be able to target tenants within the platform’s allowed tenant universe.

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Operator runs a runbook safely (Priority: P1)

As a platform operator, I can run a predefined “Rebuild Findings Lifecycle” runbook from `/system` with a clear preflight, explicit confirmation, and an audited, trackable run record.

**Why this priority**: This is the primary operator workflow that eliminates the need for SSH/manual scripts and reduces risk for customer-impacting data changes.

**Independent Test**: Fully testable by visiting `/system/ops/runbooks`, running preflight, starting a run, and verifying the run record + audit events exist.

**Acceptance Scenarios**:

1. **Given** an authorized platform operator, **When** they open `/system/ops/runbooks`, **Then** they see the runbook catalog including “Rebuild Findings Lifecycle” and an operator warning that actions may modify customer data.
2. **Given** preflight reports `affected_count > 0`, **When** the operator confirms the run, **Then** a new operation run is created and the UI links to “View run”.
3. **Given** preflight reports `affected_count = 0`, **When** the operator attempts to run, **Then** the run action is disabled with a clear “Nothing to do” explanation.
4. **Given** the operator chooses “All tenants”, **When** they confirm, **Then** typed confirmation is required (e.g., entering `BACKFILL`) and a reason is required.
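
The confirmation rules in scenarios 2 to 4 can be sketched as a single gate function. This is a minimal illustration, not the project's implementation: the names `RunRequest`, `gate`, and the scope constants are hypothetical, and `BACKFILL` is only the example phrase the spec mentions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical scope identifiers; the spec only names the two scopes in prose.
ALL_TENANTS = "all_tenants"
SINGLE_TENANT = "single_tenant"
CONFIRM_PHRASE = "BACKFILL"  # example phrase from the spec


@dataclass(frozen=True)
class RunRequest:
    scope: str
    affected_count: int  # value reported by preflight
    typed_confirmation: Optional[str] = None
    reason_text: Optional[str] = None


def gate(request: RunRequest) -> tuple[bool, str]:
    """Return (allowed, message) for a run attempt."""
    # Scenario 3: nothing to do, so the run action stays disabled.
    if request.affected_count <= 0:
        return False, "Nothing to do"
    # Scenario 4: All-tenants runs need typed confirmation and a reason.
    if request.scope == ALL_TENANTS:
        if request.typed_confirmation != CONFIRM_PHRASE:
            return False, "Typed confirmation required"
        if not (request.reason_text or "").strip():
            return False, "Reason required"
    return True, "OK"
```

Keeping the gate in one function means the UI can use the same result both to disable the `Run…` action and to explain why it is disabled.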

---

### User Story 2 - Customers never see maintenance actions (Priority: P1)

As a tenant (customer) user, I never see backfill/repair buttons and cannot access the operator control plane.

**Why this priority**: Exposing maintenance controls in customer UI is an enterprise anti-pattern and undermines product trust.

**Independent Test**: Fully testable by checking `/admin` UI surfaces and attempting direct navigation to `/system/*` as a tenant user.

**Acceptance Scenarios**:

1. **Given** a tenant user session, **When** the user requests `/system/ops/runbooks`, **Then** the response is **404** (deny-as-not-found).
2. **Given** production-like configuration, **When** a tenant user views relevant `/admin` screens, **Then** there is no maintenance/backfill/repair UI.
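
The deny-as-not-found behavior, combined with the 404 vs 403 semantics in SR-002, can be expressed as a small decision function. This is a sketch under assumptions: `Identity`, the `plane` labels, and `system_route_status` are hypothetical names, and a real implementation would live in the framework's middleware or policy layer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Identity:
    plane: str  # "platform" or "tenant" (hypothetical labels)
    capabilities: frozenset = field(default_factory=frozenset)


def system_route_status(identity: Identity, required_capability: str) -> int:
    """HTTP status for a request to a /system route."""
    # Wrong plane: answer 404 so the route's existence is not revealed.
    if identity.plane != "platform":
        return 404
    # Right plane, missing capability: an explicit 403 is acceptable here.
    if required_capability not in identity.capabilities:
        return 403
    return 200
```

The plane check runs first on purpose: a tenant user must receive 404 even for a capability name that happens to match, so capability checks never leak information across planes.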

---

### User Story 3 - Same logic for deploy-time and operator re-run (Priority: P2)

As a platform operator or a deploy pipeline, I can execute the same runbook logic, so that deploy-time backfills run automatically and manual re-runs remain available and safe.

**Why this priority**: A single execution path reduces drift between “what deploy does” and “what operators can re-run”, and improves reliability.

**Independent Test**: Fully testable by running the operation twice and verifying idempotency and consistent preflight/run results for the same scope.

**Acceptance Scenarios**:

1. **Given** the runbook was executed once successfully, **When** it is executed again with the same scope, **Then** the second run reports `updated_count = 0` (idempotent behavior).
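
One way to get both idempotency and consistent preflight/run results is to drive both from the same "needs backfill" predicate. The sketch below assumes a hypothetical `lifecycle_status` field and an in-memory list of findings; the real lifecycle fields and storage are not defined in this spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    tenant_id: str
    lifecycle_status: Optional[str] = None  # hypothetical lifecycle field


def needs_backfill(finding: Finding) -> bool:
    # Single predicate shared by preflight and run.
    return finding.lifecycle_status is None


def in_scope(findings, tenant_id=None):
    # tenant_id=None means "All tenants".
    return [f for f in findings if tenant_id is None or f.tenant_id == tenant_id]


def preflight(findings, tenant_id=None) -> int:
    """Read-only: count the records a run would change."""
    return sum(1 for f in in_scope(findings, tenant_id) if needs_backfill(f))


def run_backfill(findings, tenant_id=None) -> dict:
    """Write only records that still need it, so re-runs change nothing."""
    updated = skipped = 0
    for f in in_scope(findings, tenant_id):
        if needs_backfill(f):
            f.lifecycle_status = "open"  # hypothetical default value
            updated += 1
        else:
            skipped += 1
    return {"updated_count": updated, "skipped_count": skipped}
```

Because the deploy pipeline and the operator UI would both call `run_backfill`, a second call on the same scope skips every record and reports `updated_count = 0`.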

### Edge Cases

- Lock already held: another run is in-progress for the same scope (All tenants or the same tenant).
- Large dataset: preflight must remain fast enough for operator use; writes must be chunked to avoid long locks.
- Partial failure: some tenants/records fail while others succeed; run outcome and audit still record what happened.
- Missing reason: an All-tenants or break-glass run cannot start without a reason.
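
The lock and partial-failure cases can be illustrated with a per-scope lock and a chunked loop that counts errors instead of aborting. This is an in-process sketch only: a real deployment would need a lock shared across workers (for example in the database or cache), and `scope_lock`, `ScopeLockHeld`, and `process` are hypothetical names. It follows the spec literally by locking per identical scope key.

```python
import threading
from contextlib import contextmanager


class ScopeLockHeld(Exception):
    """Raised when another run already holds the lock for this scope."""


_held = set()
_guard = threading.Lock()


@contextmanager
def scope_lock(scope_key: str):
    # Refuse to start if a run for the same scope is in progress.
    with _guard:
        if scope_key in _held:
            raise ScopeLockHeld(scope_key)
        _held.add(scope_key)
    try:
        yield
    finally:
        with _guard:
            _held.discard(scope_key)


def chunked(items, size):
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process(items, apply, chunk_size=100) -> dict:
    """Apply a write per item in chunks; one failure does not stop the run."""
    counts = {"updated_count": 0, "error_count": 0}
    for chunk in chunked(items, chunk_size):
        for item in chunk:
            try:
                apply(item)
                counts["updated_count"] += 1
            except Exception:
                counts["error_count"] += 1
    return counts
```

In a database-backed version, each chunk would typically be committed in its own short transaction so that no long-lived lock is held across the whole dataset.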

## Requirements *(mandatory)*

### Constitution alignment notes

- **No customer-plane maintenance**: Any maintenance/backfill/repair affordance in `/admin` is explicitly out of scope for customer UX.
- **Run observability**: Customer-impacting writes MUST be executed as a tracked operation run with clear status/outcome and operator-facing surfaces.
- **Safety gates**: Preflight → explicit confirmation → audited execution is mandatory.

### Functional Requirements

- **FR-001 (Remove Customer Exposure)**: The system MUST NOT expose any backfill/repair controls in `/admin` in production-like environments. Any legacy `/admin` trigger for the findings lifecycle backfill MUST be removed or disabled (feature-flag off by default).
- **FR-002 (Runbook Catalog)**: The system MUST provide a `/system/ops/runbooks` catalog listing predefined runbooks and their descriptions.
- **FR-003 (Runbook: Rebuild Findings Lifecycle)**: The system MUST provide a runbook that supports:
  - Preflight (read-only) showing at least `affected_count`.
  - Run (write) that starts a tracked operation run and links to “View run”.
  - Scope selection: All tenants (default) and Single tenant (picker).
  - Safe confirmation: includes scope + preflight count + “modifies customer data” warning.
  - Typed confirmation for All-tenants scope (e.g., `BACKFILL`).
  - Run disabled when preflight indicates nothing to do.
- **FR-004 (Single Source of Truth)**: The system MUST implement the runbook logic once and reuse it across:
  - deploy-time execution (automation)
  - operator UI execution in `/system`

  The two paths MUST produce consistent results for the same scope.
- **FR-005 (Operation Run Tracking)**: Each run MUST create a run record including:
  - run type identifier: `findings.lifecycle.backfill`
  - scope (all tenants vs single tenant)
  - actor (platform user, including break-glass marker when applicable)
  - outcome/status transitions owned by the service layer
  - numeric summary counts using a centralized allow-list of keys
  - run context containing: `preflight.affected_count`, `updated_count`, `skipped_count`, `error_count`, and duration
- **FR-006 (Audit Events)**: The system MUST write audit events for start, completion, and failure. Audit writing MUST be fail-safe (audit failures do not crash the operation run).
- **FR-007 (Reasons for Sensitive Runs)**: All-tenants runs and break-glass runs MUST require a reason:
  - `reason_code`: one of `DATA_REPAIR`, `INCIDENT`, `SUPPORT`, `SECURITY`
  - `reason_text`: free text (max 500 characters)
- **FR-008 (Locking & Idempotency)**: The system MUST prevent concurrent runs for the same scope via locking and MUST be idempotent (a second execution does not re-write already-correct data).
- **FR-009 (Operator Notification on Failure)**: A failed run MUST notify operator targets with run type + scope + a link to “View run”. v1 delivery:
  - If alert destinations are configured, deliver via existing destinations (Teams webhook / Email).
  - Always notify the initiating platform operator in-app.

  Success notifications are optional and SHOULD be off by default.
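
Two of the requirements above, the centralized allow-list for summary counts (FR-005) and fail-safe audit writing (FR-006), lend themselves to a short sketch. The allow-list contents and the names `sanitize_counts` and `write_audit_safely` are assumptions for illustration; the actual key list and audit writer are defined elsewhere in the project.

```python
import logging

logger = logging.getLogger("ops.runbooks")

# Hypothetical allow-list; the real centralized list is not given in this spec.
ALLOWED_COUNT_KEYS = frozenset({"updated_count", "skipped_count", "error_count"})


def sanitize_counts(raw: dict) -> dict:
    """Keep only allow-listed keys that carry integer values."""
    return {
        key: value
        for key, value in raw.items()
        if key in ALLOWED_COUNT_KEYS
        and isinstance(value, int)
        and not isinstance(value, bool)
    }


def write_audit_safely(writer, event: str, payload: dict) -> bool:
    """Call the audit writer; log and swallow failures so the run continues."""
    try:
        writer(event, payload)
        return True
    except Exception:
        logger.exception("audit write failed for %s", event)
        return False
```

Returning a boolean rather than raising lets the caller record on the run that an audit write was lost, without turning an audit outage into a failed backfill.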

### Security & Non-Functional Requirements

- **SR-001 (Control Plane Isolation)**: `/system` MUST be isolated to platform identity and MUST deny tenant-plane access as **404** (anti-enumeration).
- **SR-002 (404 vs 403 Semantics)**:
  - Non-platform users or wrong plane → **404**
  - Platform user lacking required capability → **403**
- **SR-003 (Login Throttling)**: The `/system/login` surface MUST be rate limited at **10/min per IP + username (combined key)** and failed login attempts MUST be audited.
- **SR-004 (Session Isolation Strategy)**: v1 MUST isolate control plane sessions from tenant sessions by using a distinct session cookie name/config for `/system` (same domain). A dedicated subdomain with separate cookie scope may be introduced later.
- **SR-005 (Break-glass Visibility & Enforcement)**: Break-glass mode MUST be visually obvious and MUST require a reason; break-glass usage MUST be recorded on the run and in audit.
- **NFR-001 (Performance & Safety)**:
  - Preflight MUST be read-only and cheap enough for interactive use.
  - Writes MUST be chunked and resilient to partial failures.
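
The combined-key throttle in SR-003 can be sketched as a sliding-window limiter keyed on IP plus username. This is an in-memory illustration, not the framework's rate limiter: `LoginThrottle` is a hypothetical name, and the sliding window and lowercasing of the username are assumptions that the spec does not state.

```python
import time
from collections import defaultdict, deque

MAX_ATTEMPTS = 10      # from SR-003: 10 per minute
WINDOW_SECONDS = 60.0


class LoginThrottle:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._attempts = defaultdict(deque)  # key -> timestamps of attempts

    @staticmethod
    def key(ip: str, username: str) -> str:
        # Combined key: the limit applies to the (IP, username) pair.
        # Normalizing the username is an assumption, not part of the spec.
        return f"{ip}|{username.strip().lower()}"

    def allow(self, ip: str, username: str) -> bool:
        now = self._clock()
        attempts = self._attempts[self.key(ip, username)]
        # Drop attempts that have left the window.
        while attempts and now - attempts[0] >= WINDOW_SECONDS:
            attempts.popleft()
        if len(attempts) >= MAX_ATTEMPTS:
            return False
        attempts.append(now)
        return True
```

With a combined key, one attacker IP cycling through usernames gets a separate budget per username, so a per-IP ceiling may still be worth considering as a later hardening step.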

## UI Action Matrix *(mandatory when Filament is changed)*

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Runbooks | `/system/ops/runbooks` | `Preflight` (read-only), `Run…` (write, confirm) | N/A | `View run` (after start) | None | None | N/A | N/A | Yes | `Run…` requires confirmation; typed confirm + reason required for All tenants. |
| Operation Runs | `/system/ops/runs` | N/A | List links to run detail (“View run”) | `View` | None | None | N/A | N/A | Yes | Run detail includes scope, actor, counts, outcome/status. |

### Key Entities *(include if feature involves data)*

- **Runbook**: A predefined operator action with preflight and run behavior.
- **Operation Run**: A tracked execution record storing scope, actor, status/outcome, and summary counts.
- **Audit Event**: Immutable security/ops log entries for preflight/run lifecycle.
- **Operator Notification**: A delivery record/target for failure alerts.
- **Finding**: Tenant-owned record whose lifecycle/workflow fields may be backfilled.

### User Story 4 - Enterprise-grade UX polish for Ops surfaces (Priority: P2)

As a platform operator, I want the Ops surfaces to look and feel enterprise-grade, with proper visual hierarchy, alert banners, structured card layouts, badge indicators, and metadata, so that I can quickly assess system state.

**Why this priority**: Operator trust and efficiency depend on clear, scannable UI. Raw text and flat layouts slow down triage.

**Acceptance Scenarios**:

1. **Given** the operator opens `/system/ops/runbooks`, **Then** the operator warning is rendered as a styled alert banner with icon (not plain text).
2. **Given** runbooks are listed, **Then** each runbook is rendered as a structured card with title, description, scope badge, and "Last run" metadata when available.
3. **Given** preflight results are displayed, **Then** stat values use consistent stat-card styling with labels and prominent values.
4. **Given** the operator opens `/system/ops/runs/{id}`, **Then** status and outcome are rendered as colored badges (consistent with the existing `BadgeRenderer`), and scope is shown as a badge/tag.
5. **Given** the run detail page, **Then** summary counts are rendered as a labeled grid (not only raw JSON).

### Functional Requirements (UX Polish)

- **FR-010 (Operator Warning Banner)**: The operator warning on `/system/ops/runbooks` MUST be rendered as a visually distinct alert banner with an `exclamation-triangle` icon, amber/warning coloring, and clear heading — matching project alert patterns.
- **FR-011 (Runbook Card Layout)**: Each runbook MUST be rendered as a card with: title (semibold), description, scope badge (e.g., "All tenants"), and optional "Last run" timestamp + status badge when a previous run exists.
- **FR-012 (Preflight Stat Cards)**: Preflight result values (affected, total scanned, estimated tenants) MUST be rendered in visually prominent stat cards with labeled headers.
- **FR-013 (Run Detail Badges)**: Status and outcome on run detail pages MUST use the existing `BadgeRenderer` / `BadgeCatalog` system for colored badges with icons.
- **FR-014 (Run Detail Summary Grid)**: Summary counts on run detail MUST be rendered as a labeled key-value grid, not a raw JSON dump (JSON viewer remains available as a disclosure fallback).

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: In production-like environments, customers have **zero** UI affordances to trigger backfills/repairs in `/admin`.
- **SC-002**: A platform operator can start a runbook without SSH and reach "View run" in **≤ 3 user interactions** from `/system/ops/runbooks`.
- **SC-003**: 100% of run attempts result in an operation run record and start/completion/failure audit events (with failure still recorded even if notifications fail).
- **SC-004**: Re-running the same runbook on the same scope after completion results in `updated_count = 0` (idempotency).
- **SC-005**: Operator warning on runbooks page renders as a styled alert banner (not plain text).