TenantAtlas/specs/113-platform-ops-runbooks/spec.md
2026-02-26 02:18:19 +01:00

# Feature Specification: Platform Ops Runbooks (Operator Control Plane) for Backfills & Data Repair
**Feature Branch**: `[113-platform-ops-runbooks]`
**Created**: 2026-02-26
**Status**: Draft
**Input**: Operator control plane runbooks for safe backfills and data repair; deploy-time automatic execution; operator re-run via `/system`; never exposed in customer UI.
## Clarifications
### Session 2026-02-26
- Q: `/system` Session Isolation Strategy (v1) → A: B — Use a distinct session cookie name/config for `/system`.
- Q: `OperationRun.type` for the findings lifecycle backfill runbook → A: Use `findings.lifecycle.backfill` (consistent with the operation catalog). Runbook trigger is exclusive to `/system`; any `/admin` trigger is removed / feature-flagged off.
- Q: v1 scope selector for running the runbook → A: All tenants (default) + Single tenant (picker).
- Q: Failure notification delivery (v1) → A: Deliver via existing alert destinations (Teams webhook / Email) when configured, and always notify the initiating platform operator in-app.
- Q: `/system/login` rate limiting policy (v1) → A: 10/min per IP + username (combined key).
- Q: Platform “allowed tenant universe” (v1) → A: All non-platform tenants present in the database (`tenants.external_id != 'platform'`). The System UI must not allow selecting or targeting the platform tenant.
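
Non-normative sketch of the `/system/login` throttle clarified above: 10 attempts per minute on a combined IP + username key. The class name `LoginThrottle`, the fixed 60-second window, and the lowercase normalization of the username are assumptions made for illustration; the spec fixes only the limit and the key.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class LoginThrottle:
    """Fixed-window limiter keyed on (IP, username). Hypothetical helper."""

    max_attempts: int = 10            # from the spec: 10 per minute
    window_seconds: float = 60.0      # fixed window is an assumption
    clock: Callable[[], float] = time.monotonic
    _windows: Dict[Tuple[str, str], Tuple[float, int]] = field(default_factory=dict)

    def attempt(self, ip: str, username: str) -> bool:
        """Record one login attempt; return False once the key is throttled."""
        key = (ip, username.strip().lower())  # normalization is an assumption
        now = self.clock()
        started, count = self._windows.get(key, (now, 0))
        if now - started >= self.window_seconds:
            started, count = now, 0  # window expired, start a new one
        if count >= self.max_attempts:
            return False
        self._windows[key] = (started, count + 1)
        return True
```

Because the key combines both values, the same username from another IP, or another username from the same IP, is counted separately.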
## Spec Scope Fields *(mandatory)*
- **Scope**: canonical-view (platform control plane)
- **Primary Routes**:
  - `/system/ops/runbooks` (runbook catalog + preflight + run)
  - `/system/ops/runs` (run history + run details)
  - `/admin/*` (explicitly remove any maintenance/backfill affordances)
- **Data Ownership**:
  - Tenant-owned customer data that may be modified by runbooks (e.g., “findings” lifecycle/workflow fields)
  - Platform-owned operational records (operation runs, audit events, operator notifications)
- **RBAC**:
  - Platform identity only (separate from tenant users)
  - Capabilities (v1 minimum): `platform.ops.view`, `platform.runbooks.view`, `platform.runbooks.run`
  - Optional granular capability for this runbook: `platform.runbooks.findings.lifecycle_backfill`
For canonical-view specs, the spec MUST define:
- **Default filter behavior when tenant-context is active**: the runbook defaults to **All tenants** scope; if a tenant is explicitly selected, all counts/changes MUST be limited to that tenant only.
- **Explicit entitlement checks preventing cross-tenant leakage**: a tenant-context user MUST NOT be able to access `/system/*` (deny-as-not-found). Platform operators MUST only be able to target tenants within the platform's allowed tenant universe.
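
Non-normative sketch of how a scope selection could be resolved against the allowed tenant universe. The `Tenant` type and the function names are hypothetical; the only rule taken from the spec is that the platform tenant (`external_id = 'platform'`) is never targetable and that "All tenants" is the default.

```python
from dataclasses import dataclass
from typing import List, Optional

PLATFORM_EXTERNAL_ID = "platform"  # from the spec's tenant-universe rule


@dataclass(frozen=True)
class Tenant:
    id: int
    external_id: str


def allowed_tenants(all_tenants: List[Tenant]) -> List[Tenant]:
    """Every tenant except the platform tenant."""
    return [t for t in all_tenants if t.external_id != PLATFORM_EXTERNAL_ID]


def resolve_scope(all_tenants: List[Tenant], tenant_id: Optional[int] = None) -> List[Tenant]:
    """None means 'All tenants' (the default); otherwise exactly one allowed tenant."""
    universe = allowed_tenants(all_tenants)
    if tenant_id is None:
        return universe
    selected = [t for t in universe if t.id == tenant_id]
    if not selected:
        # Covers unknown ids and any attempt to target the platform tenant.
        raise ValueError("tenant is not in the allowed tenant universe")
    return selected
```

Running both preflight and run against the list returned by `resolve_scope` is one way to keep counts and changes limited to the selected tenant.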
## User Scenarios & Testing *(mandatory)*
### User Story 1 - Operator runs a runbook safely (Priority: P1)
As a platform operator, I can run a predefined “Rebuild Findings Lifecycle” runbook from `/system` with a clear preflight, explicit confirmation, and an audited, trackable run record.
**Why this priority**: This is the primary operator workflow that eliminates the need for SSH/manual scripts and reduces risk for customer-impacting data changes.
**Independent Test**: Fully testable by visiting `/system/ops/runbooks`, running preflight, starting a run, and verifying the run record + audit events exist.
**Acceptance Scenarios**:
1. **Given** an authorized platform operator, **When** they open `/system/ops/runbooks`, **Then** they see the runbook catalog including “Rebuild Findings Lifecycle” and an operator warning that actions may modify customer data.
2. **Given** preflight reports `affected_count > 0`, **When** the operator confirms the run, **Then** a new operation run is created and the UI links to “View run”.
3. **Given** preflight reports `affected_count = 0`, **When** the operator attempts to run, **Then** the run action is disabled with a clear “Nothing to do” explanation.
4. **Given** the operator chooses “All tenants”, **When** they confirm, **Then** typed confirmation is required (e.g., entering `BACKFILL`) and a reason is required.
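
Non-normative sketch of the confirmation gate implied by scenarios 2 to 4, using the reason codes and the 500-character limit from FR-007. `RunRequest` and `gate_errors` are hypothetical names, the error strings are placeholders, and break-glass handling is left out.

```python
from dataclasses import dataclass
from typing import List, Optional

CONFIRM_PHRASE = "BACKFILL"  # the spec gives this as an example phrase
REASON_CODES = {"DATA_REPAIR", "INCIDENT", "SUPPORT", "SECURITY"}  # FR-007
MAX_REASON_TEXT = 500  # FR-007


@dataclass
class RunRequest:
    all_tenants: bool
    affected_count: int               # taken from the preflight result
    typed_confirmation: str = ""
    reason_code: Optional[str] = None
    reason_text: str = ""


def gate_errors(req: RunRequest) -> List[str]:
    """Return the reasons a run may not start; an empty list means it may."""
    if req.affected_count <= 0:
        return ["Nothing to do"]
    errors: List[str] = []
    if req.all_tenants:
        if req.typed_confirmation != CONFIRM_PHRASE:
            errors.append("typed confirmation required")
        if req.reason_code not in REASON_CODES:
            errors.append("valid reason_code required")
    if len(req.reason_text) > MAX_REASON_TEXT:
        errors.append("reason_text exceeds 500 characters")
    return errors
```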
---
### User Story 2 - Customers never see maintenance actions (Priority: P1)
As a tenant (customer) user, I never see backfill/repair buttons and cannot access the operator control plane.
**Why this priority**: Exposing maintenance controls in customer UI is an enterprise anti-pattern and undermines product trust.
**Independent Test**: Fully testable by checking `/admin` UI surfaces and attempting direct navigation to `/system/*` as a tenant user.
**Acceptance Scenarios**:
1. **Given** a tenant user session, **When** the user requests `/system/ops/runbooks`, **Then** the response is **404** (deny-as-not-found).
2. **Given** production-like configuration, **When** a tenant user views relevant `/admin` screens, **Then** there is no maintenance/backfill/repair UI.
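
Non-normative sketch of the status decision for a `/system/*` request, combining the 404 in scenario 1 with the 403 case defined in SR-002. `Principal` and `system_status` are hypothetical names; how an unauthenticated visitor is handled is not modeled here.

```python
from dataclasses import dataclass, field
from typing import FrozenSet


@dataclass(frozen=True)
class Principal:
    is_platform_user: bool
    capabilities: FrozenSet[str] = field(default_factory=frozenset)


def system_status(principal: Principal, required_capability: str) -> int:
    """HTTP status for a `/system/*` request, following SR-002."""
    if not principal.is_platform_user:
        return 404  # wrong plane: deny-as-not-found, even if a capability matches
    if required_capability not in principal.capabilities:
        return 403  # platform user lacking the required capability
    return 200
```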
---
### User Story 3 - Same logic for deploy-time and operator re-run (Priority: P2)
As a platform operator or a deploy pipeline, I can execute the same runbook logic, so that deploy-time backfills run automatically and manual re-runs remain available and safe.
**Why this priority**: A single execution path reduces drift between “what deploy does” and “what operators can re-run”, and improves reliability.
**Independent Test**: Fully testable by running the operation twice and verifying idempotency and consistent preflight/run results for the same scope.
**Acceptance Scenarios**:
1. **Given** the runbook was executed once successfully, **When** it is executed again with the same scope, **Then** the second run reports `updated_count = 0` (idempotent behavior).
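
Non-normative sketch of the idempotency property in this scenario. The `lifecycle_state` field, the "unset means affected" rule, and the value written are placeholders, since the spec does not define the actual lifecycle fields. The point is that preflight and run share one predicate.

```python
from typing import Dict, List, Optional

Finding = Dict[str, Optional[str]]


def needs_backfill(finding: Finding) -> bool:
    """Hypothetical rule: a finding is affected while its lifecycle field is unset."""
    return finding.get("lifecycle_state") is None


def preflight(findings: List[Finding]) -> Dict[str, int]:
    """Read-only count of findings the run would change."""
    return {"affected_count": sum(needs_backfill(f) for f in findings)}


def run(findings: List[Finding]) -> Dict[str, int]:
    """Write only the rows preflight counted, so a repeat run changes nothing."""
    updated = skipped = 0
    for finding in findings:
        if needs_backfill(finding):
            finding["lifecycle_state"] = "open"  # placeholder value
            updated += 1
        else:
            skipped += 1
    return {"updated_count": updated, "skipped_count": skipped}
```

If the deploy pipeline and the `/system` UI both call the same `run` function, a second execution over the same scope has nothing left to write and reports `updated_count = 0`.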
### Edge Cases
- Lock already held: another run is in-progress for the same scope (All tenants or the same tenant).
- Large dataset: preflight must remain fast enough for operator use; writes must be chunked to avoid long locks.
- Partial failure: some tenants/records fail while others succeed; run outcome and audit still record what happened.
- Missing reason: an All-tenants or break-glass run cannot start without a reason.
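
Non-normative sketch of the "lock already held" case: one lock per scope, refused while a run for the same scope is in progress. `ScopeLocks` is a hypothetical in-process stand-in for a shared lock store, and the key format is an assumption. Whether an All-tenants run should also block single-tenant runs is not stated in the spec, so the sketch treats the two scopes as independent.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, Optional, Set


class ScopeLockedError(RuntimeError):
    """Raised when a run for the same scope is already in progress."""


class ScopeLocks:
    """In-process stand-in for a shared lock store (hypothetical)."""

    def __init__(self) -> None:
        self._held: Set[str] = set()
        self._mutex = threading.Lock()

    @staticmethod
    def key(tenant_id: Optional[int]) -> str:
        """None is the All-tenants scope; otherwise one key per tenant."""
        return "all" if tenant_id is None else f"tenant:{tenant_id}"

    @contextmanager
    def hold(self, tenant_id: Optional[int]) -> Iterator[None]:
        key = self.key(tenant_id)
        with self._mutex:
            if key in self._held:
                raise ScopeLockedError(key)
            self._held.add(key)
        try:
            yield
        finally:
            # Release even when the run fails, so a retry is possible.
            with self._mutex:
                self._held.discard(key)
```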
## Requirements *(mandatory)*
### Constitution alignment notes
- **No customer-plane maintenance**: Any maintenance/backfill/repair affordance in `/admin` is explicitly out of scope for customer UX.
- **Run observability**: Customer-impacting writes MUST be executed as a tracked operation run with clear status/outcome and operator-facing surfaces.
- **Safety gates**: Preflight → explicit confirmation → audited execution is mandatory.
### Functional Requirements
- **FR-001 (Remove Customer Exposure)**: The system MUST NOT expose any backfill/repair controls in `/admin` in production-like environments. Any legacy `/admin` trigger for the findings lifecycle backfill MUST be removed or disabled (feature-flag off by default).
- **FR-002 (Runbook Catalog)**: The system MUST provide a `/system/ops/runbooks` catalog listing predefined runbooks and their descriptions.
- **FR-003 (Runbook: Rebuild Findings Lifecycle)**: The system MUST provide a runbook that supports:
  - Preflight (read-only) showing at least `affected_count`.
  - Run (write) that starts a tracked operation run and links to “View run”.
  - Scope selection: All tenants (default) and Single tenant (picker).
  - Safe confirmation: includes scope + preflight count + “modifies customer data” warning.
  - Typed confirmation for All-tenants scope (e.g., `BACKFILL`).
  - Run disabled when preflight indicates nothing to do.
- **FR-004 (Single Source of Truth)**: The system MUST implement the runbook logic once and reuse it across:
  - deploy-time execution (automation)
  - operator UI execution in `/system`

  The two paths MUST produce consistent results for the same scope.
- **FR-005 (Operation Run Tracking)**: Each run MUST create a run record including:
  - run type identifier: `findings.lifecycle.backfill`
  - scope (all tenants vs single tenant)
  - actor (platform user, including break-glass marker when applicable)
  - outcome/status transitions owned by the service layer
  - numeric summary counts using a centralized allow-list of keys
  - run context containing: `preflight.affected_count`, `updated_count`, `skipped_count`, `error_count`, and duration
- **FR-006 (Audit Events)**: The system MUST write audit events for start, completion, and failure. Audit writing MUST be fail-safe (audit failures do not crash the operation run).
- **FR-007 (Reasons for Sensitive Runs)**: All-tenants runs and break-glass runs MUST require a reason:
  - `reason_code`: one of `DATA_REPAIR`, `INCIDENT`, `SUPPORT`, `SECURITY`
  - `reason_text`: free text (max 500 characters)
- **FR-008 (Locking & Idempotency)**: The system MUST prevent concurrent runs for the same scope via locking and MUST be idempotent (a second execution does not re-write already-correct data).
- **FR-009 (Operator Notification on Failure)**: A failed run MUST notify operator targets with run type + scope + a link to “View run”. v1 delivery:
  - If alert destinations are configured, deliver via existing destinations (Teams webhook / Email).
  - Always notify the initiating platform operator in-app.

  Success notifications are optional and SHOULD be off by default.
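
Non-normative sketch of FR-005 and FR-006 together: status transitions driven from one service function, summary counts filtered through an allow-list, and audit writes that cannot crash the run. The status values, event names, and `OperationRun` shape are assumptions; only the run type and the count keys come from this spec.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict

logger = logging.getLogger(__name__)

# Centralized allow-list of summary keys (names taken from FR-005).
SUMMARY_KEYS = frozenset({"updated_count", "skipped_count", "error_count"})


@dataclass
class OperationRun:
    type: str = "findings.lifecycle.backfill"
    scope: str = "all_tenants"       # hypothetical scope encoding
    status: str = "queued"           # hypothetical status values
    summary: Dict[str, int] = field(default_factory=dict)


AuditWriter = Callable[[str, OperationRun], None]


def safe_audit(write: AuditWriter, event: str, run: OperationRun) -> None:
    """Write an audit event; log and swallow failures so the run continues."""
    try:
        write(event, run)
    except Exception:
        logger.exception("audit write failed for %s", event)


def execute(run: OperationRun, work: Callable[[], Dict[str, int]], audit: AuditWriter) -> OperationRun:
    """Own the status transitions and audit start, completion, and failure."""
    run.status = "running"
    safe_audit(audit, "runbook.started", run)
    try:
        counts = work()
        # Drop any key that is not on the allow-list.
        run.summary = {k: int(v) for k, v in counts.items() if k in SUMMARY_KEYS}
        run.status = "succeeded"
        safe_audit(audit, "runbook.completed", run)
    except Exception:
        run.status = "failed"
        safe_audit(audit, "runbook.failed", run)
    return run
```

A failure notification (FR-009) would hang off the `failed` branch in the same way and should be equally unable to change the recorded outcome.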
### Security & Non-Functional Requirements
- **SR-001 (Control Plane Isolation)**: `/system` MUST be isolated to platform identity and MUST deny tenant-plane access as **404** (anti-enumeration).
- **SR-002 (404 vs 403 Semantics)**:
  - Non-platform users or wrong plane → **404**
  - Platform user lacking required capability → **403**
- **SR-003 (Login Throttling)**: The `/system/login` surface MUST be rate limited at **10/min per IP + username (combined key)** and failed login attempts MUST be audited.
- **SR-004 (Session Isolation Strategy)**: v1 MUST isolate control plane sessions from tenant sessions by using a distinct session cookie name/config for `/system` (same domain). A dedicated subdomain with separate cookie scope may be introduced later.
- **SR-005 (Break-glass Visibility & Enforcement)**: Break-glass mode MUST be visually obvious and MUST require a reason; break-glass usage MUST be recorded on the run and in audit.
- **NFR-001 (Performance & Safety)**:
  - Preflight MUST be read-only and cheap enough for interactive use.
  - Writes MUST be chunked and resilient to partial failures.
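
Non-normative sketch of NFR-001's write path: work is split into bounded chunks and a failing record is counted instead of aborting the run. The function names and the default chunk size of 500 are assumptions; in a real system each chunk would presumably map to one short database transaction.

```python
from typing import Callable, Dict, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def apply_in_chunks(
    items: Sequence[T],
    update: Callable[[T], bool],
    chunk_size: int = 500,  # assumed default, not from the spec
) -> Dict[str, int]:
    """Apply `update` per item; it returns True if it wrote, False if nothing to do."""
    counts = {"updated_count": 0, "skipped_count": 0, "error_count": 0}
    for chunk in chunked(items, chunk_size):
        # One chunk stands in for one short transaction (assumption).
        for item in chunk:
            try:
                key = "updated_count" if update(item) else "skipped_count"
            except Exception:
                key = "error_count"  # record the failure and keep going
            counts[key] += 1
    return counts
```

The returned keys match the run context fields listed in FR-005, so a partially failed run still reports what was updated, skipped, and errored.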
## UI Action Matrix *(mandatory when Filament is changed)*
| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Runbooks | `/system/ops/runbooks` | `Preflight` (read-only), `Run…` (write, confirm) | N/A | `View run` (after start) | None | None | N/A | N/A | Yes | `Run…` requires confirmation; typed confirm + reason required for All tenants. |
| Operation Runs | `/system/ops/runs` | N/A | List links to run detail (“View run”) | `View` | None | None | N/A | N/A | Yes | Run detail includes scope, actor, counts, outcome/status. |
### Key Entities *(include if feature involves data)*
- **Runbook**: A predefined operator action with preflight and run behavior.
- **Operation Run**: A tracked execution record storing scope, actor, status/outcome, and summary counts.
- **Audit Event**: Immutable security/ops log entries for preflight/run lifecycle.
- **Operator Notification**: A delivery record/target for failure alerts.
- **Finding**: Tenant-owned record whose lifecycle/workflow fields may be backfilled.
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: In production-like environments, customers have **zero** UI affordances to trigger backfills/repairs in `/admin`.
- **SC-002**: A platform operator can start a runbook without SSH and reach “View run” in **≤ 3 user interactions** from `/system/ops/runbooks`.
- **SC-003**: 100% of run attempts result in an operation run record and start/completion/failure audit events (with failure still recorded even if notifications fail).
- **SC-004**: Re-running the same runbook on the same scope after completion results in `updated_count = 0` (idempotency).