TenantAtlas/specs/113-platform-ops-runbooks/plan.md
2026-02-26 02:18:19 +01:00

# Implementation Plan: Platform Ops Runbooks (Spec 113)

**Branch**: `[113-platform-ops-runbooks]` | **Date**: 2026-02-26

**Spec**: `specs/113-platform-ops-runbooks/spec.md`

**Input**: Feature specification + design artifacts in `specs/113-platform-ops-runbooks/`

**Note**: This file is generated/maintained via Spec Kit (`/speckit.plan`). Keep it concise and free of placeholders/duplicates.

## Summary
Introduce a `/system` operator control plane for safe backfills and data repair.

v1 delivers one runbook, **Rebuild Findings Lifecycle**. It must:

- run a read-only preflight
- require explicit confirmation (typed confirmation for all-tenants runs) and capture a reason
- execute as a tracked `OperationRun` with audit events, locking, and idempotency
- **never** be exposed in the customer `/admin` plane
- reuse one shared code path across the System UI, the CLI, and the deploy hook
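
The confirmation and reason requirements above can be sketched as a single validation step that every entry point calls before starting a run. This is a minimal, framework-free sketch, not the planned implementation: the function name, the scope values (`single_tenant`, `all_tenants`), and the confirmation phrase are hypothetical and do not come from the spec.

```php
<?php

declare(strict_types=1);

// Hypothetical phrase an operator must type before an all-tenants run.
const ALL_TENANTS_CONFIRMATION = 'ALL TENANTS';

/**
 * Validate a request to start the runbook.
 *
 * @return list<string> Validation errors; an empty list means the run may start.
 */
function validateRunbookStart(
    string $scope,
    bool $confirmed,
    string $typedConfirmation,
    string $reason
): array {
    $errors = [];

    // Scope names are assumptions made for this sketch.
    if (! in_array($scope, ['single_tenant', 'all_tenants'], true)) {
        $errors[] = 'Unknown scope.';
    }

    // Every run needs an explicit confirmation.
    if (! $confirmed) {
        $errors[] = 'Explicit confirmation is required.';
    }

    // All-tenants runs additionally need the typed phrase.
    if ($scope === 'all_tenants' && trim($typedConfirmation) !== ALL_TENANTS_CONFIRMATION) {
        $errors[] = 'Typed confirmation does not match.';
    }

    // A reason is always captured.
    if (trim($reason) === '') {
        $errors[] = 'A reason is required.';
    }

    return $errors;
}
```

Keeping this check in the shared service, rather than in the UI form alone, would let the CLI and the deploy hook enforce the same rules.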
## Technical Context
- **Language/Runtime**: PHP 8.4, Laravel 12
- **Admin UI**: Filament v5 (Livewire v4)
- **Storage**: PostgreSQL
- **Testing**: Pest v4 (required for runtime behavior changes)
- **Ops primitives**: `OperationRun` + `OperationRunService` (service owns status/outcome transitions)
## Non-negotiables (Constitution / Spec constraints)
- Cross-plane access (`/admin` → `/system`) must be deny-as-not-found (**404**).
- Platform user missing a required capability must be **403**.
- `/system` session cookie must be isolated (distinct cookie name) and applied **before** `StartSession`.
- `/system/login` throttling: **10/min** per **IP + username** key; failed login attempts are audited.
- Any destructive-like action uses Filament `->action(...)` and `->requiresConfirmation()`.
- Ops-UX contract: toast intent-only; progress in run detail; terminal DB notification is `OperationRunCompleted` (initiator-only); no queued/running DB notifications.
- Audit writes are fail-safe (audit failure must not crash the runbook).
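
Three of the rules above are easy to get subtly wrong, so a small sketch may help: the 404-before-403 ordering, the composite throttle key, and the fail-safe audit write. The code is plain PHP under stated assumptions. The guard name `platform`, the capability string used in the usage note, and all function names are hypothetical; in the real code base this logic would presumably live in the middleware and services listed under the planned touch points.

```php
<?php

declare(strict_types=1);

// Hypothetical name of the guard that authenticates platform operators.
const PLATFORM_GUARD = 'platform';

/**
 * Decide the HTTP status for a request to a /system route.
 * The guard check comes first so customer-plane users never learn the route exists.
 *
 * @param list<string> $capabilities Capabilities held by the current user.
 */
function systemAccessStatus(?string $guard, array $capabilities, string $requiredCapability): int
{
    if ($guard !== PLATFORM_GUARD) {
        return 404; // cross-plane access: deny as not found
    }

    if (! in_array($requiredCapability, $capabilities, true)) {
        return 403; // platform user without the required capability
    }

    return 200;
}

/**
 * Build the login throttle key from IP + username.
 * Normalising the username stops case or whitespace variants from getting separate buckets.
 */
function loginThrottleKey(string $ip, string $username): string
{
    return strtolower(trim($username)) . '|' . $ip;
}

/**
 * Run an audit write without letting its failure crash the caller.
 *
 * @return bool True if the write succeeded, false if it failed and was swallowed.
 */
function auditSafely(callable $write): bool
{
    try {
        $write();

        return true;
    } catch (\Throwable $e) {
        // Report the failure through a side channel, then carry on with the runbook.
        error_log('Audit write failed: ' . $e->getMessage());

        return false;
    }
}
```

For example, `systemAccessStatus('web', [], 'platform.runbooks.run')` yields 404 even though the capability is also missing, because the guard mismatch is evaluated first.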
## Scope decisions (v1)
- **Canonical run viewing** for this spec is the **System panel**:
  - Runbooks: `/system/ops/runbooks`
  - Runs: `/system/ops/runs`
- **Allowed tenant universe (v1)**: all non-platform tenants present in the database (`tenants.external_id != 'platform'`). The System UI must not allow selecting or targeting the platform tenant.
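
The tenant-universe rule can be illustrated with a small filter plus a guard for single-tenant targeting. This is a sketch over plain arrays with hypothetical function names, not the planned query. One assumption is worth flagging: in SQL, `external_id != 'platform'` does not match rows whose `external_id` is `NULL`, whereas the PHP comparison below treats `null` as non-platform and keeps the row. Which behaviour is intended is not stated in this plan.

```php
<?php

declare(strict_types=1);

// Value from the plan that marks the platform tenant.
const PLATFORM_EXTERNAL_ID = 'platform';

/**
 * Return the tenants a runbook may target, i.e. everything except the platform tenant.
 *
 * @param list<array{id: int, external_id: ?string}> $tenants
 * @return list<array{id: int, external_id: ?string}>
 */
function allowedTenants(array $tenants): array
{
    return array_values(array_filter(
        $tenants,
        static fn (array $tenant): bool => ($tenant['external_id'] ?? null) !== PLATFORM_EXTERNAL_ID
    ));
}

/**
 * Reject an explicit attempt to target the platform tenant.
 *
 * @param array{id: int, external_id: ?string} $tenant
 */
function assertTargetable(array $tenant): void
{
    if (($tenant['external_id'] ?? null) === PLATFORM_EXTERNAL_ID) {
        throw new InvalidArgumentException('The platform tenant cannot be targeted.');
    }
}
```

Applying the same guard in the shared service, and not only in the UI's tenant picker, would also cover requests that bypass the form.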
## Project Structure
### Documentation
```text
specs/113-platform-ops-runbooks/
├── spec.md
├── plan.md
├── research.md
├── data-model.md
├── quickstart.md
├── tasks.md
└── contracts/
    └── system-ops-runbooks.openapi.yaml
```
### Source code (planned touch points)
```text
app/
├── Console/Commands/
│   ├── TenantpilotBackfillFindingLifecycle.php
│   └── TenantpilotRunDeployRunbooks.php
├── Filament/System/Pages/
│   └── Ops/
│       ├── Runbooks.php
│       ├── Runs.php
│       └── ViewRun.php
├── Http/Middleware/
│   ├── EnsureCorrectGuard.php
│   ├── EnsurePlatformCapability.php
│   └── UseSystemSessionCookie.php
├── Jobs/
│   ├── BackfillFindingLifecycleJob.php
│   ├── BackfillFindingLifecycleWorkspaceJob.php
│   └── BackfillFindingLifecycleTenantIntoWorkspaceRunJob.php
├── Providers/Filament/
│   └── SystemPanelProvider.php
├── Services/
│   ├── Alerts/AlertDispatchService.php
│   ├── OperationRunService.php
│   └── Runbooks/FindingsLifecycleBackfillRunbookService.php
└── Support/Auth/
    └── PlatformCapabilities.php

resources/views/filament/system/pages/ops/
├── runbooks.blade.php
├── runs.blade.php
└── view-run.blade.php

tests/Feature/System/
├── Spec113/
└── OpsRunbooks/
## Implementation Phases
1) **Foundational security hardening**
   - Capability registry additions.
   - 404 vs 403 semantics correctness.
   - System session cookie isolation.
   - System login throttling.
2) **Runbook core service (single source of truth)**
   - `preflight(scope)` + `start(scope, initiator, reason, source)`.
   - Audit events (fail-safe).
   - Locking + idempotency.
3) **Execution pipeline**
   - All-tenants orchestration as a workspace-scoped bulk run.
   - Fan-out tenant jobs update shared run counts and completion.
4) **System UI surfaces**
   - `/system/ops/runbooks` (preflight + confirm + start).
   - `/system/ops/runs` list + `/system/ops/runs/{run}` detail.
5) **Remove customer-plane exposure**
   - Remove/disable `/admin` maintenance trigger (feature flag default-off) + regression test.
6) **Shared entry points**
   - Refactor existing CLI command to call the shared service.
   - Add deploy hook command that calls the same service.
   - Run focused tests + formatting (`vendor/bin/sail artisan test --compact` + `vendor/bin/sail bin pint --dirty`).
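
Phases 2 and 3 combine two ideas: an idempotent `start` that returns the already-active run instead of creating a duplicate, and fan-out jobs that report back into shared counters until the run completes. The sketch below models both in memory so the control flow is visible. It is an illustration only: `RunTracker`, its array shape, and the identity key (runbook + scope) are hypothetical, and in the real system `OperationRunService` owns status transitions, with counter updates and locking needing to be atomic at the database or cache level.

```php
<?php

declare(strict_types=1);

final class RunTracker
{
    /** @var array<string, array<string, mixed>> Runs keyed by identity hash. */
    private array $runs = [];

    /**
     * Start a run, or return the active one with the same identity.
     *
     * @return array{key: string, created: bool}
     */
    public function start(string $runbook, string $scope, string $initiator, string $reason, int $totalTenants): array
    {
        // Assumed identity: same runbook + same scope means "the same run".
        $key = hash('sha256', $runbook . '|' . $scope);

        if (isset($this->runs[$key]) && $this->runs[$key]['status'] !== 'completed') {
            return ['key' => $key, 'created' => false]; // idempotent: reuse the active run
        }

        $this->runs[$key] = [
            'status' => $totalTenants === 0 ? 'completed' : 'running',
            'initiator' => $initiator,
            'reason' => $reason,
            'total' => $totalTenants,
            'processed' => 0,
            'failed' => 0,
        ];

        return ['key' => $key, 'created' => true];
    }

    /** Called once per tenant job; the last report completes the run. */
    public function recordTenantResult(string $key, bool $succeeded): void
    {
        if (! isset($this->runs[$key]) || $this->runs[$key]['status'] === 'completed') {
            return; // unknown or finished run: ignore late or duplicate reports
        }

        $this->runs[$key]['processed']++;

        if (! $succeeded) {
            $this->runs[$key]['failed']++;
        }

        if ($this->runs[$key]['processed'] >= $this->runs[$key]['total']) {
            $this->runs[$key]['status'] = 'completed';
        }
    }

    /** @return array<string, mixed>|null */
    public function get(string $key): ?array
    {
        return $this->runs[$key] ?? null;
    }
}
```

A second `start` call with the same runbook and scope while the first run is still active returns `created: false` and the same key, which is the behaviour the System UI, the CLI, and the deploy hook would all rely on.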