TenantAtlas/specs/113-platform-ops-runbooks/plan.md
2026-02-26 02:18:19 +01:00

Implementation Plan: Platform Ops Runbooks (Spec 113)

Branch: [113-platform-ops-runbooks] | Date: 2026-02-26
Spec: specs/113-platform-ops-runbooks/spec.md
Input: Feature specification + design artifacts in specs/113-platform-ops-runbooks/

Note: This file is generated/maintained via Spec Kit (/speckit.plan). Keep it concise and free of placeholders/duplicates.

Summary

Introduce a /system operator control plane for safe backfills/data repair.

v1 delivers one runbook: Rebuild Findings Lifecycle. It must:

  • run a read-only preflight
  • require explicit confirmation (typed confirmation for all-tenants runs) plus reason capture
  • execute as a tracked OperationRun with audit events, locking, and idempotency
  • never be exposed in the customer /admin plane
  • reuse one shared code path across the System UI, CLI, and deploy hook
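
The confirmation and reason-capture requirement above could be enforced by a small validation step in front of the start call. The following is a minimal sketch in plain PHP, not the spec's implementation: the class name, the scope strings, the error codes, and the confirmation phrase are all hypothetical.

```php
<?php

// Hypothetical helper: none of these names come from the spec.
final class RunbookConfirmation
{
    // Assumed phrase the operator must type for an all-tenants run.
    public const ALL_TENANTS_PHRASE = 'ALL TENANTS';

    /**
     * Returns a list of validation error codes; an empty list means
     * the runbook may be started.
     *
     * @return list<string>
     */
    public static function validate(string $scope, bool $confirmed, ?string $typed, string $reason): array
    {
        $errors = [];

        // Every run needs an explicit confirmation.
        if (! $confirmed) {
            $errors[] = 'confirmation_required';
        }

        // All-tenants runs additionally need the typed phrase.
        if ($scope === 'all_tenants' && trim((string) $typed) !== self::ALL_TENANTS_PHRASE) {
            $errors[] = 'typed_confirmation_mismatch';
        }

        // A non-empty reason is always captured.
        if (trim($reason) === '') {
            $errors[] = 'reason_required';
        }

        return $errors;
    }
}
```

Keeping this check in the shared service rather than in the Filament page would let the CLI and deploy hook apply the same rules.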

Technical Context

  • Language/Runtime: PHP 8.4, Laravel 12
  • Admin UI: Filament v5 (Livewire v4)
  • Storage: PostgreSQL
  • Testing: Pest v4 (required for runtime behavior changes)
  • Ops primitives: OperationRun + OperationRunService (service owns status/outcome transitions)

Non-negotiables (Constitution / Spec constraints)

  • Cross-plane access (between /admin and /system) must be deny-as-not-found (404).
  • A platform user missing a required capability must receive 403.
  • /system session cookie must be isolated (distinct cookie name) and applied before StartSession.
  • /system/login throttling: 10 attempts per minute, keyed on IP + username; failed login attempts are audited.
  • Any destructive-like action uses Filament ->action(...) and ->requiresConfirmation().
  • Ops-UX contract: toast intent-only; progress in run detail; terminal DB notification is OperationRunCompleted (initiator-only); no queued/running DB notifications.
  • Audit writes are fail-safe (audit failure must not crash the runbook).
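
The 404 versus 403 rule above comes down to the order of two checks: plane membership first, capability second. A minimal sketch in plain PHP follows. The function name, the guard name 'platform', and the capability string used in the usage note are assumptions for illustration; in the application this logic would presumably live in the middleware listed under Project Structure.

```php
<?php

/**
 * Decide the HTTP status for a request to the /system plane.
 * Hypothetical helper; the guard name 'platform' is an assumption.
 *
 * @param list<string> $capabilities capabilities held by the current user
 */
function systemPlaneStatus(string $guard, array $capabilities, string $required): int
{
    // Wrong plane: deny as not found so the route's existence is not revealed.
    if ($guard !== 'platform') {
        return 404;
    }

    // Right plane, but the required capability is missing: forbidden.
    if (! in_array($required, $capabilities, true)) {
        return 403;
    }

    return 200;
}
```

For example, a customer-plane user always gets 404 regardless of capabilities, while a platform user without a hypothetical 'ops.runbooks.run' capability gets 403.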

Scope decisions (v1)

  • Canonical run viewing for this spec is the System panel:
    • Runbooks: /system/ops/runbooks
    • Runs: /system/ops/runs
  • Allowed tenant universe (v1): all non-platform tenants present in the database (tenants.external_id != 'platform'). The System UI must not allow selecting or targeting the platform tenant.
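
The tenant-universe rule above can be illustrated with a plain PHP filter over tenant rows. This is a sketch only: the function names are hypothetical, rows are assumed to be arrays with a non-null external_id, and the real implementation would more likely apply the same condition as a database query constraint.

```php
<?php

/**
 * Keep only tenants that a runbook may target (external_id != 'platform').
 * Hypothetical helper; assumes every row has a non-null external_id.
 *
 * @param  list<array{id:int, external_id:string}> $tenants
 * @return list<array{id:int, external_id:string}>
 */
function allowedTenantUniverse(array $tenants): array
{
    return array_values(array_filter(
        $tenants,
        static fn (array $tenant): bool => $tenant['external_id'] !== 'platform'
    ));
}

/**
 * Guard for single-tenant targeting: the platform tenant is never selectable.
 */
function assertTargetable(array $tenant): void
{
    if ($tenant['external_id'] === 'platform') {
        throw new InvalidArgumentException('The platform tenant cannot be targeted.');
    }
}
```

Applying the guard on the server side as well as hiding the platform tenant in the UI would keep the rule intact for the CLI and deploy hook.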

Project Structure

Documentation

specs/113-platform-ops-runbooks/
├── spec.md
├── plan.md
├── research.md
├── data-model.md
├── quickstart.md
├── tasks.md
└── contracts/
    └── system-ops-runbooks.openapi.yaml

Source code (planned touch points)

app/
├── Console/Commands/
│   ├── TenantpilotBackfillFindingLifecycle.php
│   └── TenantpilotRunDeployRunbooks.php
├── Filament/System/Pages/
│   └── Ops/
│       ├── Runbooks.php
│       ├── Runs.php
│       └── ViewRun.php
├── Http/Middleware/
│   ├── EnsureCorrectGuard.php
│   ├── EnsurePlatformCapability.php
│   └── UseSystemSessionCookie.php
├── Jobs/
│   ├── BackfillFindingLifecycleJob.php
│   ├── BackfillFindingLifecycleWorkspaceJob.php
│   └── BackfillFindingLifecycleTenantIntoWorkspaceRunJob.php
├── Providers/Filament/
│   └── SystemPanelProvider.php
├── Services/
│   ├── Alerts/AlertDispatchService.php
│   ├── OperationRunService.php
│   └── Runbooks/FindingsLifecycleBackfillRunbookService.php
└── Support/Auth/
    └── PlatformCapabilities.php

resources/views/filament/system/pages/ops/
├── runbooks.blade.php
├── runs.blade.php
└── view-run.blade.php

tests/Feature/System/
├── Spec113/
└── OpsRunbooks/

Implementation Phases

  1. Foundational security hardening

    • Capability registry additions.
    • 404 vs 403 semantics correctness.
    • System session cookie isolation.
    • System login throttling.
  2. Runbook core service (single source of truth)

    • preflight(scope) + start(scope, initiator, reason, source).
    • Audit events (fail-safe).
    • Locking + idempotency.
  3. Execution pipeline

    • All-tenants orchestration as a workspace-scoped bulk run.
    • Fan-out tenant jobs update shared run counts and completion.
  4. System UI surfaces

    • /system/ops/runbooks (preflight + confirm + start).
    • /system/ops/runs list + /system/ops/runs/{run} detail.
  5. Remove customer-plane exposure

    • Remove/disable /admin maintenance trigger (feature flag default-off) + regression test.
  6. Shared entry points

    • Refactor existing CLI command to call the shared service.
    • Add deploy hook command that calls the same service.
    • Run focused tests + formatting (vendor/bin/sail artisan test --compact + vendor/bin/sail bin pint --dirty).
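
Phase 2 combines three behaviors in the start call: locking, idempotency, and fail-safe audit writes. The sketch below shows how they might interact, using an in-memory map in place of a real lock. Everything here is an assumption for illustration: the class name, the 'queued' status, the event name, and the choice to key the lock on scope. The plan does not say how locking is implemented, and a real version would need a database or cache lock that works across processes.

```php
<?php

// In-memory illustration only; all names are hypothetical.
final class InMemoryRunbookStarter
{
    /** @var array<string, array{id:int, scope:string, status:string}> active runs keyed by scope */
    private array $activeByScope = [];

    private int $nextId = 1;

    public int $auditFailures = 0;

    /** The audit writer receives (event, context) and is allowed to throw. */
    public function __construct(private \Closure $audit)
    {
    }

    /** @return array{id:int, scope:string, status:string, created:bool} */
    public function start(string $scope, string $initiator, string $reason): array
    {
        // Lock + idempotency: a scope with an active run returns that run
        // instead of creating a second one.
        if (isset($this->activeByScope[$scope])) {
            return $this->activeByScope[$scope] + ['created' => false];
        }

        $run = ['id' => $this->nextId++, 'scope' => $scope, 'status' => 'queued'];
        $this->activeByScope[$scope] = $run;

        $this->auditSafely('runbook.started', [
            'run_id' => $run['id'],
            'initiator' => $initiator,
            'reason' => $reason,
        ]);

        return $run + ['created' => true];
    }

    /** Release the scope so a later start creates a fresh run. */
    public function complete(string $scope): void
    {
        unset($this->activeByScope[$scope]);
    }

    private function auditSafely(string $event, array $context): void
    {
        try {
            ($this->audit)($event, $context);
        } catch (\Throwable $e) {
            // Fail-safe: an audit failure is counted but never aborts the run.
            $this->auditFailures++;
        }
    }
}
```

Because the System UI, the CLI command, and the deploy hook would all call the same start method, a duplicate trigger from any entry point resolves to the existing run rather than a second execution.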