TenantAtlas/specs/113-platform-ops-runbooks/research.md
ahmido 200498fa8e feat(113): Platform Ops Runbooks — UX Polish (Filament-native, system theme, live scope) (#137)
## Summary

Implements and polishes the Platform Ops Runbooks feature (Spec 113) — the operator control plane for safe backfills and data repair from `/system`.

## Changes

### UX Polish (Phase 7 — US4)
- **Filament-native components**: Rewrote `runbooks.blade.php` and `view-run.blade.php` using `<x-filament::section>` instead of raw Tailwind div cards. Cards now render correctly with Filament's built-in borders, shadows and dark mode.
- **System panel theme**: Created `resources/css/filament/system/theme.css` and registered `->viteTheme()` on `SystemPanelProvider`. The system panel previously had no theme CSS registered — Tailwind utility classes weren't compiled for its views, causing the warning icon SVG to expand to full container size.
- **Live scope selector**: Added `->live()` to the scope `Radio` field so "Single tenant" immediately reveals the tenant search dropdown without requiring a Submit first.

### Core Feature (Phases 1–6, previously shipped)
- `/system/ops/runbooks` — runbook catalog, preflight, run with typed confirmation + reason
- `/system/ops/runs` — run history table with status/outcome badges
- `/system/ops/runs/{id}` — run detail view with summary counts, failures, collapsible context
- `FindingsLifecycleBackfillRunbookService` — preflight + execution logic
- AllowedTenantUniverse — scopes tenant picker to non-platform tenants only
- RBAC: `platform.ops.view`, `platform.runbooks.view`, `platform.runbooks.run`, `platform.runbooks.findings.lifecycle_backfill`
- Rate-limited `/system/login` (10/min per IP+username)
- Distinct session cookie for `/system` isolation

## Test Coverage
- 16 tests / 141 assertions — all passing
- Covers: page access, RBAC, preflight, run dispatch, scope selector, run detail, run list

## Checklist
- [x] Filament v5 / Livewire v4 compliant
- [x] Provider registered in `bootstrap/providers.php`
- [x] Destructive actions require confirmation (`->requiresConfirmation()`)
- [x] System panel theme registered (`viteTheme`)
- [x] Pint clean
- [x] Tests pass

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #137
2026-02-27 01:11:25 +00:00


Research — Spec 113: Platform Ops Runbooks

This file resolves the design unknowns required to produce an implementation plan that fits the existing TenantAtlas codebase.

Decisions

1) Reuse existing backfill pipeline (Command + Job) via a single service

  • Decision: Extract a single “runbook service” that is called from:
    • /system runbook UI (preflight + start)
    • CLI command (tenantpilot:findings:backfill-lifecycle)
    • deploy-time hook
  • Rationale: The repo already contains a correct tenant-scoped implementation:
    • Command: app/Console/Commands/TenantpilotBackfillFindingLifecycle.php
    • Job: app/Jobs/BackfillFindingLifecycleJob.php
    • The existing pipeline uses OperationRunService for lifecycle transitions and idempotency, and takes a cache lock per tenant.
  • Alternatives considered:
    • Build a new pipeline from scratch → rejected as it duplicates proven behavior and increases drift risk.

2) “All tenants” scope uses a single workspace run updated by many tenant jobs

  • Decision: Implement All-tenants as:
    1. one workspace-scoped OperationRun (tenant_id = null) created with OperationRunService::ensureWorkspaceRunWithIdentity()
    2. fan-out to many queued tenant jobs that all increment the same workspace run's summary_counts and contribute failures
    3. completion via OperationRunService::maybeCompleteBulkRun() when processed >= total (same pattern as workspace backfills)
  • Rationale:
    • This matches an existing proven pattern in the repo (tenantpilot:backfill-workspace-ids + BackfillWorkspaceIdsJob).
    • It yields a single “View run” target with meaningful progress, without needing parent/child run stitching.
    • Tenant isolation remains intact because each job still operates tenant-scoped and holds the existing per-tenant lock.
  • Alternatives considered:
    • Separate per-tenant OperationRun records + an umbrella run → rejected for v1 due to added coordination complexity.
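The fan-out and completion rule in this decision can be modelled without the framework. The sketch below is a minimal Python model, not the TenantAtlas implementation: `WorkspaceRun`, `record_tenant_result`, `maybe_complete`, `tenant_job` and `run_all_tenants` are hypothetical names standing in for the OperationRun record, the queued tenant jobs and OperationRunService::maybeCompleteBulkRun(). Threads and an in-process lock stand in for queue workers and whatever atomic counter update the real service uses.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class WorkspaceRun:
    """Hypothetical stand-in for the single workspace-scoped run (tenant_id = None)."""
    total: int
    processed: int = 0
    succeeded: int = 0
    failures: list = field(default_factory=list)
    status: str = "running"
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def record_tenant_result(self, tenant_id, error=None):
        """Called exactly once by each tenant job when it finishes."""
        with self._lock:
            self.processed += 1
            if error is None:
                self.succeeded += 1
            else:
                self.failures.append({"tenant_id": tenant_id, "message": error})
            self._complete_if_done()

    def maybe_complete(self):
        """Public completion check, safe to call at any time."""
        with self._lock:
            self._complete_if_done()

    def _complete_if_done(self):
        # Completion rule from the decision: finish once processed >= total.
        if self.status == "running" and self.processed >= self.total:
            self.status = "completed"


def tenant_job(run, tenant_id, backfill_one):
    """One fanned-out job: do tenant-scoped work, then report to the shared run."""
    try:
        backfill_one(tenant_id)
    except Exception as exc:
        run.record_tenant_result(tenant_id, error=str(exc))
    else:
        run.record_tenant_result(tenant_id)


def run_all_tenants(tenant_ids, backfill_one):
    """Create one shared run and fan out one job per tenant."""
    run = WorkspaceRun(total=len(tenant_ids))
    run.maybe_complete()  # an empty tenant list completes immediately
    workers = [
        threading.Thread(target=tenant_job, args=(run, tenant_id, backfill_one))
        for tenant_id in tenant_ids
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return run
```

In this model a failing tenant still counts as processed, so one bad tenant cannot leave the shared run stuck in "running"; whether the real service treats failures the same way is an assumption to verify against OperationRunService.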

3) Workspace scope for /system runbooks (v1)

  • Decision: v1 targets the default workspace (the same workspace that owns the platform Tenant created by PlatformUserSeeder).
  • Rationale:
    • Platform identity currently has no explicit workspace selector in the System panel.
    • Existing seeder creates Workspace(slug=default) and a Tenant(external_id=platform) inside it.
  • Alternatives considered:
    • Multi-workspace operator selection in /system → deferred (not in spec, requires new UX + entitlement model).

4) Remove/disable /admin maintenance action (FR-001)

  • Decision: Remove or feature-flag off the existing /admin header action “Backfill findings lifecycle” currently present in app/Filament/Resources/FindingResource/Pages/ListFindings.php.
  • Rationale: Spec explicitly forbids customer-plane exposure in production-like environments.
  • Alternatives considered:
    • Keep the action but hide visually → rejected; it still exists as an affordance and is easy to re-enable by accident.

5) Session isolation for /system (SR-004)

  • Decision: Add a System-panel-only middleware that sets a dedicated session cookie name for /system/* before StartSession runs.
  • Rationale:
    • SystemPanelProvider defines its own middleware list; we can insert a middleware at the top.
    • Changing config(['session.cookie' => ...]) per request is sufficient for cookie separation without introducing a new domain.
  • Alternatives considered:
    • Separate subdomain → deferred (explicitly “later”).
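Why the cookie-name middleware has to run before StartSession can be shown with a small framework-free model. This Python sketch is illustrative only: the cookie names, the `use_system_session_cookie`, `start_session` and `handle` functions, and the dict-based config are hypothetical stand-ins for Laravel's config repository and the panel middleware list.

```python
DEFAULT_COOKIE = "app_session"        # hypothetical default cookie name
SYSTEM_COOKIE = "app_system_session"  # hypothetical /system cookie name


def use_system_session_cookie(request, config):
    """Models the System-panel-only middleware: override the cookie name."""
    path = request["path"]
    if path == "/system" or path.startswith("/system/"):
        config["session.cookie"] = SYSTEM_COOKIE


def start_session(request, config):
    """Models StartSession: it reads the cookie name at the moment it runs."""
    request["session_cookie"] = config["session.cookie"]


def handle(path, middleware):
    """Run one request through an ordered middleware list with fresh config."""
    request = {"path": path}
    config = {"session.cookie": DEFAULT_COOKIE}
    for layer in middleware:
        layer(request, config)
    return request["session_cookie"]
```

With the override placed first, `handle("/system/ops/runs", [use_system_session_cookie, start_session])` yields the system cookie name; with the order reversed, the session starts under the default name and the override arrives too late.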

6) /system/login rate limiting (SR-003)

  • Decision: Implement rate limiting inside app/Filament/System/Pages/Auth/Login.php (override authenticate()) using a combined key of ip + normalized(email), limited to 10 attempts per minute.
  • Rationale:
    • The System login already overrides authenticate() to add auditing.
    • Implementing rate limiting here keeps the policy tightly scoped to the System login surface.
  • Alternatives considered:
    • Global route middleware throttle → possible, but harder to scope precisely to this Filament auth page.
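The combined-key policy can be sketched as a fixed-window limiter. This is a Python model under stated assumptions, not the Login.php code: `LoginThrottle` is a hypothetical class, trim-plus-lowercase is only a guess at what normalized(email) means, and a fixed window is one of several ways to enforce 10 attempts per minute.

```python
import time


class LoginThrottle:
    """Fixed-window limiter: at most max_attempts per window_seconds per key."""

    def __init__(self, max_attempts=10, window_seconds=60, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested without sleeping
        self._windows = {}  # key -> (window_start, attempts_in_window)

    @staticmethod
    def key(ip, email):
        # Assumption: normalization means trimming whitespace and lowercasing.
        return f"{ip}|{email.strip().lower()}"

    def allow(self, ip, email):
        """Record one attempt; return False when the key is over the limit."""
        now = self.clock()
        key = self.key(ip, email)
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # previous window expired
        if count >= self.max_attempts:
            return False
        self._windows[key] = (start, count + 1)
        return True
```

Keying on both values means one address cycling through many emails, or many addresses targeting one email, each get their own counter; that trade-off follows from the combined key and may or may not be the intended threat model.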

7) 404 vs 403 semantics for platform capability checks (SR-002)

  • Decision: Keep cross-plane denial as 404 (existing EnsureCorrectGuard), but missing platform capability should return 403.
  • Rationale:
    • Spec requires: wrong plane → 404; platform lacking capability → 403.
    • EnsurePlatformCapability currently calls abort(404), which conflicts with the spec.
  • Alternatives considered:
    • Return 404 for missing platform capability → rejected because it contradicts the agreed spec.
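The two denial paths reduce to a small decision function. The Python sketch below models only the status-code rule; `authorize_system_request` and the guard name "platform" are hypothetical, and the capability strings are borrowed from the RBAC list for illustration.

```python
from http import HTTPStatus


def authorize_system_request(guard, capabilities, required_capability):
    """Return the HTTP status a /system request should be answered with."""
    if guard != "platform":
        # Wrong plane: cross-plane denial stays a 404.
        return HTTPStatus.NOT_FOUND
    if required_capability not in capabilities:
        # Right plane, missing capability: the spec asks for a 403.
        return HTTPStatus.FORBIDDEN
    return HTTPStatus.OK
```

The guard check comes first on purpose: a non-platform user gets 404 regardless of which capability the route requires, and 403 is reachable only for authenticated platform users.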

8) Failure notifications (FR-009)

  • Decision: On run failure, emit:
    1. the canonical terminal DB notification (OperationRunCompleted) to the initiating platform operator (in-app)
    2. an Alerts event (Teams / Email) if alert routing is configured
  • Rationale:
    • Alerts system already exists (AlertDispatchService + queued deliveries). It can route to Teams webhook / Email.
    • OperationRunCompleted already formats the correct persistent DB notification payload via OperationUxPresenter.
  • Alternatives considered:
    • Send Teams webhook directly from job → rejected; bypasses alert rules/cooldowns/quiet hours.

Notes for implementation

  • Platform capabilities must be defined in the registry (app/Support/Auth/PlatformCapabilities.php) and referenced via constants.
  • The System panel currently does not call ->databaseNotifications(). If we want in-app notifications for platform operators, add it.
  • OperationRun.user_id cannot point to platform_users; use context fields to record platform initiator metadata.