ahmido 200498fa8e feat(113): Platform Ops Runbooks — UX Polish (Filament-native, system theme, live scope) (#137)

## Summary

Implements and polishes the Platform Ops Runbooks feature (Spec 113) — the operator control plane for safe backfills and data repair from `/system`.

## Changes

### UX Polish (Phase 7 — US4)
- **Filament-native components**: Rewrote `runbooks.blade.php` and `view-run.blade.php` using `<x-filament::section>` instead of raw Tailwind div cards. Cards now render correctly with Filament's built-in borders, shadows and dark mode.
- **System panel theme**: Created `resources/css/filament/system/theme.css` and registered `->viteTheme()` on `SystemPanelProvider`. The system panel previously had no theme CSS registered — Tailwind utility classes weren't compiled for its views, causing the warning icon SVG to expand to full container size.
- **Live scope selector**: Added `->live()` to the scope `Radio` field so "Single tenant" immediately reveals the tenant search dropdown without requiring a Submit first.

### Core Feature (Phases 1–6, previously shipped)
- `/system/ops/runbooks` — runbook catalog, preflight, run with typed confirmation + reason
- `/system/ops/runs` — run history table with status/outcome badges
- `/system/ops/runs/{id}` — run detail view with summary counts, failures, collapsible context
- `FindingsLifecycleBackfillRunbookService` — preflight + execution logic
- `AllowedTenantUniverse` — scopes tenant picker to non-platform tenants only
- RBAC: `platform.ops.view`, `platform.runbooks.view`, `platform.runbooks.run`, `platform.runbooks.findings.lifecycle_backfill`
- Rate-limited `/system/login` (10/min per IP+username)
- Distinct session cookie for `/system` isolation

## Test Coverage
- 16 tests / 141 assertions — all passing
- Covers: page access, RBAC, preflight, run dispatch, scope selector, run detail, run list

## Checklist
- [x] Filament v5 / Livewire v4 compliant
- [x] Provider registered in `bootstrap/providers.php`
- [x] Destructive actions require confirmation (`->requiresConfirmation()`)
- [x] System panel theme registered (`viteTheme`)
- [x] Pint clean
- [x] Tests pass

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #137

2026-02-27 01:11:25 +00:00

Research — Spec 113: Platform Ops Runbooks

This file resolves the design unknowns required to produce an implementation plan that fits the existing TenantAtlas codebase.

Decisions

1) Reuse existing backfill pipeline (Command + Job) via a single service

Decision: Extract a single “runbook service” that is called from:
- /system runbook UI (preflight + start)
- CLI command (tenantpilot:findings:backfill-lifecycle)
- deploy-time hook
Rationale: The repo already contains a correct tenant-scoped implementation:
- Command: app/Console/Commands/TenantpilotBackfillFindingLifecycle.php
- Job: app/Jobs/BackfillFindingLifecycleJob.php
- It uses OperationRunService for lifecycle transitions and idempotency, and a cache lock per tenant.
Alternatives considered:
- Build a new pipeline from scratch → rejected as it duplicates proven behavior and increases drift risk.

2) “All tenants” scope uses a single workspace run updated by many tenant jobs

Decision: Implement All-tenants as:
1. one workspace-scoped OperationRun (tenant_id = null) created with OperationRunService::ensureWorkspaceRunWithIdentity()
2. fan-out to many queued tenant jobs that all increment the same workspace run’s summary_counts and contribute failures
3. completion via OperationRunService::maybeCompleteBulkRun() when processed >= total (same pattern as workspace backfills)
Rationale:
- This matches an existing proven pattern in the repo (tenantpilot:backfill-workspace-ids + BackfillWorkspaceIdsJob).
- It yields a single “View run” target with meaningful progress, without needing parent/child run stitching.
- Tenant isolation remains intact because each job still operates tenant-scoped and holds the existing per-tenant lock.
Alternatives considered:
- Separate per-tenant OperationRun records + an umbrella run → rejected for v1 due to added coordination complexity.
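
The completion rule in this decision can be modeled in a few lines. The sketch below is a framework-free Python model, not the Laravel code: `WorkspaceRun` and `record_tenant_result` are hypothetical stand-ins for the workspace-scoped OperationRun and for whatever each tenant job calls, and the outcome labels are assumptions. The in-process lock stands in for whatever atomic increment the real service uses. It only illustrates that many tenant jobs update one shared run, which completes once processed >= total.

```python
from dataclasses import dataclass, field
from threading import Lock
from typing import Optional


@dataclass
class WorkspaceRun:
    """Hypothetical model of one workspace-scoped run (tenant_id = None)."""
    total: int                      # number of tenant jobs fanned out
    processed: int = 0
    succeeded: int = 0
    failed: int = 0
    failures: list = field(default_factory=list)
    status: str = "running"
    outcome: Optional[str] = None   # assumed labels: "succeeded" / "failed"
    _lock: object = field(default_factory=Lock, repr=False, compare=False)

    def record_tenant_result(self, tenant_id: str, error: Optional[str] = None) -> None:
        """Called once by each tenant job when it finishes."""
        with self._lock:
            self.processed += 1
            if error is None:
                self.succeeded += 1
            else:
                self.failed += 1
                self.failures.append({"tenant_id": tenant_id, "message": error})
            self._maybe_complete()

    def _maybe_complete(self) -> None:
        # Mirrors the "processed >= total" rule; completes at most once.
        if self.status == "running" and self.processed >= self.total:
            self.status = "completed"
            self.outcome = "succeeded" if self.failed == 0 else "failed"


run = WorkspaceRun(total=3)
run.record_tenant_result("tenant-a")
run.record_tenant_result("tenant-b", error="lock timeout")
run.record_tenant_result("tenant-c")
```

Because every job writes to the same record, the run detail page has a single source for progress and failures, which is the property that made the umbrella-run alternative unnecessary for v1.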

3) Workspace scope for /system runbooks (v1)

Decision: v1 targets the default workspace (the same workspace that owns the platform Tenant created by PlatformUserSeeder).
Rationale:
- Platform identity currently has no explicit workspace selector in the System panel.
- Existing seeder creates Workspace(slug=default) and a Tenant(external_id=platform) inside it.
Alternatives considered:
- Multi-workspace operator selection in /system → deferred (not in spec, requires new UX + entitlement model).

4) Remove/disable `/admin` maintenance action (FR-001)

Decision: Remove or feature-flag off the existing /admin header action “Backfill findings lifecycle” currently present in app/Filament/Resources/FindingResource/Pages/ListFindings.php.
Rationale: Spec explicitly forbids customer-plane exposure in production-like environments.
Alternatives considered:
- Keep the action but hide visually → rejected; it still exists as an affordance and is easy to re-enable by accident.

5) Session isolation for `/system` (SR-004)

Decision: Add a System-panel-only middleware that sets a dedicated session cookie name for /system/* before StartSession runs.
Rationale:
- SystemPanelProvider defines its own middleware list; we can insert a middleware at the top.
- Changing config(['session.cookie' => ...]) per request is sufficient for cookie separation without introducing a new domain.
Alternatives considered:
- Separate subdomain → deferred (explicitly “later”).
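
The cookie selection itself is a small path check. The following Python sketch models only that decision, under stated assumptions: `session_cookie_name`, the default name `app_session`, and the `_system` suffix are all hypothetical, and the real middleware would set the session cookie config value rather than return a string.

```python
SYSTEM_PREFIX = "/system"


def session_cookie_name(path: str, default_name: str = "app_session") -> str:
    """Pick the session cookie name for a request path (hypothetical helper).

    Only /system and paths below it get the dedicated cookie; a path such as
    /systematic must not match, so the check is on a full path segment.
    """
    is_system = path == SYSTEM_PREFIX or path.startswith(SYSTEM_PREFIX + "/")
    return f"{default_name}_system" if is_system else default_name


system_cookie = session_cookie_name("/system/ops/runbooks")
admin_cookie = session_cookie_name("/admin/findings")
```

The decision has to run before the session is started, which is why the text above places the middleware at the top of the System panel's middleware list.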

6) `/system/login` rate limiting (SR-003)

Decision: Implement rate limiting inside app/Filament/System/Pages/Auth/Login.php (override authenticate()) using a combined key: ip + normalized(email) at 10/min.
Rationale:
- The System login already overrides authenticate() to add auditing.
- Implementing rate limiting here keeps the policy tightly scoped to the System login surface.
Alternatives considered:
- Global route middleware throttle → possible, but harder to scope precisely to this Filament auth page.
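
As an illustration of the combined key and the 10/min limit, here is a minimal fixed-window limiter in Python. It is a sketch, not the Filament implementation: `throttle_key` and `FixedWindowLimiter` are hypothetical names, the key prefix is made up, and "normalized" is assumed to mean trimmed and lowercased, which this file does not specify.

```python
import time

MAX_ATTEMPTS = 10      # from the decision: 10 attempts
WINDOW_SECONDS = 60    # per minute


def throttle_key(ip: str, email: str) -> str:
    """Combine IP and normalized email (assumed: strip + lowercase)."""
    return f"system-login:{ip}:{email.strip().lower()}"


class FixedWindowLimiter:
    """In-memory fixed-window counter; the clock is injectable for testing."""

    def __init__(self, max_attempts=MAX_ATTEMPTS, window=WINDOW_SECONDS,
                 clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window = window
        self.clock = clock
        self._hits = {}  # key -> (window_start, attempt_count)

    def attempt(self, key: str) -> bool:
        """Record an attempt; return False if the key is over its limit."""
        now = self.clock()
        start, count = self._hits.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0          # window expired, start a new one
        if count >= self.max_attempts:
            return False                   # rejected attempts are not counted
        self._hits[key] = (start, count + 1)
        return True


limiter = FixedWindowLimiter()
key = throttle_key("203.0.113.7", "  Ops@Example.com ")
allowed = limiter.attempt(key)
```

Keying on both values means one noisy IP does not lock out every operator, and one targeted account cannot be brute-forced indefinitely from a single address.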

7) 404 vs 403 semantics for platform capability checks (SR-002)

Decision: Keep cross-plane denial as 404 (existing EnsureCorrectGuard), but missing platform capability should return 403.
Rationale:
- Spec requires: wrong plane → 404; platform lacking capability → 403.
- EnsurePlatformCapability currently aborts with 404, which conflicts with the spec.
Alternatives considered:
- Return 404 for missing platform capability → rejected because it contradicts the agreed spec.
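
The ordering of the two checks is what produces the required semantics. This Python sketch models it with a hypothetical `resolve_status` function and boolean inputs; in the codebase the checks live in separate middleware (EnsureCorrectGuard and EnsurePlatformCapability), whose internals are not shown here.

```python
def resolve_status(is_platform_user: bool, has_capability: bool) -> int:
    """Return the HTTP status for a /system request (hypothetical model).

    The plane check runs first so a non-platform user never learns that the
    route exists; only platform users can see a 403.
    """
    if not is_platform_user:
        return 404   # wrong plane: behave as if the route does not exist
    if not has_capability:
        return 403   # right plane, missing capability: explicit denial
    return 200


wrong_plane = resolve_status(is_platform_user=False, has_capability=False)
missing_capability = resolve_status(is_platform_user=True, has_capability=False)
```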

8) Failure notifications (FR-009)

Decision: On run failure, emit:
1. the canonical terminal DB notification (OperationRunCompleted) to the initiating platform operator (in-app)
2. an Alerts event (Teams / Email) if alert routing is configured
Rationale:
- Alerts system already exists (AlertDispatchService + queued deliveries). It can route to Teams webhook / Email.
- OperationRunCompleted already formats the correct persistent DB notification payload via OperationUxPresenter.
Alternatives considered:
- Send Teams webhook directly from job → rejected; bypasses alert rules/cooldowns/quiet hours.
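
The two-part emission rule can be sketched as a pure function. Everything named here is an assumption made for illustration: `failure_notifications`, the dictionary shape, and the `operation_run_failed` event name do not come from the codebase. Only the rule itself does: always notify the initiator in-app, and raise an alert only when routing is configured.

```python
def failure_notifications(outcome: str, initiator_id: int,
                          alert_routing_configured: bool) -> list:
    """List what to emit when a run finishes (hypothetical model)."""
    if outcome != "failed":
        return []
    emitted = [
        # 1. Canonical terminal DB notification for the initiating operator.
        {"channel": "database", "type": "OperationRunCompleted", "to": initiator_id},
    ]
    if alert_routing_configured:
        # 2. Alerts event; the alert system decides on Teams / Email delivery.
        emitted.append({"channel": "alerts", "type": "operation_run_failed"})
    return emitted


with_routing = failure_notifications("failed", initiator_id=7,
                                     alert_routing_configured=True)
```

Handing the second step to the alert system instead of posting to a webhook from the job keeps alert rules, cooldowns, and quiet hours in effect.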

Notes for implementation

- Platform capabilities must be defined in the registry (app/Support/Auth/PlatformCapabilities.php) and referenced via constants.
- The System panel currently does not call ->databaseNotifications(). If we want in-app notifications for platform operators, add it.
- OperationRun.user_id cannot point to platform_users; use context fields to record platform initiator metadata.
