Research — Spec 113: Platform Ops Runbooks

This file resolves the design unknowns required to produce an implementation plan that fits the existing TenantAtlas codebase.

Decisions

Decision: Extract a single “runbook service” that is called from:
- /system runbook UI (preflight + start)
- CLI command (tenantpilot:findings:backfill-lifecycle)
- deploy-time hook
Rationale: The repo already contains a correct tenant-scoped implementation:
- Command: app/Console/Commands/TenantpilotBackfillFindingLifecycle.php
- Job: app/Jobs/BackfillFindingLifecycleJob.php
- The implementation uses OperationRunService for lifecycle transitions and idempotency, and takes a cache lock per tenant.
Alternatives considered:
- Build a new pipeline from scratch → rejected because it would duplicate proven behavior and increase drift risk.

Decision: Implement All-tenants as:
1. one workspace-scoped OperationRun (tenant_id = null) created with OperationRunService::ensureWorkspaceRunWithIdentity()
2. fan-out to queued tenant jobs, each of which increments the same workspace run’s summary_counts and records its failures on that run
3. completion via OperationRunService::maybeCompleteBulkRun() when processed >= total (same pattern as workspace backfills)
Rationale:
- This matches an existing proven pattern in the repo (tenantpilot:backfill-workspace-ids + BackfillWorkspaceIdsJob).
- It yields a single “View run” target with meaningful progress, without needing parent/child run stitching.
- Tenant isolation remains intact because each job still operates tenant-scoped and holds the existing per-tenant lock.
Alternatives considered:
- Separate per-tenant OperationRun records + an umbrella run → rejected for v1 due to added coordination complexity.
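
The completion rule for the shared workspace run can be sketched as follows. This is a minimal, language-neutral model in Python, not the repo's PHP code: the WorkspaceRun class, the record_tenant_result and maybe_complete helpers, and the status strings are all hypothetical names chosen for illustration. Only the "processed >= total" rule comes from the decision above. A real implementation would also need atomic counter updates, since tenant jobs run concurrently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkspaceRun:
    """Hypothetical stand-in for the single workspace-scoped run (tenant_id = null)."""
    total: int                      # number of tenant jobs fanned out
    processed: int = 0              # tenant jobs that have finished, ok or not
    failed: int = 0
    failures: List[str] = field(default_factory=list)
    status: str = "running"         # status names are assumptions


def maybe_complete(run: WorkspaceRun) -> None:
    """Finish the run once every tenant job has reported; safe to call repeatedly."""
    if run.status != "running" or run.processed < run.total:
        return
    run.status = "failed" if run.failed else "succeeded"


def record_tenant_result(run: WorkspaceRun, tenant_id: str,
                         error: Optional[str] = None) -> None:
    """Called once per tenant job: update shared counts, then check for completion."""
    run.processed += 1
    if error is not None:
        run.failed += 1
        run.failures.append(f"{tenant_id}: {error}")
    maybe_complete(run)
```

Because every tenant job reports into the same record, the operator sees one progress counter and one terminal state, which is the reason given above for avoiding parent/child run stitching.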

Decision: v1 targets the default workspace (the same workspace that owns the platform Tenant created by PlatformUserSeeder).
Rationale:
- Platform identity currently has no explicit workspace selector in the System panel.
- Existing seeder creates Workspace(slug=default) and a Tenant(external_id=platform) inside it.
Alternatives considered:
- Multi-workspace operator selection in /system → deferred (not in spec, requires new UX + entitlement model).

Decision: Remove, or disable behind a feature flag, the existing /admin header action “Backfill findings lifecycle” currently present in app/Filament/Resources/FindingResource/Pages/ListFindings.php.
Rationale: The spec explicitly forbids customer-plane exposure in production-like environments.
Alternatives considered:
- Keep the action but hide visually → rejected; it still exists as an affordance and is easy to re-enable by accident.

Decision: Add a System-panel-only middleware that sets a dedicated session cookie name for /system/* before StartSession runs.
Rationale:
- SystemPanelProvider defines its own middleware list; we can insert a middleware at the top.
- Changing config(['session.cookie' => ...]) per request is sufficient for cookie separation without introducing a new domain.
Alternatives considered:
- Separate subdomain → deferred (explicitly “later”).
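
The path check such a middleware would perform can be sketched as below. This is an illustrative Python model only: the function name and both cookie names are hypothetical, and the real middleware would be PHP that sets session.cookie before StartSession runs. The point it shows is matching on a path segment boundary, so that a path that merely begins with the letters "system" does not pick up the System cookie.

```python
def session_cookie_name(path: str,
                        default: str = "app_session",        # hypothetical name
                        system: str = "system_session") -> str:  # hypothetical name
    """Return the session cookie name to use for a request path."""
    normalized = "/" + path.lstrip("/")
    # Match "/system" itself or anything below it, but not "/systematic".
    if normalized == "/system" or normalized.startswith("/system/"):
        return system
    return default
```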

Decision: Implement rate limiting inside app/Filament/System/Pages/Auth/Login.php (override authenticate()) using a combined key of ip + normalized(email), limited to 10 attempts per minute.
Rationale:
- The System login already overrides authenticate() to add auditing.
- Implementing rate limiting here keeps the policy tightly scoped to the System login surface.
Alternatives considered:
- Global route middleware throttle → possible, but harder to scope precisely to this Filament auth page.
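
A minimal sketch of the combined-key policy, written in Python for illustration rather than as the Login.php change itself. Assumptions: "normalized" is taken to mean trimmed and lowercased, the limiter uses a simple fixed window, and throttle_key and FixedWindowLimiter are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


def throttle_key(ip: str, email: str) -> str:
    """Combine client IP and normalized email (assumed: trim + lowercase)."""
    return f"{ip}|{email.strip().lower()}"


class FixedWindowLimiter:
    """Allow at most max_attempts per key within each window."""

    def __init__(self, max_attempts: int = 10, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_attempts
        self._window = window_seconds
        self._clock = clock
        self._windows: Dict[str, Tuple[float, int]] = {}  # key -> (start, count)

    def hit(self, key: str) -> bool:
        """Record an attempt; return False if the key is over its limit."""
        now = self._clock()
        started, count = self._windows.get(key, (now, 0))
        if now - started >= self._window:
            started, count = now, 0   # window expired, start a fresh one
        if count >= self._max:
            return False
        self._windows[key] = (started, count + 1)
        return True
```

Keying on both values means one address cycling through many emails and many addresses targeting one email are each counted per pair; whether that trade-off is acceptable is a policy question the spec would need to settle.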

Decision: Keep cross-plane denial as 404 (existing EnsureCorrectGuard), but return 403 when a platform user lacks the required capability.
Rationale:
- Spec requires: wrong plane → 404; platform lacking capability → 403.
- EnsurePlatformCapability currently aborts with 404, which conflicts with the spec.
Alternatives considered:
- Return 404 for missing platform capability → rejected because it contradicts the agreed spec.
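
The intended status mapping can be expressed as a small decision function. This is a Python sketch under assumptions: the guard name "platform", the capability string used in the example, and the function name are all hypothetical, and in the codebase the two checks live in separate middleware (EnsureCorrectGuard and EnsurePlatformCapability).

```python
from typing import Set


def system_access_status(guard: str, capabilities: Set[str], required: str) -> int:
    """Return the HTTP status for a /system request."""
    if guard != "platform":           # wrong plane: do not reveal the route exists
        return 404
    if required not in capabilities:  # right plane, missing capability
        return 403
    return 200
```

The order matters: the plane check runs first, so a customer-plane user never learns from a 403 that a System route exists.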

Decision: On run failure, emit:
1. the canonical terminal DB notification (OperationRunCompleted) to the initiating platform operator (in-app)
2. an Alerts event (Teams / Email) if alert routing is configured
Rationale:
- Alerts system already exists (AlertDispatchService + queued deliveries). It can route to Teams webhook / Email.
- OperationRunCompleted already formats the correct persistent DB notification payload via OperationUxPresenter.
Alternatives considered:
- Send Teams webhook directly from job → rejected; bypasses alert rules/cooldowns/quiet hours.

Notes

Platform capabilities must be defined in the registry (app/Support/Auth/PlatformCapabilities.php) and referenced via constants.
The System panel does not currently call ->databaseNotifications(); it must be enabled there for platform operators to receive the in-app OperationRunCompleted notification.
OperationRun.user_id cannot point to platform_users; use context fields to record platform initiator metadata.