# Research — Spec 113: Platform Ops Runbooks

This file resolves the design unknowns required to produce an implementation plan that fits the existing TenantAtlas codebase.

## Decisions

### 1) Reuse existing backfill pipeline (Command + Job) via a single service

- **Decision**: Extract a single “runbook service” that is called from:
  - `/system` runbook UI (preflight + start)
  - CLI command (`tenantpilot:findings:backfill-lifecycle`)
  - deploy-time hook
- **Rationale**: The repo already contains a correct tenant-scoped implementation:
  - Command: `app/Console/Commands/TenantpilotBackfillFindingLifecycle.php`
  - Job: `app/Jobs/BackfillFindingLifecycleJob.php`
  - It uses `OperationRunService` for lifecycle transitions and idempotency, and a cache lock per tenant.
- **Alternatives considered**:
  - Build a new pipeline from scratch → rejected as it duplicates proven behavior and increases drift risk.

### 2) “All tenants” scope uses a single workspace run updated by many tenant jobs

- **Decision**: Implement All-tenants as:
  1) one **workspace-scoped** `OperationRun` (tenant_id = null) created with `OperationRunService::ensureWorkspaceRunWithIdentity()`
  2) fan-out to many queued tenant jobs that all **increment the same workspace run’s** `summary_counts` and contribute failures
  3) completion via `OperationRunService::maybeCompleteBulkRun()` when `processed >= total` (same pattern as workspace backfills)
- **Rationale**:
  - This matches an existing proven pattern in the repo (`tenantpilot:backfill-workspace-ids` + `BackfillWorkspaceIdsJob`).
  - It yields a single “View run” target with meaningful progress, without needing parent/child run stitching.
  - Tenant isolation remains intact because each job still operates tenant-scoped and holds the existing per-tenant lock.
- **Alternatives considered**:
  - Separate per-tenant `OperationRun` records + an umbrella run → rejected for v1 due to added coordination complexity.
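
The completion rule in decision 2 (many tenant jobs updating one shared run, which completes once `processed >= total`) can be sketched in plain PHP. This is only an illustration under stated assumptions: `WorkspaceRunSketch`, its status strings, and the key names inside the counts array are hypothetical and do not reproduce the actual `OperationRun` model or `OperationRunService` API. A real implementation would also need atomic increments, since tenant jobs run concurrently.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a workspace-scoped OperationRun row.
// The real model and OperationRunService are NOT reproduced here.
final class WorkspaceRunSketch
{
    /** @var array{total:int, processed:int, succeeded:int, failed:int} */
    public array $summaryCounts;

    /** @var list<array{tenant_id:int, message:string}> */
    public array $failures = [];

    // Status strings are assumptions made for this sketch.
    public string $status = 'running';

    public function __construct(int $totalTenants)
    {
        $this->summaryCounts = [
            'total' => $totalTenants,
            'processed' => 0,
            'succeeded' => 0,
            'failed' => 0,
        ];
    }

    // Called once by each tenant job when it finishes.
    // In production this update would have to be atomic (jobs run in parallel).
    public function recordTenantResult(int $tenantId, ?string $error = null): void
    {
        $this->summaryCounts['processed']++;

        if ($error === null) {
            $this->summaryCounts['succeeded']++;
        } else {
            $this->summaryCounts['failed']++;
            $this->failures[] = ['tenant_id' => $tenantId, 'message' => $error];
        }

        $this->maybeComplete();
    }

    // Mirrors the "processed >= total" rule: the last job to report closes the run.
    private function maybeComplete(): void
    {
        if ($this->status !== 'running') {
            return;
        }

        if ($this->summaryCounts['processed'] >= $this->summaryCounts['total']) {
            $this->status = $this->summaryCounts['failed'] > 0 ? 'failed' : 'succeeded';
        }
    }
}

// Usage: three tenant jobs report into the same run.
$run = new WorkspaceRunSketch(3);
$run->recordTenantResult(1);
$run->recordTenantResult(2, 'lock timeout');
$statusAfterTwo = $run->status;   // the run is still open here
$run->recordTenantResult(3);
$statusAfterThree = $run->status; // the third report closes the run
```

Because every job writes to the same record, the “View run” page only has to read one set of counters, which is the main appeal of this option over parent/child run stitching.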

### 3) Workspace scope for /system runbooks (v1)

- **Decision**: v1 targets the **default workspace** (same workspace that owns the `platform` Tenant created by `PlatformUserSeeder`).
- **Rationale**:
  - Platform identity currently has no explicit workspace selector in the System panel.
  - Existing seeder creates `Workspace(slug=default)` and a `Tenant(external_id=platform)` inside it.
- **Alternatives considered**:
  - Multi-workspace operator selection in `/system` → deferred (not in spec, requires new UX + entitlement model).

### 4) Remove/disable `/admin` maintenance action (FR-001)

- **Decision**: Remove or feature-flag off the existing `/admin` header action “Backfill findings lifecycle” currently present in `app/Filament/Resources/FindingResource/Pages/ListFindings.php`.
- **Rationale**: Spec explicitly forbids customer-plane exposure in production-like environments.
- **Alternatives considered**:
  - Keep the action but hide visually → rejected; it still exists as an affordance and is easy to re-enable by accident.

### 5) Session isolation for `/system` (SR-004)

- **Decision**: Add a System-panel-only middleware that sets a dedicated session cookie name for `/system/*` **before** `StartSession` runs.
- **Rationale**:
  - `SystemPanelProvider` defines its own middleware list; we can insert a middleware at the top.
  - Changing `config(['session.cookie' => ...])` per request is sufficient for cookie separation without introducing a new domain.
- **Alternatives considered**:
  - Separate subdomain → deferred (explicitly “later”).

### 6) `/system/login` rate limiting (SR-003)

- **Decision**: Implement rate limiting inside `app/Filament/System/Pages/Auth/Login.php` (override `authenticate()`) using a combined key: `ip + normalized(email)` at 10/min.
- **Rationale**:
  - The System login already overrides `authenticate()` to add auditing.
  - Implementing rate limiting here keeps the policy tightly scoped to the System login surface.
- **Alternatives considered**:
  - Global route middleware throttle → possible, but harder to scope precisely to this Filament auth page.

### 7) 404 vs 403 semantics for platform capability checks (SR-002)

- **Decision**: Keep cross-plane denial as **404** (existing `EnsureCorrectGuard`), but missing platform capability should return **403**.
- **Rationale**:
  - Spec requires: wrong plane → 404; platform lacking capability → 403.
  - Current `EnsurePlatformCapability` aborts with 404, which conflicts with the spec.
- **Alternatives considered**:
  - Return 404 for missing platform capability → rejected because it contradicts the agreed spec.

### 8) Failure notifications (FR-009)

- **Decision**: On run failure, emit:
  1) the canonical terminal DB notification (`OperationRunCompleted`) to the initiating platform operator (in-app)
  2) an Alerts event (Teams / Email) **if alert routing is configured**
- **Rationale**:
  - Alerts system already exists (`AlertDispatchService` + queued deliveries). It can route to Teams webhook / Email.
  - `OperationRunCompleted` already formats the correct persistent DB notification payload via `OperationUxPresenter`.
- **Alternatives considered**:
  - Send Teams webhook directly from job → rejected; bypasses alert rules/cooldowns/quiet hours.

## Notes for implementation

- Platform capabilities must be defined in the registry (`app/Support/Auth/PlatformCapabilities.php`) and referenced via constants.
- The System panel currently does not call `->databaseNotifications()`. If we want in-app notifications for platform operators, add it.
- `OperationRun.user_id` cannot point to `platform_users`; use `context` fields to record platform initiator metadata.
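
The combined throttle key from decision 6 (`ip + normalized(email)` at 10/min) can be illustrated with a small standalone PHP sketch. Everything named here is an assumption for illustration: `systemLoginThrottleKey()`, the `system-login:` prefix, the trim-and-lowercase normalization, and the in-memory `FixedWindowLimiterSketch` are hypothetical. The actual `authenticate()` override would presumably delegate counting to the framework's rate limiter rather than to a class like this.

```php
<?php

declare(strict_types=1);

// Hypothetical helper: builds the combined key "ip + normalized(email)".
// Normalization here (trim + lowercase) is an assumption, not the spec's definition.
function systemLoginThrottleKey(string $ip, string $email): string
{
    $normalizedEmail = strtolower(trim($email));

    return 'system-login:' . $ip . '|' . $normalizedEmail;
}

// In-memory fixed-window counter, only to make the 10/min policy concrete.
final class FixedWindowLimiterSketch
{
    /** @var array<string, array{windowStart:int, hits:int}> */
    private array $windows = [];

    public function __construct(
        private int $maxAttempts = 10,
        private int $windowSeconds = 60,
    ) {
    }

    // Returns true and counts the attempt if it is allowed;
    // returns false once the limit for the current window is reached.
    public function attempt(string $key, int $now): bool
    {
        $window = $this->windows[$key] ?? null;

        // Start a fresh window on first use or after the previous one expired.
        if ($window === null || $now - $window['windowStart'] >= $this->windowSeconds) {
            $window = ['windowStart' => $now, 'hits' => 0];
        }

        if ($window['hits'] >= $this->maxAttempts) {
            $this->windows[$key] = $window;

            return false;
        }

        $window['hits']++;
        $this->windows[$key] = $window;

        return true;
    }
}

// Usage: the same operator from the same IP, with inconsistent email casing.
$limiter = new FixedWindowLimiterSketch();
$key = systemLoginThrottleKey('203.0.113.7', '  Ops@Example.com ');

$allowed = 0;
for ($i = 0; $i < 11; $i++) {
    if ($limiter->attempt($key, 0)) {
        $allowed++;
    }
}
// Ten attempts pass within the window; the eleventh is rejected.
$allowedAfterWindow = $limiter->attempt($key, 60); // a new window has started
```

Keying on both values means one noisy IP cannot lock out every account, and one targeted account cannot be hammered indefinitely from a single address, while the limit stays confined to the System login page.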