# Research — Spec 113: Platform Ops Runbooks
This file resolves the design unknowns required to produce an implementation plan that fits the existing TenantAtlas codebase.
## Decisions
### 1) Reuse existing backfill pipeline (Command + Job) via a single service
- **Decision**: Extract a single “runbook service” that is called from:
  - `/system` runbook UI (preflight + start)
  - CLI command (`tenantpilot:findings:backfill-lifecycle`)
  - deploy-time hook
- **Rationale**: The repo already contains a correct tenant-scoped implementation:
  - Command: `app/Console/Commands/TenantpilotBackfillFindingLifecycle.php`
  - Job: `app/Jobs/BackfillFindingLifecycleJob.php`
  - It uses `OperationRunService` for lifecycle transitions and idempotency, and a cache lock per tenant.
- **Alternatives considered**:
  - Build a new pipeline from scratch → rejected as it duplicates proven behavior and increases drift risk.
### 2) “All tenants” scope uses a single workspace run updated by many tenant jobs
- **Decision**: Implement All-tenants as:
  1) one **workspace-scoped** `OperationRun` (tenant_id = null) created with `OperationRunService::ensureWorkspaceRunWithIdentity()`
  2) fan-out to many queued tenant jobs that all **increment the same workspace run’s** `summary_counts` and contribute failures
  3) completion via `OperationRunService::maybeCompleteBulkRun()` when `processed >= total` (same pattern as workspace backfills)
- **Rationale**:
  - This matches an existing proven pattern in the repo (`tenantpilot:backfill-workspace-ids` + `BackfillWorkspaceIdsJob`).
  - It yields a single “View run” target with meaningful progress, without needing parent/child run stitching.
  - Tenant isolation remains intact because each job still operates tenant-scoped and holds the existing per-tenant lock.
- **Alternatives considered**:
  - Separate per-tenant `OperationRun` records + an umbrella run → rejected for v1 due to added coordination complexity.
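
The fan-out and completion rule can be modelled in framework-free PHP. This is only an illustrative sketch: `WorkspaceRunModel`, `recordTenantResult()`, the `succeeded`/`failed` keys and the status strings are hypothetical names, not the actual `OperationRunService` API. A real implementation would also need to increment the counters atomically in the database, because tenant jobs run concurrently.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory stand-in for one workspace-scoped OperationRun.
final class WorkspaceRunModel
{
    /** @var array{total:int, processed:int, succeeded:int, failed:int} */
    public array $summaryCounts;

    public string $status = 'running';

    /** @var list<string> */
    public array $failures = [];

    public function __construct(int $totalTenants)
    {
        $this->summaryCounts = [
            'total' => $totalTenants,
            'processed' => 0,
            'succeeded' => 0,
            'failed' => 0,
        ];
    }

    // Called once by each tenant job when it finishes.
    public function recordTenantResult(string $tenantId, bool $ok, string $error = ''): void
    {
        $this->summaryCounts['processed']++;
        $this->summaryCounts[$ok ? 'succeeded' : 'failed']++;

        if (! $ok) {
            $this->failures[] = "{$tenantId}: {$error}";
        }

        $this->maybeComplete();
    }

    // Models the "processed >= total" completion rule.
    private function maybeComplete(): void
    {
        if ($this->status !== 'running') {
            return; // already terminal, nothing to do
        }

        if ($this->summaryCounts['processed'] >= $this->summaryCounts['total']) {
            $this->status = $this->summaryCounts['failed'] > 0
                ? 'completed_with_failures'
                : 'completed';
        }
    }
}
```

Because every tenant job reports into the same object, the "View run" target only needs to read one set of counters; no parent/child stitching is involved.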
### 3) Workspace scope for /system runbooks (v1)
- **Decision**: v1 targets the **default workspace** (same workspace that owns the `platform` Tenant created by `PlatformUserSeeder`).
- **Rationale**:
  - Platform identity currently has no explicit workspace selector in the System panel.
  - Existing seeder creates `Workspace(slug=default)` and a `Tenant(external_id=platform)` inside it.
- **Alternatives considered**:
  - Multi-workspace operator selection in `/system` → deferred (not in spec, requires new UX + entitlement model).
### 4) Remove/disable `/admin` maintenance action (FR-001)
- **Decision**: Remove or feature-flag off the existing `/admin` header action “Backfill findings lifecycle” currently present in `app/Filament/Resources/FindingResource/Pages/ListFindings.php`.
- **Rationale**: Spec explicitly forbids customer-plane exposure in production-like environments.
- **Alternatives considered**:
  - Keep the action but hide visually → rejected; it still exists as an affordance and is easy to re-enable by accident.
### 5) Session isolation for `/system` (SR-004)
- **Decision**: Add a System-panel-only middleware that sets a dedicated session cookie name for `/system/*` **before** `StartSession` runs.
- **Rationale**:
  - `SystemPanelProvider` defines its own middleware list; we can insert a middleware at the top.
  - Changing `config(['session.cookie' => ...])` per request is sufficient for cookie separation without introducing a new domain.
- **Alternatives considered**:
  - Separate subdomain → deferred (explicitly “later”).
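
The ordering requirement (set the cookie name first, then start the session) can be shown with a small framework-free model. Everything here is an assumption made for illustration: the class name `UseSystemSessionCookie`, the `_system` suffix, and the plain `$config` array, which stands in for Laravel's config repository. The real middleware would receive a request object and call `config(['session.cookie' => ...])` instead.

```php
<?php
declare(strict_types=1);

// Framework-free model of the proposed middleware (hypothetical names throughout).
final class UseSystemSessionCookie
{
    public function __construct(private string $suffix = '_system')
    {
    }

    /**
     * @param array<string, string>   $config stand-in for the config repository
     * @param callable(string): mixed $next   the rest of the middleware stack
     */
    public function handle(string $path, array &$config, callable $next): mixed
    {
        $path = trim($path, '/');

        // Match "/system" and "/system/*" only, not e.g. "/systematic".
        if ($path === 'system' || str_starts_with($path, 'system/')) {
            // Must run before the session middleware reads the cookie name.
            $config['session.cookie'] .= $this->suffix;
        }

        return $next($path);
    }
}
```

With a base cookie name of `app_session`, a request to `/system/runbooks` would use `app_session_system`, while `/admin/...` requests keep the original name, so the two planes never share a session cookie.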
### 6) `/system/login` rate limiting (SR-003)
- **Decision**: Implement rate limiting inside `app/Filament/System/Pages/Auth/Login.php` (override `authenticate()`) using a combined key: `ip + normalized(email)` at 10/min.
- **Rationale**:
  - The System login already overrides `authenticate()` to add auditing.
  - Implementing rate limiting here keeps the policy tightly scoped to the System login surface.
- **Alternatives considered**:
  - Global route middleware throttle → possible, but harder to scope precisely to this Filament auth page.
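
A minimal sketch of the combined key and the 10/min limit, in plain PHP so it runs without the framework. The function and class names, the key prefix, and the normalization rule (trim plus lowercase) are assumptions for illustration; the spec only says `ip + normalized(email)`.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: builds the combined throttle key "ip + normalized(email)".
function systemLoginThrottleKey(string $ip, string $email): string
{
    // Assumed normalization: trim and lowercase.
    $normalizedEmail = strtolower(trim($email));

    // Hashing keeps raw email addresses out of cache keys.
    return 'system-login:' . sha1($ip . '|' . $normalizedEmail);
}

// Minimal fixed-window limiter kept in memory, for illustration only.
final class InMemoryLoginLimiter
{
    /** @var array<string, array{count:int, resetAt:int}> */
    private array $windows = [];

    public function __construct(
        private int $maxAttempts = 10,
        private int $windowSeconds = 60,
    ) {
    }

    // Returns true and counts the attempt if allowed; false once the limit is reached.
    public function attempt(string $key, int $now): bool
    {
        $window = $this->windows[$key] ?? null;

        // Start a fresh window on first use or after the old one expired.
        if ($window === null || $now >= $window['resetAt']) {
            $window = ['count' => 0, 'resetAt' => $now + $this->windowSeconds];
        }

        if ($window['count'] >= $this->maxAttempts) {
            $this->windows[$key] = $window;

            return false;
        }

        $window['count']++;
        $this->windows[$key] = $window;

        return true;
    }
}
```

In the actual `authenticate()` override, Laravel's `RateLimiter` facade (`tooManyAttempts()`, `hit()`, `availableIn()`) would likely take the place of the in-memory class; only the key construction would carry over as written.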
### 7) 404 vs 403 semantics for platform capability checks (SR-002)
- **Decision**: Keep cross-plane denial as **404** (existing `EnsureCorrectGuard`), but missing platform capability should return **403**.
- **Rationale**:
  - Spec requires: wrong plane → 404; platform lacking capability → 403.
  - `EnsurePlatformCapability` currently aborts with 404, which conflicts with the spec.
- **Alternatives considered**:
  - Return 404 for missing platform capability → rejected because it contradicts the agreed spec.
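
The intended status-code rule fits in one pure function. This is a sketch of the decision logic only: `systemAccessStatus()` and the capability string in the usage note are hypothetical, and the real checks stay split across the `EnsureCorrectGuard` and `EnsurePlatformCapability` middleware.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical model of the two checks; null means "allow the request".
 *
 * @param list<string> $grantedCapabilities
 */
function systemAccessStatus(
    bool $isPlatformGuard,
    array $grantedCapabilities,
    string $requiredCapability,
): ?int {
    if (! $isPlatformGuard) {
        return 404; // wrong plane: do not reveal that the route exists
    }

    if (! in_array($requiredCapability, $grantedCapabilities, true)) {
        return 403; // correct plane, but the capability is missing
    }

    return null;
}
```

For example, a tenant-plane user always gets 404 regardless of capabilities, while a platform user without an assumed `platform.runbooks.run` capability gets 403.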
### 8) Failure notifications (FR-009)
- **Decision**: On run failure, emit:
  1) the canonical terminal DB notification (`OperationRunCompleted`) to the initiating platform operator (in-app)
  2) an Alerts event (Teams / Email) **if alert routing is configured**
- **Rationale**:
  - Alerts system already exists (`AlertDispatchService` + queued deliveries). It can route to Teams webhook / Email.
  - `OperationRunCompleted` already formats the correct persistent DB notification payload via `OperationUxPresenter`.
- **Alternatives considered**:
  - Send Teams webhook directly from job → rejected; bypasses alert rules/cooldowns/quiet hours.
## Notes for implementation
- Platform capabilities must be defined in the registry (`app/Support/Auth/PlatformCapabilities.php`) and referenced via constants.
- The System panel currently does not call `->databaseNotifications()`. If we want in-app notifications for platform operators, add it.
- `OperationRun.user_id` cannot point to `platform_users`; use `context` fields to record platform initiator metadata.