TenantAtlas/specs/113-platform-ops-runbooks/research.md
2026-02-26 02:18:19 +01:00

# Research — Spec 113: Platform Ops Runbooks
This file resolves the design unknowns required to produce an implementation plan that fits the existing TenantAtlas codebase.
## Decisions
### 1) Reuse existing backfill pipeline (Command + Job) via a single service
- **Decision**: Extract a single “runbook service” that is called from:
  - `/system` runbook UI (preflight + start)
  - CLI command (`tenantpilot:findings:backfill-lifecycle`)
  - deploy-time hook
- **Rationale**: The repo already contains a correct tenant-scoped implementation:
  - Command: `app/Console/Commands/TenantpilotBackfillFindingLifecycle.php`
  - Job: `app/Jobs/BackfillFindingLifecycleJob.php`
  - It uses `OperationRunService` for lifecycle transitions and idempotency, and a cache lock per tenant.
- **Alternatives considered**:
  - Build a new pipeline from scratch → rejected as it duplicates proven behavior and increases drift risk.
### 2) “All tenants” scope uses a single workspace run updated by many tenant jobs
- **Decision**: Implement All-tenants as:
  1) one **workspace-scoped** `OperationRun` (tenant_id = null) created with `OperationRunService::ensureWorkspaceRunWithIdentity()`
  2) fan-out to many queued tenant jobs that all **increment the same workspace run's** `summary_counts` and contribute failures
  3) completion via `OperationRunService::maybeCompleteBulkRun()` when `processed >= total` (same pattern as workspace backfills)
- **Rationale**:
  - This matches an existing proven pattern in the repo (`tenantpilot:backfill-workspace-ids` + `BackfillWorkspaceIdsJob`).
  - It yields a single “View run” target with meaningful progress, without needing parent/child run stitching.
  - Tenant isolation remains intact because each job still operates tenant-scoped and holds the existing per-tenant lock.
- **Alternatives considered**:
  - Separate per-tenant `OperationRun` records + an umbrella run → rejected for v1 due to added coordination complexity.
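
The counter-and-completion flow above can be sketched in plain PHP. This is only an in-memory illustration, not the real `OperationRunService`: the class `WorkspaceRunSketch`, the `succeeded`/`failed` keys, and the outcome strings are hypothetical, and a real implementation would need atomic increments in the database because tenant jobs run concurrently.

```php
<?php

declare(strict_types=1);

// Hypothetical in-memory stand-in for the single workspace-scoped run.
final class WorkspaceRunSketch
{
    /** @var array<string, int> */
    public array $summaryCounts;
    public string $status = 'running';
    public ?string $outcome = null;

    public function __construct(int $total)
    {
        // 'total' and 'processed' mirror the completion check in the text;
        // 'succeeded' and 'failed' are assumed key names.
        $this->summaryCounts = ['total' => $total, 'processed' => 0, 'succeeded' => 0, 'failed' => 0];
    }

    // Called once by each tenant job when it finishes.
    public function recordTenantResult(bool $ok): void
    {
        $this->summaryCounts['processed']++;
        $this->summaryCounts[$ok ? 'succeeded' : 'failed']++;
        $this->maybeComplete();
    }

    // Completes the run once every tenant job has reported in.
    private function maybeComplete(): void
    {
        if ($this->status === 'completed') {
            return; // already terminal, nothing to do
        }

        if ($this->summaryCounts['processed'] >= $this->summaryCounts['total']) {
            $this->status = 'completed';
            $this->outcome = $this->summaryCounts['failed'] > 0 ? 'failed' : 'succeeded';
        }
    }
}

// Three tenants, one of which fails.
$run = new WorkspaceRunSketch(3);
$run->recordTenantResult(true);
$run->recordTenantResult(false);
$statusAfterTwo = $run->status;
$run->recordTenantResult(true);
```

The run stays `running` until the last tenant job reports, which is what gives the single “View run” target meaningful progress.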
### 3) Workspace scope for /system runbooks (v1)
- **Decision**: v1 targets the **default workspace** (same workspace that owns the `platform` Tenant created by `PlatformUserSeeder`).
- **Rationale**:
  - Platform identity currently has no explicit workspace selector in the System panel.
  - Existing seeder creates `Workspace(slug=default)` and a `Tenant(external_id=platform)` inside it.
- **Alternatives considered**:
  - Multi-workspace operator selection in `/system` → deferred (not in spec, requires new UX + entitlement model).
### 4) Remove/disable `/admin` maintenance action (FR-001)
- **Decision**: Remove or feature-flag off the existing `/admin` header action “Backfill findings lifecycle” currently present in `app/Filament/Resources/FindingResource/Pages/ListFindings.php`.
- **Rationale**: Spec explicitly forbids customer-plane exposure in production-like environments.
- **Alternatives considered**:
  - Keep the action but hide visually → rejected; it still exists as an affordance and is easy to re-enable by accident.
### 5) Session isolation for `/system` (SR-004)
- **Decision**: Add a System-panel-only middleware that sets a dedicated session cookie name for `/system/*` **before** `StartSession` runs.
- **Rationale**:
  - `SystemPanelProvider` defines its own middleware list; we can insert a middleware at the top.
  - Changing `config(['session.cookie' => ...])` per request is sufficient for cookie separation without introducing a new domain.
- **Alternatives considered**:
  - Separate subdomain → deferred (explicitly “later”).
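
Why the ordering matters can be shown with a small framework-free model. It is a simplification: the closures run sequentially rather than as a nested Laravel pipeline, `RequestState` stands in for the config repository, and the cookie names `app_session` and `system_session` are hypothetical.

```php
<?php

declare(strict_types=1);

// Hypothetical per-request state; cookieConfig stands in for config('session.cookie').
final class RequestState
{
    public string $cookieConfig = 'app_session';
    public ?string $sessionCookieUsed = null; // what the session layer picked up
}

// Stand-in for the System-panel-only middleware described above.
$useSystemCookie = static function (RequestState $state): void {
    $state->cookieConfig = 'system_session';
};

// Stand-in for StartSession: it reads the cookie name at the moment it runs.
$startSession = static function (RequestState $state): void {
    $state->sessionCookieUsed = $state->cookieConfig;
};

/** @param list<callable(RequestState): void> $stack */
function runMiddleware(array $stack): RequestState
{
    $state = new RequestState();
    foreach ($stack as $middleware) {
        $middleware($state);
    }

    return $state;
}

$correct = runMiddleware([$useSystemCookie, $startSession]);
$tooLate = runMiddleware([$startSession, $useSystemCookie]);
```

In the second ordering the session layer has already bound to the default cookie, so the override has no effect on that request.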
### 6) `/system/login` rate limiting (SR-003)
- **Decision**: Implement rate limiting inside `app/Filament/System/Pages/Auth/Login.php` (override `authenticate()`) using a combined key: `ip + normalized(email)` at 10/min.
- **Rationale**:
  - The System login already overrides `authenticate()` to add auditing.
  - Implementing rate limiting here keeps the policy tightly scoped to the System login surface.
- **Alternatives considered**:
  - Global route middleware throttle → possible, but harder to scope precisely to this Filament auth page.
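
A minimal sketch of the combined key and the 10/min limit, written as plain PHP so it runs without the framework. It assumes that “normalized” means trimmed and lowercased. The `system-login:` prefix, the function `loginThrottleKey()`, and the class `FixedWindowLimiter` are hypothetical; the real implementation would more likely use Laravel's rate limiter than an in-memory array.

```php
<?php

declare(strict_types=1);

// Builds the throttle key from IP plus normalized email (assumed: trim + lowercase).
function loginThrottleKey(string $ip, string $email): string
{
    return 'system-login:' . $ip . '|' . strtolower(trim($email));
}

// Hypothetical in-memory fixed-window limiter; the time is passed in so it is testable.
final class FixedWindowLimiter
{
    /** @var array<string, array{count: int, resetAt: int}> */
    private array $buckets = [];

    public function __construct(
        private int $maxAttempts = 10,
        private int $windowSeconds = 60,
    ) {
    }

    // Returns true if the attempt is allowed, false if the key is throttled.
    public function attempt(string $key, int $now): bool
    {
        $bucket = $this->buckets[$key] ?? null;

        // Start a fresh window on first use or once the old window has expired.
        if ($bucket === null || $now >= $bucket['resetAt']) {
            $bucket = ['count' => 0, 'resetAt' => $now + $this->windowSeconds];
        }

        $allowed = $bucket['count'] < $this->maxAttempts;
        if ($allowed) {
            $bucket['count']++;
        }

        $this->buckets[$key] = $bucket;

        return $allowed;
    }
}

$limiter = new FixedWindowLimiter();
$key = loginThrottleKey('203.0.113.7', '  Ops@Example.com ');

$results = [];
for ($i = 0; $i < 11; $i++) {
    $results[] = $limiter->attempt($key, 1000); // all within the same window
}
$afterWindow = $limiter->attempt($key, 1060); // window has expired
```

Keying on both IP and email means one attacker cannot lock out an operator from every network, and one noisy IP does not throttle unrelated accounts.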
### 7) 404 vs 403 semantics for platform capability checks (SR-002)
- **Decision**: Keep cross-plane denial as **404** (existing `EnsureCorrectGuard`), but missing platform capability should return **403**.
- **Rationale**:
  - Spec requires: wrong plane → 404; platform lacking capability → 403.
  - The current `EnsurePlatformCapability` aborts with 404, which conflicts with the spec.
- **Alternatives considered**:
  - Return 404 for missing platform capability → rejected because it contradicts the agreed spec.
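
The intended semantics reduce to a two-step check, sketched here as a pure function. The name `systemDenialStatus()` and its boolean inputs are hypothetical; in the codebase the two checks live in `EnsureCorrectGuard` and `EnsurePlatformCapability` respectively.

```php
<?php

declare(strict_types=1);

// Returns the HTTP status to abort with, or null when the request may proceed.
function systemDenialStatus(bool $isPlatformGuard, bool $hasCapability): ?int
{
    if (! $isPlatformGuard) {
        return 404; // wrong plane: do not reveal that /system exists
    }

    if (! $hasCapability) {
        return 403; // platform user, but lacking the required capability
    }

    return null;
}
```

The guard check comes first, so a cross-plane request always receives 404 regardless of any capability state.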
### 8) Failure notifications (FR-009)
- **Decision**: On run failure, emit:
  1) the canonical terminal DB notification (`OperationRunCompleted`) to the initiating platform operator (in-app)
  2) an Alerts event (Teams / Email) **if alert routing is configured**
- **Rationale**:
  - Alerts system already exists (`AlertDispatchService` + queued deliveries). It can route to Teams webhook / Email.
  - `OperationRunCompleted` already formats the correct persistent DB notification payload via `OperationUxPresenter`.
- **Alternatives considered**:
  - Send Teams webhook directly from job → rejected; bypasses alert rules/cooldowns/quiet hours.
## Notes for implementation
- Platform capabilities must be defined in the registry (`app/Support/Auth/PlatformCapabilities.php`) and referenced via constants.
- The System panel currently does not call `->databaseNotifications()`. If we want in-app notifications for platform operators, add it.
- `OperationRun.user_id` cannot point to `platform_users`; use `context` fields to record platform initiator metadata.