# Phase 0 — Research (Spec 114: System Console Control Tower)

## Goal
Deliver a platform-operator `/system` control plane that is **strictly separated** from `/admin`, is **metadata-only by default**, and provides fast routing into canonical `OperationRun` detail.

## Existing primitives (reuse)

### System panel + plane separation
- `app/Providers/Filament/SystemPanelProvider.php`
  - Panel: `id=system`, `path=system`, `authGuard('platform')`
  - Uses `UseSystemSessionCookie` to isolate sessions from `/admin`
  - Uses middleware `ensure-correct-guard:platform` and capability gate `ensure-platform-capability:<ACCESS_SYSTEM_PANEL>`
- `app/Http/Middleware/UseSystemSessionCookie.php`
  - Implements Spec 114 clarification: separate session cookie name for `/system`

### Authorization semantics (404 vs 403)
- Existing tests already enforce the clarified behavior:
  - Non-platform (wrong guard) → 404 (deny-as-not-found)
  - Platform user missing capability → 403

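The two outcomes above can be sketched as a small pure function. This is only an illustration of the decision order (guard first, capability second); the function name, the `200` fall-through, and the capability string used in the usage line are hypothetical and do not reflect the actual middleware code.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: maps guard + capability state to the HTTP status
 * the System panel is expected to return.
 *
 * @param list<string> $capabilities capabilities held by the current user
 */
function systemPanelStatus(?string $guard, array $capabilities, string $required): int
{
    // Wrong or missing guard: deny-as-not-found, so /system is not disclosed.
    if ($guard !== 'platform') {
        return 404;
    }

    // Correct guard but missing capability: explicit forbidden.
    if (!in_array($required, $capabilities, true)) {
        return 403;
    }

    return 200;
}

// Usage with a placeholder capability string (not the real constant value).
echo systemPanelStatus('web', [], 'access-system-panel'), PHP_EOL;
```

Checking the guard before the capability matters: a tenant-plane user never learns that a capability check exists at all.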
### Operation runs (Monitoring source of truth)
- `app/Models/OperationRun.php` + migrations under `database/migrations/*operation_runs*`
  - `workspace_id` is required; `tenant_id` is nullable (supports tenantless runs)
  - `failure_summary`, `summary_counts`, `context` are JSON arrays and already used in UI
- `app/Services/OperationRunService.php`
  - Canonical lifecycle transitions, summary-count normalization, failure sanitization
  - Has stale queued run helper (`isStaleQueuedRun()` + `failStaleQueuedRun()`)
- Canonical System run links:
  - `app/Support/System/SystemOperationRunLinks.php` (index + view)

### Sanitization / data minimization
- Failures: `app/Support/OpsUx/RunFailureSanitizer.php` (reason normalization + message redaction)
- Audit metadata: `app/Support/Audit/AuditContextSanitizer.php` (redacts token/secret/password-like keys + bearer/JWT strings)

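To make the redaction rules concrete, here is a minimal sketch of key-based and pattern-based redaction. It is not the code in `AuditContextSanitizer`; the function name, the regular expressions, and the `[REDACTED]` placeholders are assumptions chosen for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: recursively redacts values under sensitive-looking keys
 * and masks bearer/JWT-like strings inside other string values.
 */
function redactContext(array $context): array
{
    $redacted = [];

    foreach ($context as $key => $value) {
        // Keys that look like token/secret/password are replaced wholesale.
        if (is_string($key) && preg_match('/token|secret|password/i', $key) === 1) {
            $redacted[$key] = '[REDACTED]';
            continue;
        }

        if (is_array($value)) {
            $redacted[$key] = redactContext($value);
            continue;
        }

        if (is_string($value)) {
            // Mask "Bearer <credential>" fragments.
            $value = preg_replace('/Bearer\s+[A-Za-z0-9\-._~+\/]+=*/i', 'Bearer [REDACTED]', $value);
            // Mask three dot-separated base64url segments starting with "eyJ".
            $value = preg_replace('/\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/', '[REDACTED_JWT]', $value);
        }

        $redacted[$key] = $value;
    }

    return $redacted;
}
```

Redacting by key name first keeps the common case cheap; the string patterns only catch credentials that leak into free-text values.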
### Access logs signal source
- `app/Models/AuditLog.php`
- System login auditing:
  - `app/Filament/System/Pages/Auth/Login.php` writes `AuditLog` events with action `platform.auth.login`
- Break-glass auditing:
  - `app/Services/Auth/BreakGlassSession.php` writes `platform.break_glass.enter|exit|expired`

## Key gaps to implement (Spec 114)

### Navigation/IA
- Add System pages:
  - `/system/directory/workspaces` (+ detail)
  - `/system/directory/tenants` (+ detail)
  - `/system/ops/runs` (global); the canonical detail page already exists but is currently *runbook-type scoped*
  - `/system/ops/failures` (prefilter)
  - `/system/ops/stuck` (prefilter)
  - `/system/security/access-logs`

### RBAC (platform capabilities)
- `app/Support/Auth/PlatformCapabilities.php` currently contains only Ops/runbooks/break-glass/core panel access.
- Spec 114 introduces additional capabilities (e.g. `platform.console.view`, `platform.directory.view`, `platform.operations.manage`).

Decision:
- Extend `PlatformCapabilities` registry with Spec 114 capabilities and update system pages to gate via the registry constants (no raw strings).

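One possible shape for the extended registry is sketched below. The three capability strings are the examples quoted above from Spec 114; the class name, constant names, and the `all()` / `isKnown()` helpers are hypothetical and are not taken from the existing `PlatformCapabilities` file.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical registry sketch. Only the string values come from Spec 114;
 * everything else here is an assumption.
 */
final class PlatformCapabilitiesSketch
{
    public const CONSOLE_VIEW = 'platform.console.view';
    public const DIRECTORY_VIEW = 'platform.directory.view';
    public const OPERATIONS_MANAGE = 'platform.operations.manage';

    /** @return list<string> every capability the registry knows about */
    public static function all(): array
    {
        return [self::CONSOLE_VIEW, self::DIRECTORY_VIEW, self::OPERATIONS_MANAGE];
    }

    /** Rejects typos that a raw string would silently let through. */
    public static function isKnown(string $capability): bool
    {
        return in_array($capability, self::all(), true);
    }
}
```

Gating pages via constants means a misspelled capability fails at static-analysis time instead of quietly denying access at runtime.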
### Stuck definition
- There is a helper for “stale queued” in `OperationRunService`, but no “running too long” classification.

Decision:
- Introduce configurable stuck thresholds for `queued` and `running` (minutes) under a single config namespace (e.g. `config/tenantpilot.php`), and implement stuck classification in a dedicated helper/service used by the System pages.

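A minimal sketch of the classification described in this decision follows. The threshold values, the array keys used to describe a run (including `started_at`), and the `stuck_*` labels are placeholders for illustration; the real model attributes and config keys may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical classifier: returns a stuck label, or null when the run is
 * within its threshold or in a status that cannot be stuck.
 *
 * @param array{status: string, created_at: string, started_at: ?string} $run
 * @param array{queued: int, running: int} $thresholdMinutes
 */
function classifyStuck(array $run, array $thresholdMinutes, DateTimeImmutable $now): ?string
{
    $status = $run['status'];

    if ($status === 'queued') {
        // Queued runs age from creation.
        $since = new DateTimeImmutable($run['created_at']);
    } elseif ($status === 'running' && $run['started_at'] !== null) {
        // Running runs age from the moment they started.
        $since = new DateTimeImmutable($run['started_at']);
    } else {
        return null;
    }

    $ageMinutes = intdiv($now->getTimestamp() - $since->getTimestamp(), 60);

    return $ageMinutes >= $thresholdMinutes[$status] ? "stuck_{$status}" : null;
}

// Placeholder thresholds, as they might be read from a config file.
$thresholds = ['queued' => 15, 'running' => 60];
```

Passing `$now` in explicitly keeps the helper deterministic, which makes threshold edge cases straightforward to test.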
### Control Tower aggregation
- Spec 114 requires KPIs + top offenders in a selectable time window.

Decision:
- Use DB-only aggregation on `operation_runs` for the selected time window:
  - KPIs: counts by outcome/status, and “failed/stuck” counts
  - Top offenders: group by tenant/workspace for failed runs
- Default time window: 24h; supported: 1h/24h/7d

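The window handling and the top-offenders grouping can be illustrated in plain PHP. This is an in-memory stand-in for what the decision describes as a DB-side aggregation, not the intended query; the function names, the `'failed'` outcome value, the row shape, and the fallback to 24h for unknown window keys are all assumptions.

```php
<?php
declare(strict_types=1);

/** Resolves a window key (1h/24h/7d) to its cutoff; unknown keys fall back to 24h. */
function windowCutoff(string $window, DateTimeImmutable $now): DateTimeImmutable
{
    $hours = ['1h' => 1, '24h' => 24, '7d' => 168];

    return $now->sub(new DateInterval('PT' . ($hours[$window] ?? 24) . 'H'));
}

/**
 * In-memory equivalent of grouping failed runs by tenant within the window.
 *
 * @param list<array{tenant_id: ?int, outcome: string, created_at: string}> $runs
 * @return array<string, int> failed-run counts keyed by tenant, highest first
 */
function topOffenders(array $runs, DateTimeImmutable $cutoff, int $limit = 5): array
{
    $counts = [];

    foreach ($runs as $run) {
        $createdAt = new DateTimeImmutable($run['created_at']);
        if ($run['outcome'] !== 'failed' || $createdAt < $cutoff) {
            continue;
        }
        // tenant_id is nullable, so tenantless runs get their own bucket.
        $key = $run['tenant_id'] === null ? 'tenantless' : 'tenant:' . $run['tenant_id'];
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }

    arsort($counts);

    return array_slice($counts, 0, $limit, true);
}
```

In the real pages the same filter and grouping would run in the database, so only the aggregated rows reach PHP.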
## Non-functional decisions (resolving “NEEDS CLARIFICATION”)

### Technical context (resolved)
- Language/runtime: PHP 8.4 (Laravel 12)
- Admin framework: Filament v5 + Livewire v4
- Storage: PostgreSQL (Sail locally)
- Testing: Pest v4
- Target: web app (server-rendered Livewire/Filament)

### Performance goals (assumptions, but explicit)
- System list pages are DB-only at render time; no external calls.
- Target: p95 < 1.0s for index pages at typical production volumes, using:
  - time-window defaults (24h)
  - pagination
  - indexes for `operation_runs(status,outcome,created_at,type,workspace_id,tenant_id)` and `audit_logs(action,recorded_at,actor_id)`

### Data minimization
- Default run detail surfaces only sanitized `failure_summary` + normalized `summary_counts`.
- `context` rendering remains sanitized/limited (avoid raw payload dumps by default).

## Alternatives considered
- New “SystemOperationRun” table: rejected; existing `OperationRun` is already the canonical monitoring artifact.
- Building Access Logs from web server logs: rejected; `AuditLog` already exists, is sanitized, and includes platform-auth events.