TenantAtlas/specs/114-system-console-control-tower/research.md
ahmido 0cf612826f feat(114): system console control tower (merged) (#139)
Feature branch PR for Spec 114.

This branch contains the merged agent session work (see merge commit on branch).

Tests
- `vendor/bin/sail artisan test --compact tests/Feature/System/Spec114/`

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #139
2026-02-28 00:15:31 +00:00

# Phase 0 — Research (Spec 114: System Console Control Tower)
## Goal
Deliver a platform-operator “/system” control plane that is **strictly separated** from “/admin”, is **metadata-only by default**, and provides fast routing into canonical `OperationRun` detail.
## Existing primitives (reuse)
### System panel + plane separation
- `app/Providers/Filament/SystemPanelProvider.php`
  - Panel: `id=system`, `path=system`, `authGuard('platform')`
  - Uses `UseSystemSessionCookie` to isolate sessions from `/admin`
  - Uses middleware `ensure-correct-guard:platform` and capability gate `ensure-platform-capability:<ACCESS_SYSTEM_PANEL>`
- `app/Http/Middleware/UseSystemSessionCookie.php`
  - Implements Spec 114 clarification: separate session cookie name for `/system`
### Authorization semantics (404 vs 403)
- Existing tests already enforce the clarified behavior:
  - Non-platform (wrong guard) → 404 (deny-as-not-found)
  - Platform user missing capability → 403
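
The two outcomes above can be read as a small decision function. The following is a minimal sketch in plain PHP, assuming a hypothetical helper name (`systemPanelStatus`) and a capability list passed in as strings; the real checks live in the `ensure-correct-guard` and `ensure-platform-capability` middleware, whose internals are not shown in this document.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the codebase): maps the guard a user is
 * authenticated under and their capabilities to the expected HTTP status.
 */
function systemPanelStatus(string $guard, array $capabilities, string $required): int
{
    // Authenticated under a guard other than `platform`: deny-as-not-found,
    // so the existence of /system is not revealed.
    if ($guard !== 'platform') {
        return 404;
    }

    // Correct guard but the required capability is missing: forbidden.
    if (!in_array($required, $capabilities, true)) {
        return 403;
    }

    return 200;
}
```

The guard check runs first on purpose: a wrong-guard user should never learn, via a 403, that the route exists.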
### Operation runs (Monitoring source of truth)
- `app/Models/OperationRun.php` + migrations under `database/migrations/*operation_runs*`
  - `workspace_id` is required; `tenant_id` is nullable (supports tenantless runs)
  - `failure_summary`, `summary_counts`, `context` are JSON arrays and already used in UI
- `app/Services/OperationRunService.php`
  - Canonical lifecycle transitions, summary-count normalization, failure sanitization
  - Provides a stale-queued-run helper (`isStaleQueuedRun()` + `failStaleQueuedRun()`)
- Canonical System run links:
  - `app/Support/System/SystemOperationRunLinks.php` (index + view)
### Sanitization / data minimization
- Failures: `app/Support/OpsUx/RunFailureSanitizer.php` (reason normalization + message redaction)
- Audit metadata: `app/Support/Audit/AuditContextSanitizer.php` (redacts token/secret/password-like keys + bearer/JWT strings)
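
To illustrate the kind of redaction described above, here is a minimal sketch of key-based and pattern-based scrubbing in plain PHP. The function name, the key pattern, the regexes, and the `[REDACTED]` placeholder are all assumptions for illustration; they are not the actual rules of `AuditContextSanitizer` or `RunFailureSanitizer`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch of metadata redaction. The patterns below are
 * assumptions, not the rules used by AuditContextSanitizer.
 */
function redactContext(array $context): array
{
    // Keys that look like credentials are redacted wholesale.
    $sensitiveKey = '/token|secret|password/i';
    // "Bearer <credentials>" headers and three-segment base64url JWTs.
    $bearer = '/Bearer\s+[A-Za-z0-9\-._~+\/]+=*/i';
    $jwt = '/\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/';

    $clean = [];
    foreach ($context as $key => $value) {
        if (is_string($key) && preg_match($sensitiveKey, $key) === 1) {
            $clean[$key] = '[REDACTED]';
        } elseif (is_array($value)) {
            // Recurse so nested metadata gets the same treatment.
            $clean[$key] = redactContext($value);
        } elseif (is_string($value)) {
            $clean[$key] = preg_replace([$bearer, $jwt], '[REDACTED]', $value);
        } else {
            $clean[$key] = $value;
        }
    }

    return $clean;
}
```

Redacting by key first and by value pattern second means a credential is still caught when it appears inside free-text fields such as error messages.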
### Access logs signal source
- `app/Models/AuditLog.php`
- System login auditing:
  - `app/Filament/System/Pages/Auth/Login.php` writes `AuditLog` events with action `platform.auth.login`
- Break-glass auditing:
  - `app/Services/Auth/BreakGlassSession.php` writes `platform.break_glass.enter|exit|expired`
## Key gaps to implement (Spec 114)
### Navigation/IA
- Add System pages:
  - `/system/directory/workspaces` (+ detail)
  - `/system/directory/tenants` (+ detail)
  - `/system/ops/runs` (global list); the canonical detail page already exists but is currently *runbook-type scoped*
  - `/system/ops/failures` (prefilter)
  - `/system/ops/stuck` (prefilter)
  - `/system/security/access-logs`
### RBAC (platform capabilities)
- `app/Support/Auth/PlatformCapabilities.php` currently covers only Ops, runbooks, break-glass, and core panel access.
- Spec 114 introduces additional capabilities (e.g. `platform.console.view`, `platform.directory.view`, `platform.operations.manage`).
Decision:
- Extend `PlatformCapabilities` registry with Spec 114 capabilities and update system pages to gate via the registry constants (no raw strings).
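
As a rough illustration of gating through registry constants, the sketch below shows one possible shape. The constant names, the `all()` helper, and `canViewDirectory()` are assumptions; only the three string values come from Spec 114, and the existing constants in the real class are omitted here.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical shape of the extended registry. Constant names and all()
 * are assumptions; only the string values are taken from Spec 114.
 */
final class PlatformCapabilities
{
    public const CONSOLE_VIEW = 'platform.console.view';
    public const DIRECTORY_VIEW = 'platform.directory.view';
    public const OPERATIONS_MANAGE = 'platform.operations.manage';

    /** @return list<string> */
    public static function all(): array
    {
        return [
            self::CONSOLE_VIEW,
            self::DIRECTORY_VIEW,
            self::OPERATIONS_MANAGE,
        ];
    }
}

// Hypothetical page-level check: gate on the constant, never on a raw string.
function canViewDirectory(array $grantedCapabilities): bool
{
    return in_array(PlatformCapabilities::DIRECTORY_VIEW, $grantedCapabilities, true);
}
```

Keeping every capability string in one class means a typo becomes an undefined-constant error at the call site rather than a silently failing permission check.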
### Stuck definition
- There is a helper for “stale queued” in `OperationRunService`, but no “running too long” classification.
Decision:
- Introduce configurable stuck thresholds for `queued` and `running` (minutes) under a single config namespace (e.g. `config/tenantpilot.php`), and implement stuck classification in a dedicated helper/service used by the System pages.
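
A stuck classifier along these lines could look like the sketch below. The function name, the threshold defaults (15 and 60 minutes), and the idea of passing "the time the run entered its current status" as `$since` are assumptions; real thresholds would be read from the chosen config namespace.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stuck classifier. Threshold values and the function name are
 * assumptions; production values would come from configuration.
 *
 * @param DateTimeImmutable $since When the run entered its current status.
 */
function isStuckRun(
    string $status,
    DateTimeImmutable $since,
    DateTimeImmutable $now,
    array $thresholdMinutes = ['queued' => 15, 'running' => 60],
): bool {
    // Only statuses with a configured threshold can be stuck;
    // terminal statuses never are.
    if (!isset($thresholdMinutes[$status])) {
        return false;
    }

    $elapsedMinutes = intdiv($now->getTimestamp() - $since->getTimestamp(), 60);

    return $elapsedMinutes >= $thresholdMinutes[$status];
}
```

Passing `$now` in explicitly, rather than calling the clock inside the function, keeps the classification deterministic in tests.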
### Control Tower aggregation
- Spec 114 requires KPIs + top offenders in a selectable time window.
Decision:
- Use DB-only aggregation on `operation_runs` for the selected time window:
  - KPIs: counts by outcome/status, and “failed/stuck” counts
  - Top offenders: group by tenant/workspace for failed runs
- Default time window: 24h; supported: 1h/24h/7d
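
The window handling and grouping semantics can be sketched as follows. Note that this is a pure-PHP, in-memory stand-in that only mirrors what the `WHERE created_at >= …` filter and `GROUP BY` would compute; the decision above is to run the aggregation in the database. Function names, row keys, the `failed` outcome value, and the `tenantless` bucket label are assumptions, and only outcome counts and per-tenant offenders are shown.

```php
<?php

declare(strict_types=1);

/** Maps a window key to its cutoff; unknown keys fall back to the 24h default. */
function windowStart(string $window, DateTimeImmutable $now): DateTimeImmutable
{
    $hours = ['1h' => 1, '24h' => 24, '7d' => 168][$window] ?? 24;

    return $now->sub(new DateInterval("PT{$hours}H"));
}

/**
 * In-memory stand-in for the GROUP BY queries. Row keys and the 'failed'
 * outcome value are assumptions about the operation_runs shape.
 */
function controlTowerSummary(array $runs, DateTimeImmutable $from, int $top = 5): array
{
    $byOutcome = [];
    $failedByTenant = [];

    foreach ($runs as $run) {
        if ($run['created_at'] < $from) {
            continue; // outside the selected window
        }
        $byOutcome[$run['outcome']] = ($byOutcome[$run['outcome']] ?? 0) + 1;
        if ($run['outcome'] === 'failed') {
            // tenant_id is nullable, so tenantless runs get their own bucket.
            $tenant = $run['tenant_id'] ?? 'tenantless';
            $failedByTenant[$tenant] = ($failedByTenant[$tenant] ?? 0) + 1;
        }
    }

    arsort($failedByTenant); // highest failure count first

    return [
        'by_outcome' => $byOutcome,
        'top_offenders' => array_slice($failedByTenant, 0, $top, true),
    ];
}
```

Falling back to 24h for unrecognized window keys keeps a tampered or stale query parameter from widening the scan beyond the supported ranges.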
## Non-functional decisions (resolving “NEEDS CLARIFICATION”)
### Technical context (resolved)
- Language/runtime: PHP 8.4 (Laravel 12)
- Admin framework: Filament v5 + Livewire v4
- Storage: PostgreSQL (Sail locally)
- Testing: Pest v4
- Target: web app (server-rendered Livewire/Filament)
### Performance goals (assumptions, but explicit)
- System list pages are DB-only at render time; no external calls.
- Target: p95 < 1.0s for index pages at typical production volumes, using:
  - time-window defaults (24h)
  - pagination
  - indexes for `operation_runs(status,outcome,created_at,type,workspace_id,tenant_id)` and `audit_logs(action,recorded_at,actor_id)`
### Data minimization
- Default run detail surfaces only sanitized `failure_summary` + normalized `summary_counts`.
- `context` rendering remains sanitized/limited (avoid raw payload dumps by default).
## Alternatives considered
- New SystemOperationRun table: rejected; existing `OperationRun` is already the canonical monitoring artifact.
- Building Access Logs from web server logs: rejected; `AuditLog` already exists, is sanitized, and includes platform-auth events.