TenantAtlas/specs/114-system-console-control-tower/research.md

# Phase 0 — Research (Spec 114: System Console Control Tower)

## Goal
Deliver a platform-operator “/system” control plane that is **strictly separated** from “/admin”, is **metadata-only by default**, and provides fast routing into canonical `OperationRun` detail.

## Existing primitives (reuse)

### System panel + plane separation
- `app/Providers/Filament/SystemPanelProvider.php`
  - Panel: `id=system`, `path=system`, `authGuard('platform')`
  - Uses `UseSystemSessionCookie` to isolate sessions from `/admin`
  - Uses middleware `ensure-correct-guard:platform` and capability gate `ensure-platform-capability:<ACCESS_SYSTEM_PANEL>`
- `app/Http/Middleware/UseSystemSessionCookie.php`
  - Implements Spec 114 clarification: separate session cookie name for `/system`

### Authorization semantics (404 vs 403)
- Existing tests already enforce the clarified behavior:
  - Non-platform (wrong guard) → 404 (deny-as-not-found)
  - Platform user missing capability → 403
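
The 404-vs-403 split above can be sketched as a small decision function. This is an illustration only, not the project's middleware: `systemPanelStatus()` and the capability string used in the usage note are hypothetical names, and handling of unauthenticated requests (e.g. redirecting to login) is deliberately left out.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: maps an authenticated user's guard and capabilities
 * to the HTTP status the System panel is expected to return.
 *
 * @param string       $authenticatedGuard guard the user is logged in under
 * @param list<string> $capabilities       capabilities held by the user
 */
function systemPanelStatus(string $authenticatedGuard, array $capabilities, string $requiredCapability): int
{
    // Wrong guard: deny-as-not-found, so /system is not discoverable.
    if ($authenticatedGuard !== 'platform') {
        return 404;
    }

    // Correct guard but missing capability: explicit 403.
    if (!in_array($requiredCapability, $capabilities, true)) {
        return 403;
    }

    return 200;
}
```

For example, `systemPanelStatus('web', [], 'platform.access_system_panel')` yields 404, while a `platform` user without that (assumed) capability gets 403.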

### Operation runs (Monitoring source of truth)
- `app/Models/OperationRun.php` + migrations under `database/migrations/*operation_runs*`
  - `workspace_id` is required; `tenant_id` is nullable (supports tenantless runs)
  - `failure_summary`, `summary_counts`, and `context` are JSON arrays that are already used in the UI
- `app/Services/OperationRunService.php`
  - Canonical lifecycle transitions, summary-count normalization, failure sanitization
  - Provides stale-queued-run helpers (`isStaleQueuedRun()` and `failStaleQueuedRun()`)
- Canonical System run links:
  - `app/Support/System/SystemOperationRunLinks.php` (index + view)

### Sanitization / data minimization
- Failures: `app/Support/OpsUx/RunFailureSanitizer.php` (reason normalization + message redaction)
- Audit metadata: `app/Support/Audit/AuditContextSanitizer.php` (redacts token/secret/password-like keys + bearer/JWT strings)
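
To make the redaction behavior concrete, here is a minimal sketch of key-based and pattern-based redaction. It is not the code in `AuditContextSanitizer`; the function names, the placeholder strings, and the exact regular expressions are assumptions chosen for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative redaction pass (hypothetical, not the project's sanitizer).
 *
 * @param array<array-key, mixed> $context
 * @return array<array-key, mixed>
 */
function redactContext(array $context): array
{
    $redacted = [];

    foreach ($context as $key => $value) {
        // Keys that look like credentials are replaced wholesale.
        if (is_string($key) && preg_match('/token|secret|password/i', $key) === 1) {
            $redacted[$key] = '[REDACTED]';
            continue;
        }

        if (is_array($value)) {
            $redacted[$key] = redactContext($value); // recurse into nested arrays
        } elseif (is_string($value)) {
            $redacted[$key] = redactString($value);
        } else {
            $redacted[$key] = $value; // numbers, booleans, null pass through
        }
    }

    return $redacted;
}

function redactString(string $value): string
{
    // Bearer tokens: "Bearer <opaque value>".
    $value = preg_replace('/\bBearer\s+[A-Za-z0-9\-._~+\/]+=*/i', 'Bearer [REDACTED]', $value) ?? $value;

    // JWT-shaped strings: three dot-separated base64url segments starting with "eyJ".
    return preg_replace('/\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/', '[REDACTED_JWT]', $value) ?? $value;
}
```

Matching on key names first keeps the common case cheap; the string patterns catch secrets embedded in free-text values such as error messages.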

### Access logs signal source
- `app/Models/AuditLog.php`
- System login auditing:
  - `app/Filament/System/Pages/Auth/Login.php` writes `AuditLog` events with action `platform.auth.login`
- Break-glass auditing:
  - `app/Services/Auth/BreakGlassSession.php` writes `platform.break_glass.enter|exit|expired`

## Key gaps to implement (Spec 114)

### Navigation/IA
- Add System pages:
  - `/system/directory/workspaces` (+ detail)
  - `/system/directory/tenants` (+ detail)
  - `/system/ops/runs` (global index); the canonical detail page already exists but is currently *runbook-type scoped*
  - `/system/ops/failures` (prefilter)
  - `/system/ops/stuck` (prefilter)
  - `/system/security/access-logs`

### RBAC (platform capabilities)
- `app/Support/Auth/PlatformCapabilities.php` currently defines capabilities only for Ops, runbooks, break-glass, and core panel access.
- Spec 114 introduces additional capabilities (e.g. `platform.console.view`, `platform.directory.view`, `platform.operations.manage`).

Decision:
- Extend `PlatformCapabilities` registry with Spec 114 capabilities and update system pages to gate via the registry constants (no raw strings).
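
A registry extension along these lines might look like the sketch below. Only the three capability strings come from Spec 114; the class name `PlatformCapabilitiesSketch`, the constant names, and the `all()`/`isKnown()` helpers are assumptions and may not match the existing `PlatformCapabilities` class.

```php
<?php
declare(strict_types=1);

/**
 * Sketch of registry constants for the Spec 114 capabilities.
 * Class, constant, and method names are hypothetical.
 */
final class PlatformCapabilitiesSketch
{
    public const CONSOLE_VIEW = 'platform.console.view';
    public const DIRECTORY_VIEW = 'platform.directory.view';
    public const OPERATIONS_MANAGE = 'platform.operations.manage';

    /** @return list<string> */
    public static function all(): array
    {
        return [self::CONSOLE_VIEW, self::DIRECTORY_VIEW, self::OPERATIONS_MANAGE];
    }

    /** Guards against typos when a capability arrives as a raw string. */
    public static function isKnown(string $capability): bool
    {
        return in_array($capability, self::all(), true);
    }
}
```

Pages would then reference `PlatformCapabilitiesSketch::DIRECTORY_VIEW` rather than the literal string, so a misspelled capability fails at load time instead of silently denying access.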

### Stuck definition
- There is a helper for “stale queued” in `OperationRunService`, but no “running too long” classification.

Decision:
- Introduce configurable stuck thresholds for `queued` and `running` (minutes) under a single config namespace (e.g. `config/tenantpilot.php`), and implement stuck classification in a dedicated helper/service used by the System pages.
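
One way the classification could work is sketched below. The function name `isStuckRun()`, the threshold values in the usage note, and the choice of reference timestamp (for example `created_at` for queued runs and a start timestamp for running ones) are assumptions, not decisions recorded in this spec.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stuck classifier.
 *
 * @param string                           $status           current run status
 * @param DateTimeImmutable                $since            when the run entered that status
 * @param DateTimeImmutable                $now              evaluation time (injected for testability)
 * @param array{queued: int, running: int} $thresholdMinutes per-status thresholds in minutes
 */
function isStuckRun(string $status, DateTimeImmutable $since, DateTimeImmutable $now, array $thresholdMinutes): bool
{
    // Only statuses with a configured threshold can be stuck;
    // terminal or unknown statuses never qualify.
    if (!isset($thresholdMinutes[$status])) {
        return false;
    }

    $ageSeconds = $now->getTimestamp() - $since->getTimestamp();

    return $ageSeconds >= $thresholdMinutes[$status] * 60;
}
```

With illustrative thresholds of `['queued' => 15, 'running' => 60]`, a run queued 20 minutes ago is classified as stuck, while a run that has been running for 20 minutes is not.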

### Control Tower aggregation
- Spec 114 requires KPIs + top offenders in a selectable time window.

Decision:
- Use DB-only aggregation on `operation_runs` for the selected time window:
  - KPIs: counts by outcome/status, and “failed/stuck” counts
  - Top offenders: group by tenant/workspace for failed runs
  - Default time window: 24h; supported: 1h/24h/7d
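
The window handling and a top-offenders query could be sketched as follows. The constant and function names are hypothetical, the `'failed'` outcome value and the `LIMIT 10` cutoff are assumptions, and the SQL is shown as a plain string rather than the query-builder call the implementation would likely use.

```php
<?php
declare(strict_types=1);

/** Supported windows from the decision above, mapped to ISO 8601 durations. */
const CONTROL_TOWER_WINDOWS = ['1h' => 'PT1H', '24h' => 'PT24H', '7d' => 'P7D'];

/** Resolves a window key to its start time; unknown keys fall back to 24h. */
function windowStart(string $window, DateTimeImmutable $now): DateTimeImmutable
{
    $spec = CONTROL_TOWER_WINDOWS[$window] ?? CONTROL_TOWER_WINDOWS['24h'];

    return $now->sub(new DateInterval($spec));
}

// Hypothetical top-offenders query: failed runs grouped by workspace and tenant.
// :window_start would be bound to windowStart(...) by the caller.
const TOP_OFFENDERS_SQL = <<<'SQL'
SELECT workspace_id, tenant_id, COUNT(*) AS failed_runs
FROM operation_runs
WHERE outcome = 'failed'
  AND created_at >= :window_start
GROUP BY workspace_id, tenant_id
ORDER BY failed_runs DESC
LIMIT 10
SQL;
```

Restricting the window to a fixed allow-list keeps user input out of the query and keeps the time-range predicate friendly to an index on `created_at`.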

## Non-functional decisions (resolving “NEEDS CLARIFICATION”)

### Technical context (resolved)
- Language/runtime: PHP 8.4 (Laravel 12)
- Admin framework: Filament v5 + Livewire v4
- Storage: PostgreSQL (Sail locally)
- Testing: Pest v4
- Target: web app (server-rendered Livewire/Filament)

### Performance goals (assumptions, but explicit)
- System list pages are DB-only at render time; no external calls.
- Target: p95 < 1.0s for index pages at typical production volumes, using:
  - time-window defaults (24h)
  - pagination
  - indexes for `operation_runs(status,outcome,created_at,type,workspace_id,tenant_id)` and `audit_logs(action,recorded_at,actor_id)`

### Data minimization
- Default run detail surfaces only sanitized `failure_summary` + normalized `summary_counts`.
- `context` rendering remains sanitized/limited (avoid raw payload dumps by default).

## Alternatives considered
- New “SystemOperationRun” table: rejected; existing `OperationRun` is already the canonical monitoring artifact.
- Building Access Logs from web server logs: rejected; `AuditLog` already exists, is sanitized, and includes platform-auth events.