Feature branch PR for Spec 114. This branch contains the merged agent session work (see merge commit on branch). Tests - `vendor/bin/sail artisan test --compact tests/Feature/System/Spec114/` Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de> Reviewed-on: #139
# Phase 0 — Research (Spec 114: System Console Control Tower)

## Goal

Deliver a platform-operator “/system” control plane that is **strictly separated** from “/admin”, is **metadata-only by default**, and provides fast routing into canonical `OperationRun` detail.

## Existing primitives (reuse)

### System panel + plane separation

- `app/Providers/Filament/SystemPanelProvider.php`
  - Panel: `id=system`, `path=system`, `authGuard('platform')`
  - Uses `UseSystemSessionCookie` to isolate sessions from `/admin`
  - Uses middleware `ensure-correct-guard:platform` and capability gate `ensure-platform-capability:<ACCESS_SYSTEM_PANEL>`
- `app/Http/Middleware/UseSystemSessionCookie.php`
  - Implements Spec 114 clarification: separate session cookie name for `/system`

### Authorization semantics (404 vs 403)

- Existing tests already enforce the clarified behavior:
  - Non-platform (wrong guard) → 404 (deny-as-not-found)
  - Platform user missing capability → 403

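The 404-versus-403 rule can be sketched as a small pure function. This is a minimal illustration, not the project's middleware: the function name `systemPanelStatus()` and the capability string in the usage note are hypothetical, and only the guard name `platform` comes from the panel configuration above.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not from the codebase): maps the guard check and the
 * capability check to the HTTP status a /system request should receive.
 *
 * @param list<string> $capabilities Capabilities held by the current user.
 */
function systemPanelStatus(?string $guard, array $capabilities, string $required): int
{
    // Wrong or missing guard: deny-as-not-found, so /system is not disclosed.
    if ($guard !== 'platform') {
        return 404;
    }

    // Correct guard but missing capability: explicit "forbidden".
    if (!in_array($required, $capabilities, true)) {
        return 403;
    }

    return 200;
}
```

For example, `systemPanelStatus('web', [], 'platform.access_system_panel')` yields 404 even though the capability is also missing, because the guard check runs first. The capability string here is an assumed placeholder for the `<ACCESS_SYSTEM_PANEL>` constant.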
### Operation runs (Monitoring source of truth)

- `app/Models/OperationRun.php` + migrations under `database/migrations/*operation_runs*`
  - `workspace_id` is required; `tenant_id` is nullable (supports tenantless runs)
  - `failure_summary`, `summary_counts`, `context` are JSON arrays and are already used in the UI
- `app/Services/OperationRunService.php`
  - Canonical lifecycle transitions, summary-count normalization, failure sanitization
  - Has stale-queued-run helpers (`isStaleQueuedRun()` + `failStaleQueuedRun()`)
- Canonical System run links:
  - `app/Support/System/SystemOperationRunLinks.php` (index + view)

### Sanitization / data minimization

- Failures: `app/Support/OpsUx/RunFailureSanitizer.php` (reason normalization + message redaction)
- Audit metadata: `app/Support/Audit/AuditContextSanitizer.php` (redacts token/secret/password-like keys + bearer/JWT strings)

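To make the audit-metadata rule concrete, here is a minimal sketch of key-based and pattern-based redaction. It is illustrative only: `redactContext()` and `redactString()` are hypothetical names, the regular expressions and placeholder strings are assumptions, and the real `AuditContextSanitizer` may apply different rules.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: recursively redacts sensitive keys and string values.
 */
function redactContext(array $context): array
{
    $clean = [];

    foreach ($context as $key => $value) {
        if (is_string($key) && preg_match('/token|secret|password/i', $key) === 1) {
            // Sensitive-looking key: drop the value regardless of its type.
            $clean[$key] = '[REDACTED]';
        } elseif (is_array($value)) {
            $clean[$key] = redactContext($value);
        } elseif (is_string($value)) {
            $clean[$key] = redactString($value);
        } else {
            $clean[$key] = $value;
        }
    }

    return $clean;
}

function redactString(string $value): string
{
    // Bearer credentials such as "Bearer abc123".
    $value = preg_replace('/\bBearer\s+[A-Za-z0-9._~+\/=-]+/i', 'Bearer [REDACTED]', $value) ?? $value;

    // JWT-shaped strings: three base64url segments, the first starting with "eyJ".
    return preg_replace('/\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/', '[REDACTED_JWT]', $value) ?? $value;
}
```

Matching on key names first keeps non-string secrets (arrays, integers) out of the output, while the string patterns catch credentials embedded in free-form messages.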
### Access logs signal source

- `app/Models/AuditLog.php`
- System login auditing:
  - `app/Filament/System/Pages/Auth/Login.php` writes `AuditLog` events with action `platform.auth.login`
- Break-glass auditing:
  - `app/Services/Auth/BreakGlassSession.php` writes `platform.break_glass.enter|exit|expired`

## Key gaps to implement (Spec 114)

### Navigation/IA

- Add System pages:
  - `/system/directory/workspaces` (+ detail)
  - `/system/directory/tenants` (+ detail)
  - `/system/ops/runs` (global); the canonical detail page already exists but is currently *runbook-type scoped*
  - `/system/ops/failures` (prefilter)
  - `/system/ops/stuck` (prefilter)
  - `/system/security/access-logs`

### RBAC (platform capabilities)

- `app/Support/Auth/PlatformCapabilities.php` currently covers only Ops, runbooks, break-glass, and core panel access.
- Spec 114 introduces additional capabilities (e.g. `platform.console.view`, `platform.directory.view`, `platform.operations.manage`).

Decision:

- Extend the `PlatformCapabilities` registry with the Spec 114 capabilities and update System pages to gate via the registry constants (no raw strings).

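The "registry constants, no raw strings" decision could look roughly like the sketch below. The class name `PlatformCapabilitiesSketch`, the constant names, and the `all()` / `isKnown()` helpers are hypothetical; only the three capability strings are taken from the text above, and the existing `PlatformCapabilities` class may be organized differently.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch of a capability registry (not the real class).
 */
final class PlatformCapabilitiesSketch
{
    public const CONSOLE_VIEW = 'platform.console.view';
    public const DIRECTORY_VIEW = 'platform.directory.view';
    public const OPERATIONS_MANAGE = 'platform.operations.manage';

    /** @return list<string> */
    public static function all(): array
    {
        return [self::CONSOLE_VIEW, self::DIRECTORY_VIEW, self::OPERATIONS_MANAGE];
    }

    /** Lets callers reject capability strings that are not registered. */
    public static function isKnown(string $capability): bool
    {
        return in_array($capability, self::all(), true);
    }
}
```

Gating pages through constants means a typo such as `platform.directory.veiw` fails at compile time (undefined constant) or in `isKnown()`, rather than silently denying access.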
### Stuck definition

- There is a helper for “stale queued” in `OperationRunService`, but no “running too long” classification.

Decision:

- Introduce configurable stuck thresholds for `queued` and `running` (minutes) under a single config namespace (e.g. `config/tenantpilot.php`), and implement stuck classification in a dedicated helper/service used by the System pages.

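A minimal sketch of the stuck classification described in this decision follows. Everything concrete in it is an assumption: the function name `isStuckRun()`, the threshold values (15 and 60 minutes), and the choice of reference timestamp (for example, creation time for `queued` runs and start time for `running` runs).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical classifier: a run is stuck once it has spent at least the
 * configured number of minutes in a non-terminal status.
 *
 * @param array<string, int> $thresholds Minutes per status, e.g. from config.
 */
function isStuckRun(
    string $status,
    DateTimeImmutable $since,
    DateTimeImmutable $now,
    array $thresholds
): bool {
    // Statuses without a threshold (terminal states) are never stuck.
    if (!isset($thresholds[$status])) {
        return false;
    }

    $ageMinutes = intdiv($now->getTimestamp() - $since->getTimestamp(), 60);

    return $ageMinutes >= $thresholds[$status];
}

// Assumed shape of the config entry; the values are placeholders.
$thresholds = ['queued' => 15, 'running' => 60];
```

Keeping the thresholds in one array keyed by status lets the `/system/ops/stuck` prefilter and the KPI counts share a single definition of "stuck".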
### Control Tower aggregation

- Spec 114 requires KPIs + top offenders in a selectable time window.

Decision:

- Use DB-only aggregation on `operation_runs` for the selected time window:
  - KPIs: counts by outcome/status, and “failed/stuck” counts
  - Top offenders: group by tenant/workspace for failed runs
- Default time window: 24h; supported: 1h/24h/7d

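The aggregation logic can be illustrated in memory before it is expressed as SQL. The sketch below is not the planned implementation, which would run the equivalent grouping in the database. The function name `summarizeRuns()`, the `failed` outcome value, and the use of Unix timestamps for `created_at` are assumptions, and for brevity it groups offenders by tenant only.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical in-memory equivalent of the windowed aggregation.
 *
 * @param list<array{outcome: string, tenant_id: ?int, created_at: int}> $runs
 * @return array{by_outcome: array<string, int>, top_offenders: array<int, int>}
 */
function summarizeRuns(array $runs, int $now, string $window = '24h'): array
{
    // Supported windows from the decision above, in seconds.
    $windows = ['1h' => 3600, '24h' => 86400, '7d' => 604800];

    if (!isset($windows[$window])) {
        throw new InvalidArgumentException("Unsupported window: {$window}");
    }

    $cutoff = $now - $windows[$window];
    $byOutcome = [];
    $failedByTenant = [];

    foreach ($runs as $run) {
        if ($run['created_at'] < $cutoff) {
            continue; // outside the selected window
        }

        $byOutcome[$run['outcome']] = ($byOutcome[$run['outcome']] ?? 0) + 1;

        // Tenantless runs count toward KPIs but not toward tenant offenders.
        if ($run['outcome'] === 'failed' && $run['tenant_id'] !== null) {
            $failedByTenant[$run['tenant_id']] = ($failedByTenant[$run['tenant_id']] ?? 0) + 1;
        }
    }

    arsort($failedByTenant); // most failures first; tenant ids stay as keys

    return ['by_outcome' => $byOutcome, 'top_offenders' => $failedByTenant];
}
```

Restricting the window to a fixed set of keys keeps user input out of the time arithmetic and makes the 24h default explicit in the signature.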
## Non-functional decisions (resolving “NEEDS CLARIFICATION”)

### Technical context (resolved)

- Language/runtime: PHP 8.4 (Laravel 12)
- Admin framework: Filament v5 + Livewire v4
- Storage: PostgreSQL (Sail locally)
- Testing: Pest v4
- Target: web app (server-rendered Livewire/Filament)

### Performance goals (assumptions, but explicit)

- System list pages are DB-only at render time; no external calls.
- Target: p95 < 1.0s for index pages at typical production volumes, using:
  - time-window defaults (24h)
  - pagination
  - indexes for `operation_runs(status,outcome,created_at,type,workspace_id,tenant_id)` and `audit_logs(action,recorded_at,actor_id)`

### Data minimization

- Default run detail surfaces only sanitized `failure_summary` + normalized `summary_counts`.
- `context` rendering remains sanitized/limited (avoid raw payload dumps by default).

## Alternatives considered

- New “SystemOperationRun” table: rejected; existing `OperationRun` is already the canonical monitoring artifact.
- Building Access Logs from web server logs: rejected; `AuditLog` already exists, is sanitized, and includes platform-auth events.