Feature branch PR for Spec 114. This branch contains the merged agent session work (see merge commit on branch). Tests - `vendor/bin/sail artisan test --compact tests/Feature/System/Spec114/` Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de> Reviewed-on: #139
Phase 0 — Research (Spec 114: System Console Control Tower)
Goal
Deliver a platform-operator “/system” control plane that is strictly separated from “/admin”, is metadata-only by default, and provides fast routing into canonical OperationRun detail.
Existing primitives (reuse)
System panel + plane separation
- `app/Providers/Filament/SystemPanelProvider.php`
  - Panel: `id=system`, `path=system`, `authGuard('platform')`
  - Uses `UseSystemSessionCookie` to isolate sessions from `/admin`
  - Uses middleware `ensure-correct-guard:platform` and capability gate `ensure-platform-capability:<ACCESS_SYSTEM_PANEL>`
- `app/Http/Middleware/UseSystemSessionCookie.php`
  - Implements Spec 114 clarification: separate session cookie name for `/system`
Authorization semantics (404 vs 403)
- Existing tests already enforce the clarified behavior:
  - Non-platform (wrong guard) → 404 (deny-as-not-found)
  - Platform user missing capability → 403
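
The 404 vs 403 rule above can be sketched as a small decision function. This is a framework-free illustration, not the project's middleware; the function name `systemPanelStatus()` and its parameters are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: maps a request's auth state to an HTTP status.
 * A wrong guard yields 404 so the /system plane is not revealed;
 * a platform user without the required capability yields 403.
 *
 * @param list<string> $capabilities capabilities held by the current user
 */
function systemPanelStatus(?string $guard, array $capabilities, string $required): int
{
    if ($guard !== 'platform') {
        return 404; // deny-as-not-found
    }

    if (!in_array($required, $capabilities, true)) {
        return 403; // right guard, missing capability
    }

    return 200;
}
```

Checking the guard first is what keeps non-platform users from learning that a capability check exists at all.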
Operation runs (Monitoring source of truth)
- `app/Models/OperationRun.php` + migrations under `database/migrations/*operation_runs*`
  - `workspace_id` is required; `tenant_id` is nullable (supports tenantless runs)
  - `failure_summary`, `summary_counts`, `context` are JSON arrays and already used in UI
- `app/Services/OperationRunService.php`
  - Canonical lifecycle transitions, summary-count normalization, failure sanitization
  - Has stale queued run helper (`isStaleQueuedRun()` + `failStaleQueuedRun()`)
- Canonical System run links: `app/Support/System/SystemOperationRunLinks.php` (index + view)
Sanitization / data minimization
- Failures: `app/Support/OpsUx/RunFailureSanitizer.php` (reason normalization + message redaction)
- Audit metadata: `app/Support/Audit/AuditContextSanitizer.php` (redacts token/secret/password-like keys + bearer/JWT strings)
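
For orientation, key-based redaction of the kind described above might look like the following. This is a minimal sketch and not the code in `AuditContextSanitizer`; the function name, the `[REDACTED]` placeholder, and the patterns are assumptions.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: redacts values under token/secret/password-like keys
 * and masks bearer tokens embedded in string values. Recurses into arrays.
 */
function redactSensitiveKeys(array $context): array
{
    $out = [];

    foreach ($context as $key => $value) {
        if (is_string($key) && preg_match('/token|secret|password/i', $key) === 1) {
            $out[$key] = '[REDACTED]'; // drop the value, keep the key for debugging
        } elseif (is_array($value)) {
            $out[$key] = redactSensitiveKeys($value);
        } elseif (is_string($value)) {
            // Assumed pattern for "Bearer <token>" strings.
            $out[$key] = preg_replace('/Bearer\s+[A-Za-z0-9._~+\/=-]+/i', 'Bearer [REDACTED]', $value);
        } else {
            $out[$key] = $value;
        }
    }

    return $out;
}
```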
Access logs signal source
- `app/Models/AuditLog.php`
  - System login auditing: `app/Filament/System/Pages/Auth/Login.php` writes `AuditLog` events with action `platform.auth.login`
  - Break-glass auditing: `app/Services/Auth/BreakGlassSession.php` writes `platform.break_glass.enter|exit|expired`
Key gaps to implement (Spec 114)
Navigation/IA
- Add System pages:
  - `/system/directory/workspaces` (+ detail)
  - `/system/directory/tenants` (+ detail)
  - `/system/ops/runs` (global); the canonical detail page already exists but is currently runbook-type scoped
  - `/system/ops/failures` (prefilter)
  - `/system/ops/stuck` (prefilter)
  - `/system/security/access-logs`
RBAC (platform capabilities)
- `app/Support/Auth/PlatformCapabilities.php` currently contains only Ops/runbooks/break-glass/core panel access.
- Spec 114 introduces additional capabilities (e.g. `platform.console.view`, `platform.directory.view`, `platform.operations.manage`).
Decision:
- Extend the `PlatformCapabilities` registry with Spec 114 capabilities and update system pages to gate via the registry constants (no raw strings).
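
A registry along these lines could look like the sketch below. The capability strings are the examples named above; the class name, constant names, and helper methods are assumptions and do not reflect the existing `PlatformCapabilities` class.

```php
<?php

declare(strict_types=1);

// Sketch only: constant names are hypothetical; string values come from the Spec 114 examples.
final class PlatformCapabilitiesSketch
{
    public const CONSOLE_VIEW = 'platform.console.view';
    public const DIRECTORY_VIEW = 'platform.directory.view';
    public const OPERATIONS_MANAGE = 'platform.operations.manage';

    /** @return list<string> */
    public static function all(): array
    {
        return [self::CONSOLE_VIEW, self::DIRECTORY_VIEW, self::OPERATIONS_MANAGE];
    }

    // Lets a test or gate reject typos that a raw string would silently accept.
    public static function isKnown(string $capability): bool
    {
        return in_array($capability, self::all(), true);
    }
}
```

Pages would then reference `PlatformCapabilitiesSketch::DIRECTORY_VIEW` rather than the literal string, so a misspelled capability fails at parse or test time instead of at runtime.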
Stuck definition
- There is a helper for “stale queued” in `OperationRunService`, but no “running too long” classification.
Decision:
- Introduce configurable stuck thresholds for `queued` and `running` (minutes) under a single config namespace (e.g. `config/tenantpilot.php`), and implement stuck classification in a dedicated helper/service used by the System pages.
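
The classification itself can stay very small. The sketch below is framework-free; the function name, the threshold values, and the choice of reference timestamp per status are assumptions, not decisions from the spec.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical classifier. Thresholds are in minutes, keyed by status.
 * Assumes the caller passes the timestamp the run entered its current
 * status (e.g. creation time for queued, start time for running).
 *
 * @param array<string, int> $thresholds
 */
function isStuckRun(string $status, DateTimeImmutable $since, DateTimeImmutable $now, array $thresholds): bool
{
    if (!isset($thresholds[$status])) {
        return false; // only statuses with a configured threshold can be stuck
    }

    $ageMinutes = intdiv($now->getTimestamp() - $since->getTimestamp(), 60);

    return $ageMinutes >= $thresholds[$status];
}

// Example values only; real values would come from the config namespace.
$exampleThresholds = ['queued' => 15, 'running' => 60];
```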
Control Tower aggregation
- Spec 114 requires KPIs + top offenders in a selectable time window.
Decision:
- Use DB-only aggregation on `operation_runs` for the selected time window:
  - KPIs: counts by outcome/status, and “failed/stuck” counts
  - Top offenders: group by tenant/workspace for failed runs
- Default time window: 24h; supported: 1h/24h/7d
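
Resolving the selected window to a lower bound for the aggregation queries might be sketched as follows. The function name is hypothetical, and falling back to 24h for unknown keys is an assumption about how invalid input should be handled.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical resolver: turns a window key (1h/24h/7d) into the earliest
 * timestamp to include. Missing or unknown keys fall back to the 24h default.
 */
function windowStart(?string $window, DateTimeImmutable $now): DateTimeImmutable
{
    $hours = ['1h' => 1, '24h' => 24, '7d' => 168];
    $span = $hours[$window ?? '24h'] ?? 24;

    return $now->sub(new DateInterval('PT' . $span . 'H'));
}
```

The returned value would typically feed a single `created_at >= ?` condition shared by the KPI and top-offender queries, so both always describe the same window.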
Non-functional decisions (resolving “NEEDS CLARIFICATION”)
Technical context (resolved)
- Language/runtime: PHP 8.4 (Laravel 12)
- Admin framework: Filament v5 + Livewire v4
- Storage: PostgreSQL (Sail locally)
- Testing: Pest v4
- Target: web app (server-rendered Livewire/Filament)
Performance goals (assumptions, but explicit)
- System list pages are DB-only at render time; no external calls.
- Target: p95 < 1.0s for index pages at typical production volumes, using:
  - time-window defaults (24h)
  - pagination
  - indexes for `operation_runs(status,outcome,created_at,type,workspace_id,tenant_id)` and `audit_logs(action,recorded_at,actor_id)`
Data minimization
- Default run detail surfaces only sanitized `failure_summary` + normalized `summary_counts`.
- `context` rendering remains sanitized/limited (avoid raw payload dumps by default).
Alternatives considered
- New “SystemOperationRun” table: rejected; existing `OperationRun` is already the canonical monitoring artifact.
- Building Access Logs from web server logs: rejected; `AuditLog` already exists, is sanitized, and includes platform-auth events.