ahmido 0cf612826f feat(114): system console control tower (merged) (#139)

Feature branch PR for Spec 114.

This branch contains the merged agent session work (see merge commit on branch).

Tests
- `vendor/bin/sail artisan test --compact tests/Feature/System/Spec114/`

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #139

2026-02-28 00:15:31 +00:00


Phase 0 — Research (Spec 114: System Console Control Tower)

Goal

Deliver a platform-operator “/system” control plane that is strictly separated from “/admin”, is metadata-only by default, and provides fast routing into canonical OperationRun detail.

Existing primitives (reuse)

System panel + plane separation

app/Providers/Filament/SystemPanelProvider.php
- Panel: id=system, path=system, authGuard('platform')
- Uses UseSystemSessionCookie to isolate sessions from /admin
- Uses middleware ensure-correct-guard:platform and capability gate ensure-platform-capability:<ACCESS_SYSTEM_PANEL>
app/Http/Middleware/UseSystemSessionCookie.php
- Implements Spec 114 clarification: separate session cookie name for /system

Authorization semantics (404 vs 403)

Existing tests already enforce the clarified behavior:
- Non-platform (wrong guard) → 404 (deny-as-not-found)
- Platform user missing capability → 403
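
The two denial modes above can be summarised as a small decision function. This is a minimal sketch for illustration only: the function name `systemPanelStatus` and its boolean parameters are hypothetical, and the real checks live in the ensure-correct-guard and ensure-platform-capability middleware.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the codebase): maps the guard check and
 * the capability check to the HTTP status described in this section.
 */
function systemPanelStatus(bool $isPlatformGuard, bool $hasCapability): int
{
    // Wrong guard: deny-as-not-found, so /system does not reveal its existence.
    if (! $isPlatformGuard) {
        return 404;
    }

    // Correct guard but missing capability: explicit forbidden.
    if (! $hasCapability) {
        return 403;
    }

    return 200;
}
```

Checking the guard first matters: a non-platform user should see 404 even for routes whose capability they would also lack.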

Operation runs (Monitoring source of truth)

app/Models/OperationRun.php + migrations under database/migrations/*operation_runs*
- workspace_id is required; tenant_id is nullable (supports tenantless runs)
- failure_summary, summary_counts, and context are JSON arrays and are already used in the UI
app/Services/OperationRunService.php
- Canonical lifecycle transitions, summary-count normalization, failure sanitization
- Has a stale-queued-run helper (isStaleQueuedRun() + failStaleQueuedRun())
Canonical System run links:
- app/Support/System/SystemOperationRunLinks.php (index + view)

Sanitization / data minimization

Failures: app/Support/OpsUx/RunFailureSanitizer.php (reason normalization + message redaction)
Audit metadata: app/Support/Audit/AuditContextSanitizer.php (redacts token/secret/password-like keys + bearer/JWT strings)
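
To make the redaction rules concrete, here is a rough sketch of key-based and pattern-based redaction. It is not the project's AuditContextSanitizer: the function name `redactContext`, the placeholder strings, and the exact regular expressions are assumptions chosen for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical redaction sketch (assumed name and patterns).
 *
 * @param array<int|string, mixed> $context
 * @return array<int|string, mixed>
 */
function redactContext(array $context): array
{
    $out = [];

    foreach ($context as $key => $value) {
        if (is_string($key) && preg_match('/token|secret|password/i', $key) === 1) {
            // Sensitive-looking key: drop the value entirely.
            $out[$key] = '[REDACTED]';
        } elseif (is_array($value)) {
            // Recurse into nested structures.
            $out[$key] = redactContext($value);
        } elseif (is_string($value)) {
            // Bearer credentials first, then bare JWT-shaped strings.
            $value = preg_replace('/Bearer\s+[A-Za-z0-9._~+\/-]+=*/i', 'Bearer [REDACTED]', $value);
            $out[$key] = preg_replace(
                '/\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/',
                '[REDACTED_JWT]',
                $value
            );
        } else {
            $out[$key] = $value;
        }
    }

    return $out;
}
```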

Access logs signal source

app/Models/AuditLog.php
System login auditing:
- app/Filament/System/Pages/Auth/Login.php writes AuditLog events with action platform.auth.login
Break-glass auditing:
- app/Services/Auth/BreakGlassSession.php writes platform.break_glass.enter|exit|expired

Key gaps to implement (Spec 114)

Navigation/IA

Add System pages:
- /system/directory/workspaces (+ detail)
- /system/directory/tenants (+ detail)
- /system/ops/runs (global); the canonical detail view already exists but is currently scoped to runbook-type runs
- /system/ops/failures (prefilter)
- /system/ops/stuck (prefilter)
- /system/security/access-logs

RBAC (platform capabilities)

app/Support/Auth/PlatformCapabilities.php currently contains only the Ops, runbooks, break-glass, and core panel access capabilities.
Spec 114 introduces additional capabilities (e.g. platform.console.view, platform.directory.view, platform.operations.manage).

Decision:

Extend PlatformCapabilities registry with Spec 114 capabilities and update system pages to gate via the registry constants (no raw strings).
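
A registry of this kind might look like the sketch below. The capability strings come from the examples above; the class name `PlatformCapabilitiesSketch`, the constant names, and the helper methods are hypothetical and do not describe the existing PlatformCapabilities class.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical registry sketch: constants replace raw capability strings
 * so that pages and tests reference a single source of truth.
 */
final class PlatformCapabilitiesSketch
{
    public const CONSOLE_VIEW = 'platform.console.view';
    public const DIRECTORY_VIEW = 'platform.directory.view';
    public const OPERATIONS_MANAGE = 'platform.operations.manage';

    /** @return list<string> */
    public static function all(): array
    {
        return [
            self::CONSOLE_VIEW,
            self::DIRECTORY_VIEW,
            self::OPERATIONS_MANAGE,
        ];
    }

    /** Returns true only for capabilities declared in this registry. */
    public static function isKnown(string $capability): bool
    {
        return in_array($capability, self::all(), true);
    }
}
```

Gating a page on `PlatformCapabilitiesSketch::DIRECTORY_VIEW` rather than a literal string means a typo fails at the language level instead of silently denying access.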

Stuck definition

There is a helper for “stale queued” in OperationRunService, but no “running too long” classification.

Decision:

Introduce configurable stuck thresholds for queued and running (minutes) under a single config namespace (e.g. config/tenantpilot.php), and implement stuck classification in a dedicated helper/service used by the System pages.
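
As a rough illustration of the classification, the sketch below treats a run as stuck when it has stayed in a non-terminal status longer than that status's threshold. The function name `isStuckRun`, the status strings 'queued' and 'running', and the default thresholds (15 and 60 minutes) are assumptions; the real values would come from the config namespace mentioned above.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stuck classifier.
 *
 * $since stands for whichever timestamp marks entry into the current status
 * (for example, creation time for queued runs, start time for running ones).
 */
function isStuckRun(
    string $status,
    DateTimeImmutable $since,
    DateTimeImmutable $now,
    int $queuedMinutes = 15,
    int $runningMinutes = 60,
): bool {
    $thresholds = ['queued' => $queuedMinutes, 'running' => $runningMinutes];

    // Terminal or unknown statuses are never classified as stuck.
    if (! isset($thresholds[$status])) {
        return false;
    }

    $ageSeconds = $now->getTimestamp() - $since->getTimestamp();

    return $ageSeconds > $thresholds[$status] * 60;
}
```

Passing `$now` explicitly keeps the helper deterministic in tests, which suits the Pest-based suite used elsewhere in the project.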

Control Tower aggregation

Spec 114 requires KPIs + top offenders in a selectable time window.

Decision:

Use DB-only aggregation on operation_runs for the selected time window:
- KPIs: counts by outcome/status, and “failed/stuck” counts
- Top offenders: group by tenant/workspace for failed runs
- Default time window: 24h; supported: 1h/24h/7d
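
The window handling can be sketched independently of the queries themselves. The example below covers only the window-selection step: it resolves a requested window to a start timestamp and falls back to the 24h default for missing or unsupported values. The constant and function names are hypothetical.

```php
<?php

declare(strict_types=1);

// Supported windows from the decision above, mapped to seconds (assumed names).
const SUPPORTED_WINDOWS = ['1h' => 3600, '24h' => 86400, '7d' => 604800];
const DEFAULT_WINDOW = '24h';

/**
 * Hypothetical helper: returns the lower bound of the selected time window.
 * Unsupported or missing values fall back to the default window.
 */
function windowStart(?string $window, DateTimeImmutable $now): DateTimeImmutable
{
    $key = ($window !== null && array_key_exists($window, SUPPORTED_WINDOWS))
        ? $window
        : DEFAULT_WINDOW;

    return $now->setTimestamp($now->getTimestamp() - SUPPORTED_WINDOWS[$key]);
}
```

The resulting timestamp would then serve as the lower bound on the run timestamp in each aggregate query, so KPIs and top offenders share the same window.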

Non-functional decisions (resolving “NEEDS CLARIFICATION”)

Technical context (resolved)

Language/runtime: PHP 8.4 (Laravel 12)
Admin framework: Filament v5 + Livewire v4
Storage: PostgreSQL (Sail locally)
Testing: Pest v4
Target: web app (server-rendered Livewire/Filament)

Performance goals (assumptions, but explicit)

System list pages are DB-only at render time; no external calls.
Target: p95 < 1.0s for index pages at typical production volumes, using:
- time-window defaults (24h)
- pagination
- indexes for operation_runs(status,outcome,created_at,type,workspace_id,tenant_id) and audit_logs(action,recorded_at,actor_id)

Data minimization

Default run detail surfaces only sanitized failure_summary + normalized summary_counts.
context rendering remains sanitized/limited (avoid raw payload dumps by default).

Alternatives considered

New “SystemOperationRun” table: rejected; existing OperationRun is already the canonical monitoring artifact.
Building Access Logs from web server logs: rejected; AuditLog already exists, is sanitized, and includes platform-auth events.
