TenantAtlas/specs/114-system-console-control-tower/research.md
ahmido 0cf612826f feat(114): system console control tower (merged) (#139)
Tests
- `vendor/bin/sail artisan test --compact tests/Feature/System/Spec114/`

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #139
2026-02-28 00:15:31 +00:00


Phase 0 — Research (Spec 114: System Console Control Tower)

Goal

Deliver a platform-operator “/system” control plane that is strictly separated from “/admin”, is metadata-only by default, and provides fast routing into canonical OperationRun detail.

Existing primitives (reuse)

System panel + plane separation

  • app/Providers/Filament/SystemPanelProvider.php
    • Panel: id=system, path=system, authGuard('platform')
    • Uses UseSystemSessionCookie to isolate sessions from /admin
    • Uses middleware ensure-correct-guard:platform and capability gate ensure-platform-capability:<ACCESS_SYSTEM_PANEL>
  • app/Http/Middleware/UseSystemSessionCookie.php
    • Implements Spec 114 clarification: separate session cookie name for /system

Authorization semantics (404 vs 403)

  • Existing tests already enforce the clarified behavior:
    • Non-platform (wrong guard) → 404 (deny-as-not-found)
    • Platform user missing capability → 403
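
The 404-versus-403 rule above can be illustrated with a small decision function. This is a minimal sketch, not the project's middleware: the function name, its parameters, and the capability string `access-system-panel` are hypothetical placeholders.

```php
<?php

declare(strict_types=1);

/**
 * Sketch of the deny-as-not-found rule for the /system plane.
 * Names are hypothetical; the real checks live in the ensure-correct-guard
 * and ensure-platform-capability middleware.
 *
 * @param list<string> $capabilities capabilities held by the current user
 */
function systemPanelStatus(?string $guard, array $capabilities, string $required): int
{
    // Wrong or missing guard: respond as if the route does not exist.
    if ($guard !== 'platform') {
        return 404;
    }

    // Correct guard but missing capability: explicit "forbidden".
    if (!in_array($required, $capabilities, true)) {
        return 403;
    }

    return 200;
}
```

Checking the guard first is what keeps the existence of `/system` hidden from non-platform users: they never reach the capability check that would produce a 403.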

Operation runs (Monitoring source of truth)

  • app/Models/OperationRun.php + migrations under database/migrations/*operation_runs*
    • workspace_id is required; tenant_id is nullable (supports tenantless runs)
    • failure_summary, summary_counts, and context are JSON arrays and are already used in the UI
  • app/Services/OperationRunService.php
    • Canonical lifecycle transitions, summary-count normalization, failure sanitization
    • Has a stale-queued-run helper (isStaleQueuedRun() + failStaleQueuedRun())
  • Canonical System run links:
    • app/Support/System/SystemOperationRunLinks.php (index + view)

Sanitization / data minimization

  • Failures: app/Support/OpsUx/RunFailureSanitizer.php (reason normalization + message redaction)
  • Audit metadata: app/Support/Audit/AuditContextSanitizer.php (redacts token/secret/password-like keys + bearer/JWT strings)
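
To make the redaction behavior described above concrete, here is a rough sketch of key-based and pattern-based redaction. It is not the code in `AuditContextSanitizer`; the function name, the regular expressions, and the placeholder strings are all assumptions chosen for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative redaction of sensitive-looking keys and bearer/JWT-shaped strings.
 * NOT the project's sanitizer: patterns and placeholders are assumptions.
 */
function redactContext(array $context): array
{
    $redacted = [];

    foreach ($context as $key => $value) {
        // Redact the whole value when the key name looks sensitive.
        if (is_string($key) && preg_match('/token|secret|password/i', $key) === 1) {
            $redacted[$key] = '[REDACTED]';
            continue;
        }

        // Recurse into nested arrays so deep keys are covered too.
        if (is_array($value)) {
            $redacted[$key] = redactContext($value);
            continue;
        }

        if (is_string($value)) {
            // Bearer credentials embedded in free text.
            $value = preg_replace('/Bearer\s+[A-Za-z0-9\-._~+\/]+=*/i', 'Bearer [REDACTED]', $value);
            // Three dot-separated base64url segments starting with "eyJ" (JWT-shaped).
            $value = preg_replace('/\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/', '[REDACTED_JWT]', $value);
        }

        $redacted[$key] = $value;
    }

    return $redacted;
}
```

Non-string scalars pass through unchanged, so counters and flags in the audit context stay readable.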

Access logs signal source

  • app/Models/AuditLog.php
  • System login auditing:
    • app/Filament/System/Pages/Auth/Login.php writes AuditLog events with action platform.auth.login
  • Break-glass auditing:
    • app/Services/Auth/BreakGlassSession.php writes platform.break_glass.enter|exit|expired

Key gaps to implement (Spec 114)

Navigation/IA

  • Add System pages:
    • /system/directory/workspaces (+ detail)
    • /system/directory/tenants (+ detail)
    • /system/ops/runs (global index); the canonical detail page already exists but is currently scoped to runbook-type runs
    • /system/ops/failures (prefilter)
    • /system/ops/stuck (prefilter)
    • /system/security/access-logs

RBAC (platform capabilities)

  • app/Support/Auth/PlatformCapabilities.php currently covers only Ops, runbooks, break-glass, and core panel access.
  • Spec 114 introduces additional capabilities (e.g. platform.console.view, platform.directory.view, platform.operations.manage).

Decision:

  • Extend PlatformCapabilities registry with Spec 114 capabilities and update system pages to gate via the registry constants (no raw strings).
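
One possible shape for the extended registry is sketched below. The class name and constant names are hypothetical; only the three string values come from the Spec 114 examples listed above, and the existing capabilities are left out.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch of the Spec 114 additions to the capability registry.
 * Constant names are assumptions; the string values are the spec's examples.
 */
final class PlatformCapabilitiesSketch
{
    public const CONSOLE_VIEW = 'platform.console.view';
    public const DIRECTORY_VIEW = 'platform.directory.view';
    public const OPERATIONS_MANAGE = 'platform.operations.manage';

    /** @return list<string> */
    public static function all(): array
    {
        return [self::CONSOLE_VIEW, self::DIRECTORY_VIEW, self::OPERATIONS_MANAGE];
    }

    /** Lets callers reject capability strings that are not in the registry. */
    public static function isKnown(string $capability): bool
    {
        return in_array($capability, self::all(), true);
    }
}
```

Pages would then reference `PlatformCapabilitiesSketch::DIRECTORY_VIEW` rather than the literal string, so a typo becomes an undefined-constant error instead of a silently failing gate.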

Stuck definition

  • There is a helper for “stale queued” in OperationRunService, but no “running too long” classification.

Decision:

  • Introduce configurable stuck thresholds for queued and running (minutes) under a single config namespace (e.g. config/tenantpilot.php), and implement stuck classification in a dedicated helper/service used by the System pages.
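
The classification could look roughly like the sketch below. The function name, the threshold keys (`queued_minutes`, `running_minutes`), the status strings, and the choice of `created_at` / `started_at` as reference timestamps are all assumptions, not the final design.

```php
<?php

declare(strict_types=1);

/**
 * Sketch of stuck classification with per-status thresholds in minutes.
 * Status values, threshold keys, and reference timestamps are assumptions.
 *
 * @param array{queued_minutes: int, running_minutes: int} $thresholds
 */
function classifyStuck(
    string $status,
    DateTimeImmutable $createdAt,
    ?DateTimeImmutable $startedAt,
    DateTimeImmutable $now,
    array $thresholds
): ?string {
    if ($status === 'queued') {
        // Queued runs are measured from creation time.
        $minutes = intdiv($now->getTimestamp() - $createdAt->getTimestamp(), 60);

        return $minutes >= $thresholds['queued_minutes'] ? 'stuck_queued' : null;
    }

    if ($status === 'running') {
        // Running runs are measured from start time, falling back to creation.
        $reference = $startedAt ?? $createdAt;
        $minutes = intdiv($now->getTimestamp() - $reference->getTimestamp(), 60);

        return $minutes >= $thresholds['running_minutes'] ? 'stuck_running' : null;
    }

    // Any other status (for example a finished run) is never stuck.
    return null;
}
```

In the application the thresholds would be read from the config namespace chosen above, for example an array such as `['queued_minutes' => 15, 'running_minutes' => 60]` (illustrative values only).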

Control Tower aggregation

  • Spec 114 requires KPIs + top offenders in a selectable time window.

Decision:

  • Use DB-only aggregation on operation_runs for the selected time window:
    • KPIs: counts by outcome/status, and “failed/stuck” counts
    • Top offenders: group by tenant/workspace for failed runs
    • Default time window: 24h; supported: 1h/24h/7d
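
The semantics of this aggregation can be illustrated in memory. This sketch only shows what is counted. Per the decision above, the real implementation would be a grouped query on operation_runs. The function name, row keys, Unix-timestamp representation, and the `failed` outcome value are assumptions, and it groups by tenant only.

```php
<?php

declare(strict_types=1);

/**
 * In-memory illustration of the KPI / top-offender semantics.
 * Row keys and the 'failed' outcome value are assumptions.
 *
 * @param list<array{outcome: string, tenant_id: ?int, created_at: int}> $runs
 * @return array{kpis: array<string, int>, top_offenders: array<int, int>}
 */
function aggregateRuns(array $runs, string $window, int $now, int $limit = 5): array
{
    $windows = ['1h' => 3600, '24h' => 86400, '7d' => 604800];
    // Unknown window keys fall back to the 24h default.
    $since = $now - ($windows[$window] ?? $windows['24h']);

    $kpis = [];
    $failedByTenant = [];

    foreach ($runs as $run) {
        if ($run['created_at'] < $since) {
            continue; // outside the selected time window
        }

        $kpis[$run['outcome']] = ($kpis[$run['outcome']] ?? 0) + 1;

        // Tenantless runs count toward KPIs but not toward tenant offenders.
        if ($run['outcome'] === 'failed' && $run['tenant_id'] !== null) {
            $failedByTenant[$run['tenant_id']] = ($failedByTenant[$run['tenant_id']] ?? 0) + 1;
        }
    }

    arsort($failedByTenant); // highest failure count first, keys preserved

    return [
        'kpis' => $kpis,
        'top_offenders' => array_slice($failedByTenant, 0, $limit, true),
    ];
}
```

Because tenant_id is nullable, the sketch skips tenantless runs in the offender ranking; a workspace-level ranking would group on workspace_id instead, which is always present.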

Non-functional decisions (resolving “NEEDS CLARIFICATION”)

Technical context (resolved)

  • Language/runtime: PHP 8.4 (Laravel 12)
  • Admin framework: Filament v5 + Livewire v4
  • Storage: PostgreSQL (Sail locally)
  • Testing: Pest v4
  • Target: web app (server-rendered Livewire/Filament)

Performance goals (assumptions, but explicit)

  • System list pages are DB-only at render time; no external calls.
  • Target: p95 < 1.0s for index pages at typical production volumes, using:
    • time-window defaults (24h)
    • pagination
    • indexes for operation_runs(status,outcome,created_at,type,workspace_id,tenant_id) and audit_logs(action,recorded_at,actor_id)

Data minimization

  • Default run detail surfaces only sanitized failure_summary + normalized summary_counts.
  • context rendering remains sanitized/limited (avoid raw payload dumps by default).

Alternatives considered

  • New “SystemOperationRun” table: rejected; existing OperationRun is already the canonical monitoring artifact.
  • Building Access Logs from web server logs: rejected; AuditLog already exists, is sanitized, and includes platform-auth events.