TenantAtlas/specs/114-system-console-control-tower/spec.md
Ahmed Darrazi 875528cd35 feat(114): system console control tower
Implements Spec 114 System Console Control Tower pages, widgets, triage actions, directory views, and enterprise polish (badges, repair workspace owners table, health indicator).
2026-02-27 17:28:09 +01:00

Feature Specification: System Console Control Tower (Platform Operator)

Feature Branch: 114-system-console-control-tower
Created: 2026-02-27
Status: Draft
Input: Spec 114 — System Console Control Tower for platform operators

Spec Scope Fields (mandatory)

  • Scope: canonical-view
  • Primary Routes:
    • /system (alias) and /system/dashboard (Control Tower)
    • /system/directory/workspaces + workspace detail
    • /system/directory/tenants + tenant detail
    • /system/ops/runs + canonical run detail (/system/ops/runs/{run})
    • /system/ops/failures (prefilter)
    • /system/ops/stuck (prefilter)
    • /system/security/access-logs
  • Data Ownership: Platform-owned operational metadata across workspaces/tenants (health signals, run metadata, audit/access events). No customer policy payloads, secrets, or PII are presented by default.
  • RBAC:
    • Access is limited to platform users only (platform guard).
    • Capability-based access:
      • platform.console.view
      • platform.directory.view
      • platform.operations.view
      • platform.operations.manage (enabled in v1)
      • platform.runbooks.view / platform.runbooks.run (integration point with Spec 113)

For canonical-view specs, the spec MUST define:

  • Default filter behavior when tenant-context is active: Not applicable. /system has no tenant-context; it is platform-only.
  • Explicit entitlement checks preventing cross-tenant leakage:
    • Any request not authenticated as a platform user is treated as “not found” (deny-as-not-found).
    • Listing and detail access is always gated by capabilities (view vs manage) and only exposes non-sensitive metadata.

Clarifications

Session 2026-02-27

  • Q: Which session isolation should v1 implement for /system (SR-006)? → A: Same domain, but separate session cookie name for /system.
  • Q: Should manage actions (Retry/Cancel/Mark investigated) be active in v1? → A: Yes. platform.operations.manage is in v1 with: Retry (retryable only), Cancel (supported only), Mark investigated (reason required).
  • Q: Which 404 vs 403 semantics apply for /system? → A: Non-platform / wrong guard returns 404; platform user missing capability returns 403.
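
The 404 vs 403 answer above can be read as a two-step decision. The following is a minimal Python sketch of that decision only, not the console's actual implementation; the `Requester` type and the `system_response_status` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Requester:
    """Hypothetical request identity; field names are illustrative only."""
    is_platform_user: bool
    capabilities: frozenset = field(default_factory=frozenset)


def system_response_status(requester: Optional[Requester], required_capability: str) -> int:
    """Return the HTTP status a /system/* route would answer with."""
    # Unauthenticated, wrong guard, or non-platform user: deny-as-not-found,
    # so the response does not reveal that the console exists.
    if requester is None or not requester.is_platform_user:
        return 404
    # Authenticated platform user who lacks the capability: explicit 403.
    if required_capability not in requester.capabilities:
        return 403
    return 200
```

Checking the guard before the capability matters here: a non-platform user who happens to hold a matching capability string still receives 404.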

User Scenarios & Testing (mandatory)

User Story 1 - Global Health & Triage Entry (Priority: P1)

As a platform operator, I want a single Control Tower view that summarizes platform health and routes me to the most urgent issues, so I can triage failures quickly without exposing customer-sensitive data.

Why this priority: This is the primary operator workflow (“what’s broken right now?”) and the first screen that enables faster incident response.

Independent Test: A platform user can open the Control Tower, see KPIs/top offenders for a selected time window, and click through to a canonical run detail page.

Acceptance Scenarios:

  1. Given a platform user with platform.console.view, When they open the Control Tower, Then they see KPI counts and “Top offenders” summaries for the selected time window.
  2. Given a failed operation exists, When they click a “recently failed operation”, Then they land on the canonical run detail page.
  3. Given a non-platform user, When they request any /system/* URL, Then the system does not reveal that the console exists.

User Story 2 - Directory for Workspaces & Tenants (Priority: P2)

As a platform support engineer, I want a directory of workspaces and tenants with health signals and recent activity, so I can route issues to the right tenant/workspace and quickly inspect recent operations.

Why this priority: Most incidents are tenant-scoped; fast routing depends on a reliable cross-tenant directory with minimal data exposure.

Independent Test: A platform user can list workspaces/tenants, open details, and jump to run listings filtered to that tenant/workspace.

Acceptance Scenarios:

  1. Given a platform user with platform.directory.view, When they view the Workspaces index, Then they can sort and filter by health and activity, and navigate to workspace details.
  2. Given a tenant, When they view tenant details, Then they see connectivity/permissions status and recent operations as metadata-only summaries.
  3. Given the UI provides an “Open in /admin” link, When a platform user clicks it, Then it is a plain URL only (no auto-login, no session bridging).

User Story 3 - Operations Triage Actions & Auditability (Priority: P3)

As a privileged platform operator, I want to take safe triage actions on failed or stuck operation runs (retry/cancel/mark investigated), so I can restore platform health with guardrails and complete audit trails.

Why this priority: Operational actions are high-risk; they must be permission-gated and auditable.

Independent Test: A platform user with platform.operations.manage can perform an allowed triage action and observe that it is recorded, while a view-only user cannot.

Acceptance Scenarios:

  1. Given a platform user without platform.operations.manage, When they view failures/stuck runs, Then they can inspect but cannot execute triage actions.
  2. Given a platform user with platform.operations.manage, When they retry a retryable run, Then a new run is initiated and linked to the original for traceability.
  3. Given a triage action is destructive or high blast-radius, When the operator attempts it, Then they must explicitly confirm (and provide a reason where required) before it executes.
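
The retry scenario above (permission gate, explicit confirmation, new run linked to the original) could look roughly like the following Python sketch. The `Run` fields, the `retry_of` link, the `Forbidden` exception, and the in-memory ID counter are all assumptions made for illustration; the spec does not define the run model's schema.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_run_ids = count(1)  # hypothetical in-memory ID source for the sketch


@dataclass
class Run:
    id: int
    run_type: str
    status: str
    retry_of: Optional[int] = None  # assumed link back to the original run


class Forbidden(Exception):
    """Raised when the operator lacks platform.operations.manage."""


def retry_run(original: Run, capabilities: set, confirmed: bool, retryable_types: set) -> Run:
    # View-only operators may inspect runs but cannot execute triage actions.
    if "platform.operations.manage" not in capabilities:
        raise Forbidden("missing platform.operations.manage")
    # State-changing actions require explicit confirmation.
    if not confirmed:
        raise ValueError("explicit confirmation required")
    # Retry is offered only for retryable run types.
    if original.run_type not in retryable_types:
        raise ValueError("run type is not retryable")
    # The retry is a new run that points at the original for traceability.
    return Run(id=next(_run_ids), run_type=original.run_type,
               status="queued", retry_of=original.id)
```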

Edge Cases

  • Large volumes of runs and tenants: list pages still load within an acceptable wait time and do not degrade into partial/inconsistent results.
  • Missing or unknown health inputs: health is shown as “Unknown” or equivalent, not as a false “OK”.
  • Stuck classification boundaries: a run right on the threshold is classified consistently.
  • Sanitization: error/context summaries never reveal tokens, secrets, or policy payloads.
  • Break-glass mode: all pages show an unmistakable banner and actions include the break-glass marker.
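
To make the stuck-classification boundary case concrete, here is a minimal Python sketch. The threshold values, the status strings, and the choice of an inclusive comparison (a run exactly at the threshold counts as stuck) are assumptions; the spec only requires that the thresholds be configurable and that boundary runs be classified consistently.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical defaults; the spec only says the thresholds are configurable.
QUEUED_THRESHOLD = timedelta(minutes=15)
RUNNING_THRESHOLD = timedelta(minutes=60)


def stuck_reason(
    status: str,
    queued_at: datetime,
    started_at: Optional[datetime],
    now: datetime,
    queued_threshold: timedelta = QUEUED_THRESHOLD,
    running_threshold: timedelta = RUNNING_THRESHOLD,
) -> Optional[str]:
    """Return why a run is considered stuck, or None if it is not."""
    # Inclusive comparison (assumption): a run exactly on the threshold is
    # always stuck, so every caller classifies the boundary the same way.
    if status == "queued" and now - queued_at >= queued_threshold:
        return "queued_too_long"
    if status == "running" and started_at is not None and now - started_at >= running_threshold:
        return "running_too_long"
    return None
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps list pages and detail pages consistent within one request and makes the boundary testable.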

Requirements (mandatory)

Functional Requirements

  • FR-001 — Control Tower Dashboard (Global Health): The system MUST provide a Control Tower dashboard showing platform health within a selectable time window (default 24h; options include 1h/24h/7d), including KPI counts and “Top offenders” summaries.
  • FR-002 — Directory: Workspaces: The system MUST provide a Workspaces index and workspace detail view that shows tenant counts, a health badge (OK/Warn/Critical/Unknown), last activity, and quick links to relevant views.
  • FR-003 — Directory: Tenants: The system MUST provide a Tenants index and tenant detail view that shows provider connectivity status, permissions status, last sync/compare summaries as counts/metadata only, and runbook shortcuts where available.
  • FR-004 — Operations: Global Runs + Canonical Run Detail: The system MUST provide a global operation runs view with filtering (status/type/workspace/tenant/time window/actor) and a single canonical run detail page used by all “View run” links.
  • FR-005 — Failures View (Prefiltered): The system MUST provide a failures view that prefilters to failed runs and groups/summarizes failures by run type and by tenant, enabling 1-2 click routing into run details.
  • FR-006 — Stuck Runs Definition & View: The system MUST define and surface “stuck” runs based on configurable thresholds for “queued too long” and “running too long”, and present an operator view for investigating them. Any triage actions available from this surface MUST follow FR-006a.
  • FR-006a — Triage Actions (v1 enabled): For operators with platform.operations.manage, the system MUST provide triage actions in failures/stuck/run detail views, constrained as follows: Retry is available only for retryable run types; Cancel is available only where the run supports cancelation; “Mark investigated” requires a reason/note.
  • FR-007 — Runbook Shortcuts Integration: The system MUST provide navigation to runbooks from the System Console navigation. The UI MAY provide scope-aware shortcuts from tenant/workspace/run details. If runbooks are not available yet, the UI MAY show “coming soon” placeholders.
  • FR-008 — Access Logs (Security, minimal v1): The system MUST provide an access log view for platform users that supports filtering by user/time/outcome and includes login successes/failures and break-glass activation events.
  • FR-009 — Export (optional): The system MAY allow exporting filtered run metadata as CSV without including sensitive context. (Deferred in v1.)
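
The constraints in FR-006a can be expressed as a small availability check. This Python sketch is illustrative only: `RunInfo`, its `retryable` and `cancelable` flags, and the action identifiers are hypothetical names, and how a run type declares retry or cancel support is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunInfo:
    """Hypothetical view of what a run supports."""
    retryable: bool
    cancelable: bool


def available_triage_actions(run: RunInfo, capabilities: frozenset) -> list:
    """List the triage actions an operator may see for a run."""
    # Without the manage capability no triage action is offered at all.
    if "platform.operations.manage" not in capabilities:
        return []
    actions = []
    if run.retryable:
        actions.append("retry")
    if run.cancelable:
        actions.append("cancel")
    # "Mark investigated" is always offered but needs a reason on submit.
    actions.append("mark_investigated")
    return actions


def validate_mark_investigated(reason: str) -> str:
    """Reject empty or whitespace-only reasons; return the trimmed reason."""
    cleaned = reason.strip()
    if not cleaned:
        raise ValueError("a reason is required to mark a run as investigated")
    return cleaned
```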

Security, Privacy, and Guardrails

  • SR-001 — Guard Isolation: /system MUST be accessible exclusively to platform users; non-platform access (wrong guard or unauthenticated) MUST behave as “not found” and MUST not reveal the presence of the console.
  • SR-001a — 404 vs 403 Semantics: The system MUST apply the following response semantics consistently across /system/*:
    • Wrong guard / unauthenticated / not a platform user → 404 (deny-as-not-found)
    • Platform user authenticated but missing required capability → 403
  • SR-002 — Authentication Hardening: The system MUST throttle excessive /system login attempts and MUST record failed attempts for later review. The v1 throttle policy is a maximum of 10 failed attempts per 60 seconds per IP address and email combination (throttle key: system-login:{ip}:{normalizedEmail}). Each failed attempt is recorded with a reason (e.g., invalid_credentials, inactive, throttled) under the platform.auth.login audit action.
  • SR-003 — Data Minimization by Default: /system MUST avoid sensitive content by default (no raw policy payloads, secrets, tokens, or PII). Only counts, status badges, and sanitized summaries are shown.
  • SR-004 — Sensitive Drilldowns: v1 MUST NOT provide raw error/context payload inspection in /system. If raw inspection is introduced later, it MUST be restricted behind elevated capability and require an operator-provided reason.
  • SR-005 — Break-Glass Guardrails: When break-glass mode is active, the UI MUST show a persistent banner, require a reason, and annotate actions/logs as break-glass.
  • SR-006 — Session Isolation: /system MUST use a separate session cookie name (distinct from /admin) to reduce cross-plane session coupling. /system MUST NOT reuse the customer/admin session cookie.
  • SR-007 — Manage Action Guardrails: Any triage action that mutates state (retry/cancel/mark investigated) MUST be restricted to platform.operations.manage, MUST require explicit confirmation, and MUST record an audit trail including actor, scope, target run, and operator-provided reason where applicable.
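
The SR-002 throttle policy (10 failed attempts per 60 seconds, keyed by IP and normalized email) can be sketched as below. This is an in-memory Python illustration under stated assumptions: the sliding-window strategy, the trim-and-lowercase email normalization, and the `LoginThrottle` class are not taken from the spec, which only fixes the limit, the window, and the key format.

```python
import time
from collections import defaultdict, deque

MAX_FAILED_ATTEMPTS = 10
WINDOW_SECONDS = 60


def throttle_key(ip: str, email: str) -> str:
    # Normalization by trimming and lowercasing is an assumption.
    return f"system-login:{ip}:{email.strip().lower()}"


class LoginThrottle:
    """Sliding-window counter of failed logins (illustrative, in-memory)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = defaultdict(deque)

    def _prune(self, key: str) -> deque:
        # Drop failures that have aged out of the 60-second window.
        cutoff = self._clock() - WINDOW_SECONDS
        attempts = self._failures[key]
        while attempts and attempts[0] <= cutoff:
            attempts.popleft()
        return attempts

    def is_throttled(self, ip: str, email: str) -> bool:
        return len(self._prune(throttle_key(ip, email))) >= MAX_FAILED_ATTEMPTS

    def record_failure(self, ip: str, email: str) -> None:
        self._prune(throttle_key(ip, email)).append(self._clock())
```

A login handler built on this would call `is_throttled` before checking credentials and `record_failure` after each failed attempt; writing the reason to the platform.auth.login audit action is left out of the sketch.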

Assumptions

  • A platform operator console exists as a separate plane from customer administration, and customer users must never see maintenance/ops screens.
  • Operation execution is routed through a single auditable run model (operator actions are “initiated” and traceable).
  • Health statuses are computed from multiple signals using a “worst wins” rule.
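
As an illustration of the "worst wins" assumption, the following Python sketch rolls several signals up into one badge. The `Health` enum and `rollup_health` function are hypothetical, and the handling of missing signals (a missing input yields "Unknown" unless a known signal is already Warn or Critical) is one possible reading of the requirement that unknown health never shows as a false "OK".

```python
from enum import IntEnum
from typing import Iterable, Optional


class Health(IntEnum):
    OK = 0
    WARN = 1
    CRITICAL = 2


def rollup_health(signals: Iterable[Optional[Health]]) -> str:
    """Combine signals into one badge; None stands for a missing input."""
    signals = list(signals)
    known = [s for s in signals if s is not None]
    worst = max(known, default=None)
    # Worst wins: any known Warn or Critical decides the badge.
    if worst is not None and worst > Health.OK:
        return worst.name.title()  # "Warn" or "Critical"
    # No inputs, or some inputs missing: never report a false "OK".
    if not signals or len(known) < len(signals):
        return "Unknown"
    return "OK"
```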

UI Action Matrix (mandatory when System Console UI is changed)

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Control Tower Dashboard | /system/dashboard | Time window switcher | “Recently failed operations” items link to run detail | None | None | View operation runs | N/A | N/A | Yes (access) | Read-only KPIs and offender summaries; no sensitive payloads |
| Workspaces Index | /system/directory/workspaces | None | Click workspace name to open details | None | None | Clear filters | N/A | N/A | Yes (access) | Supports sort/filter by health/activity/tenant count |
| Workspace Detail | /system/directory/workspaces/{workspace} | “View tenants”, “View runs (filtered)” | Tenant list items link to tenant detail; runs link to canonical run detail | None | None | View runs (filtered) | N/A | N/A | Yes (access) | “Open in /admin” is URL-only; no session bridging |
| Tenants Index | /system/directory/tenants | None | Click tenant name to open details | None | None | Clear filters | N/A | N/A | Yes (access) | Shows health signals as badges and counts |
| Tenant Detail | /system/directory/tenants/{tenant} | Runbook shortcuts (if entitled) | Recent operations list links to canonical run detail | Optional: “Run health check” / “Run sync” (max 2 visible; can be “coming soon”) | None | View operation runs | N/A | N/A | Yes | Runbook actions require confirmation and capability gating |
| Operation Runs | /system/ops/runs | Filters | Row click or “View run” link to canonical run detail | “Retry” / “Cancel” (manage only; availability depends on run type/support) | “Retry selected”, “Cancel selected” (manage only; constrained) | Clear filters | N/A | N/A | Yes | Actions require explicit confirmation, may require reason, and are fully auditable |
| Run Detail (Canonical) | /system/ops/runs/{run} | “Related tenant/workspace”, “Similar failures”, “Go to runbooks” | Links to filtered views | “Retry” / “Cancel” (manage only; constrained) | None | N/A | N/A | N/A | Yes | Context/error panels are sanitized by default in v1 (raw drilldowns are not available in v1) |
| Failures View | /system/ops/failures | Filters | Links to canonical run detail and tenant/workspace | “Retry” (manage only; retryable only) | “Retry selected” (manage only; retryable only) | Clear filters | N/A | N/A | Yes | Pre-filter to failed; 1-2 click routing |
| Stuck Runs View | /system/ops/stuck | Filters | Links to canonical run detail | “Cancel”, “Mark investigated” (manage only; cancel only if supported) | “Cancel selected” (manage only; constrained) | Clear filters | N/A | N/A | Yes | “Mark investigated” requires a note/reason |
| Access Logs | /system/security/access-logs | Filters | None | None | None | Clear filters | N/A | N/A | Yes | Minimal v1 security visibility |

Audit log interpretation: In this matrix, “Audit log?” means security/audit events are visible via the Access Logs surface (login successes/failures, break-glass activation, and operator triage actions). It does not imply per-page view logging for every /system page.

Key Entities (include if feature involves data)

  • Operation Run: An auditable record of an operational activity, including type, scope (platform/workspace/tenant), actor, start/end timestamps, status/outcome, and a sanitized summary.
  • Workspace: A customer workspace container, used for grouping tenants and operational scope.
  • Tenant: A customer tenant within a workspace, including provider connectivity and governance signal summaries.
  • Platform User: An internal operator identity with capability-based authorization.
  • Access Log: A record of platform access and authentication-related security events.

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: Platform operators can identify the top failing tenant and open the related canonical run detail in ≤ 2 clicks from the Control Tower.
  • SC-002: The Control Tower and directory pages load in p95 < 1.0s for typical production volumes.
  • SC-003: In a structured review of the /system UI, no customer-sensitive payloads (policy content, secrets, tokens, PII) are visible by default.
  • SC-004: 100% of operator triage actions (retry/cancel/mark investigated) are permission-gated and leave a complete audit trail.
  • SC-005: Non-platform users cannot discover /system routes via direct URL guessing (console behaves as not found).