
# Feature Specification: System Console Control Tower (Platform Operator)

**Feature Branch**: `114-system-console-control-tower`
**Created**: 2026-02-27
**Status**: Draft
**Input**: Spec 114: System Console Control Tower for platform operators

## Spec Scope Fields *(mandatory)*

- **Scope**: canonical-view
- **Primary Routes**:
  - `/system` (alias) and `/system/dashboard` (Control Tower)
  - `/system/directory/workspaces` + workspace detail
  - `/system/directory/tenants` + tenant detail
  - `/system/ops/runs` + canonical run detail (`/system/ops/runs/{run}`)
  - `/system/ops/failures` (prefilter)
  - `/system/ops/stuck` (prefilter)
  - `/system/security/access-logs`
- **Data Ownership**: Platform-owned operational metadata across workspaces/tenants (health signals, run metadata, audit/access events). No customer policy payloads, secrets, or PII are presented by default.
- **RBAC**:
  - Access is limited to platform users only (platform guard).
  - Capability-based access:
    - `platform.console.view`
    - `platform.directory.view`
    - `platform.operations.view`
    - `platform.operations.manage` (enabled in v1)
    - `platform.runbooks.view` / `platform.runbooks.run` (integration point with Spec 113)

For canonical-view specs, the spec MUST define:

- **Default filter behavior when tenant-context is active**: Not applicable. `/system` has no tenant-context; it is platform-only.
- **Explicit entitlement checks preventing cross-tenant leakage**:
  - Any request not authenticated as a platform user is treated as “not found” (deny-as-not-found).
  - Listing and detail access is always gated by capabilities (view vs manage) and only exposes non-sensitive metadata.

## Clarifications

### Session 2026-02-27

- Q: Which session isolation should v1 implement for `/system` (SR-006)? → A: Same domain, but separate session cookie name for `/system`.
- Q: Should manage actions (Retry/Cancel/Mark investigated) be active in v1? → A: Yes. `platform.operations.manage` is in v1 with: Retry (retryable only), Cancel (supported only), Mark investigated (reason required).
- Q: Which 404 vs 403 semantics apply for `/system`? → A: Non-platform / wrong guard returns 404; platform user missing capability returns 403.
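
The 404 vs 403 answer above can be read as a small decision function. The following is a minimal sketch in Python, not the project's implementation; `Request`, `guard`, and `resolve_status` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical view of a request: which guard authenticated it, and its capabilities."""
    guard: Optional[str]                  # e.g. "platform", "web", or None if unauthenticated
    capabilities: frozenset = frozenset()


def resolve_status(request: Request, required_capability: str) -> int:
    """Return the HTTP status a /system/* route would answer with."""
    # Wrong guard or unauthenticated: deny-as-not-found so the console is not revealed.
    if request.guard != "platform":
        return 404
    # Authenticated platform user who lacks the capability: explicit 403.
    if required_capability not in request.capabilities:
        return 403
    return 200
```

Note that the guard check runs first, so a non-platform user never receives a 403 that would confirm the route exists.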

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Global Health & Triage Entry (Priority: P1)

As a platform operator, I want a single Control Tower view that summarizes platform health and routes me to the most urgent issues, so I can triage failures quickly without exposing customer-sensitive data.

**Why this priority**: This is the primary operator workflow (“what’s broken right now?”) and the first screen that enables faster incident response.

**Independent Test**: A platform user can open the Control Tower, see KPIs/top offenders for a selected time window, and click through to a canonical run detail page.

**Acceptance Scenarios**:

1. **Given** a platform user with `platform.console.view`, **When** they open the Control Tower, **Then** they see KPI counts and “Top offenders” summaries for the selected time window.
2. **Given** a failed operation exists, **When** they click a “recently failed operation”, **Then** they land on the canonical run detail page.
3. **Given** a non-platform user, **When** they request any `/system/*` URL, **Then** the system does not reveal that the console exists.

---

### User Story 2 - Directory for Workspaces & Tenants (Priority: P2)

As a platform support engineer, I want a directory of workspaces and tenants with health signals and recent activity, so I can route issues to the right tenant/workspace and quickly inspect recent operations.

**Why this priority**: Most incidents are tenant-scoped; fast routing depends on a reliable cross-tenant directory with minimal data exposure.

**Independent Test**: A platform user can list workspaces/tenants, open details, and jump to run listings filtered to that tenant/workspace.

**Acceptance Scenarios**:

1. **Given** a platform user with `platform.directory.view`, **When** they view the Workspaces index, **Then** they can sort and filter by health and activity, and navigate to workspace details.
2. **Given** a tenant, **When** they view tenant details, **Then** they see connectivity/permissions status and recent operations as metadata-only summaries.
3. **Given** the UI provides an “Open in /admin” link, **When** a platform user clicks it, **Then** it is a plain URL only (no auto-login, no session bridging).

---

### User Story 3 - Operations Triage Actions & Auditability (Priority: P3)

As a privileged platform operator, I want to take safe triage actions on failed or stuck operation runs (retry/cancel/mark investigated), so I can restore platform health with guardrails and complete audit trails.

**Why this priority**: Operational actions are high-risk; they must be permission-gated and auditable.

**Independent Test**: A platform user with `platform.operations.manage` can perform an allowed triage action and observe that it is recorded, while a view-only user cannot.

**Acceptance Scenarios**:

1. **Given** a platform user without `platform.operations.manage`, **When** they view failures/stuck runs, **Then** they can inspect but cannot execute triage actions.
2. **Given** a platform user with `platform.operations.manage`, **When** they retry a retryable run, **Then** a new run is initiated and linked to the original for traceability.
3. **Given** a triage action is destructive or high blast-radius, **When** the operator attempts it, **Then** they must explicitly confirm (and provide a reason where required) before it executes.

### Edge Cases

- Large volumes of runs and tenants: list pages still load within an acceptable wait time and do not degrade into partial/inconsistent results.
- Missing or unknown health inputs: health is shown as “Unknown” or equivalent, not as a false “OK”.
- Stuck classification boundaries: a run right on the threshold is classified consistently.
- Sanitization: error/context summaries never reveal tokens, secrets, or policy payloads.
- Break-glass mode: all pages show an unmistakable banner and actions include the break-glass marker.
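
To make the "stuck classification boundaries" edge case concrete, here is a hedged Python sketch of a classifier. The threshold values, field names, and the choice of an inclusive comparison (`>=`) are assumptions for illustration; the spec only requires that the thresholds are configurable and that a run exactly on the threshold is classified consistently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class StuckThresholds:
    # Illustrative defaults only; real values would come from configuration.
    queued: timedelta = timedelta(minutes=15)
    running: timedelta = timedelta(minutes=60)


def classify_stuck(status: str, created_at: datetime, started_at: Optional[datetime],
                   now: datetime, thresholds: StuckThresholds) -> Optional[str]:
    """Return "queued_too_long", "running_too_long", or None if the run is not stuck."""
    # Inclusive comparison: a run exactly on the threshold always counts as stuck.
    if status == "queued" and now - created_at >= thresholds.queued:
        return "queued_too_long"
    if status == "running" and started_at is not None \
            and now - started_at >= thresholds.running:
        return "running_too_long"
    return None
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps list views and detail views consistent when they evaluate the same run within one request.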

## Requirements *(mandatory)*

### Functional Requirements

- **FR-001 — Control Tower Dashboard (Global Health)**: The system MUST provide a Control Tower dashboard showing platform health within a selectable time window (default 24h; options include 1h/24h/7d), including KPI counts and “Top offenders” summaries.
- **FR-002 — Directory: Workspaces**: The system MUST provide a Workspaces index and workspace detail view that shows tenant counts, a health badge (OK/Warn/Critical/Unknown), last activity, and quick links to relevant views.
- **FR-003 — Directory: Tenants**: The system MUST provide a Tenants index and tenant detail view that shows provider connectivity status, permissions status, last sync/compare summaries as counts/metadata only, and runbook shortcuts where available.
- **FR-004 — Operations: Global Runs + Canonical Run Detail**: The system MUST provide a global operation runs view with filtering (status/type/workspace/tenant/time window/actor) and a single canonical run detail page used by all “View run” links.
- **FR-005 — Failures View (Prefiltered)**: The system MUST provide a failures view that prefilters to failed runs and groups/summarizes failures by run type and by tenant, enabling 1–2 click routing into run details.
- **FR-006 — Stuck Runs Definition & View**: The system MUST define and surface “stuck” runs based on configurable thresholds for “queued too long” and “running too long”, and present an operator view for investigating them. Any triage actions available from this surface MUST follow FR-006a.
- **FR-006a — Triage Actions (v1 enabled)**: For operators with `platform.operations.manage`, the system MUST provide triage actions in failures/stuck/run detail views, constrained as follows: Retry is available only for retryable run types; Cancel is available only where the run supports cancellation; “Mark investigated” requires a reason/note.
- **FR-007 — Runbook Shortcuts Integration**: The system MUST provide navigation to runbooks from the System Console navigation. The UI MAY provide scope-aware shortcuts from tenant/workspace/run details. If runbooks are not available yet, the UI MAY show “coming soon” placeholders.
- **FR-008 — Access Logs (Security, minimal v1)**: The system MUST provide an access log view for platform users that supports filtering by user/time/outcome and includes login successes/failures and break-glass activation events.
- **FR-009 — Export (optional)**: The system MAY allow exporting filtered run metadata as CSV without including sensitive context. (Deferred in v1.)
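
The constraints in FR-006a can be sketched as two small checks: which actions to offer, and whether a submitted action is valid. This Python example is illustrative only; `RunInfo`, its fields, and the action names are hypothetical, and status preconditions (for example, whether a run is currently failed or stuck) are deliberately left out.

```python
from dataclasses import dataclass

MANAGE = "platform.operations.manage"


@dataclass(frozen=True)
class RunInfo:
    """Hypothetical projection of a run; both flags are assumed to derive from the run type."""
    retryable: bool
    cancelable: bool


def available_actions(run: RunInfo, capabilities: frozenset) -> list:
    """List the triage actions to offer for this run and operator."""
    if MANAGE not in capabilities:
        return []                       # view-only operators can inspect but not act
    actions = []
    if run.retryable:
        actions.append("retry")
    if run.cancelable:
        actions.append("cancel")
    actions.append("mark_investigated")
    return actions


def validate_action(action: str, reason: str = "") -> None:
    """Reject a 'mark_investigated' submission that carries no reason."""
    if action == "mark_investigated" and not reason.strip():
        raise ValueError("'mark_investigated' requires a reason")
```

Hiding an action in the UI is not sufficient on its own: the same capability and constraint checks would need to run again server-side when the action is executed.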

### Security, Privacy, and Guardrails

- **SR-001 — Guard Isolation**: `/system` MUST be accessible exclusively to platform users; non-platform access (wrong guard or unauthenticated) MUST behave as “not found” and MUST NOT reveal the presence of the console.
- **SR-001a — 404 vs 403 Semantics**: The system MUST apply the following response semantics consistently across `/system/*`:
  - Wrong guard / unauthenticated / not a platform user → 404 (deny-as-not-found)
  - Platform user authenticated but missing required capability → 403
- **SR-002 — Authentication Hardening**: The system MUST throttle excessive `/system` login attempts and MUST record failed attempts for later review. v1 throttle policy is: max 10 failed attempts per 60 seconds per `ip + email` (throttle key: `system-login:{ip}:{normalizedEmail}`), recording `reason` (e.g., `invalid_credentials`, `inactive`, `throttled`) under the `platform.auth.login` audit action.
- **SR-003 — Data Minimization by Default**: `/system` MUST avoid sensitive content by default (no raw policy payloads, secrets, tokens, or PII). Only counts, status badges, and sanitized summaries are shown.
- **SR-004 — Sensitive Drilldowns**: v1 MUST NOT provide raw error/context payload inspection in `/system`. If raw inspection is introduced later, it MUST be restricted behind elevated capability and require an operator-provided reason.
- **SR-005 — Break-Glass Guardrails**: When break-glass mode is active, the UI MUST show a persistent banner, require a reason, and annotate actions/logs as break-glass.
- **SR-006 — Session Isolation**: `/system` MUST use a separate session cookie name (distinct from `/admin`) to reduce cross-plane session coupling. `/system` MUST NOT reuse the customer/admin session cookie.
- **SR-007 — Manage Action Guardrails**: Any triage action that mutates state (retry/cancel/mark investigated) MUST be restricted to `platform.operations.manage`, MUST require explicit confirmation, and MUST record an audit trail including actor, scope, target run, and operator-provided reason where applicable.
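
The SR-002 throttle policy (10 failed attempts per 60 seconds per `ip + email`) could look roughly like the following in-memory sliding-window sketch. It is written in Python for illustration and is not the project's implementation: the email normalization (trim plus lowercase), the `LoginThrottle` class, and the rule that the next attempt is blocked once 10 failures sit inside the window are all assumptions.

```python
import time
from collections import defaultdict, deque

MAX_FAILED_ATTEMPTS = 10
WINDOW_SECONDS = 60


def throttle_key(ip: str, email: str) -> str:
    # Assumed normalization; the spec only names `normalizedEmail` without defining it.
    return f"system-login:{ip}:{email.strip().lower()}"


class LoginThrottle:
    """Counts failed logins per throttle key within a sliding time window."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = defaultdict(deque)   # key -> timestamps of recent failures

    def _prune(self, key: str) -> deque:
        # Drop failures that have aged out of the window.
        cutoff = self._clock() - WINDOW_SECONDS
        attempts = self._failures[key]
        while attempts and attempts[0] <= cutoff:
            attempts.popleft()
        return attempts

    def is_throttled(self, key: str) -> bool:
        return len(self._prune(key)) >= MAX_FAILED_ATTEMPTS

    def record_failure(self, key: str) -> None:
        self._prune(key).append(self._clock())
```

A caller would check `is_throttled` before verifying credentials and, when it returns true, record the attempt with reason `throttled` under the `platform.auth.login` audit action. A production version would need shared storage rather than process memory.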

### Assumptions

- A platform operator console exists as a separate plane from customer administration, and customer users must never see maintenance/ops screens.
- Operation execution is routed through a single auditable run model (operator actions are “initiated” and traceable).
- Health statuses are computed from multiple signals using a “worst wins” rule.
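
The "worst wins" rule can be sketched as taking the maximum over an ordered severity scale. In this Python illustration the relative rank of `UNKNOWN` (above `OK`, below `WARN`) is an assumption, chosen so that a missing input never yields a false "OK" while a real warning or critical signal still dominates; the spec does not fix that ordering.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Health(IntEnum):
    # Higher value = worse. The position of UNKNOWN is an assumption of this sketch.
    OK = 0
    UNKNOWN = 1
    WARN = 2
    CRITICAL = 3


def overall_health(signals: Iterable[Optional[Health]]) -> Health:
    """Combine individual signals into one badge using 'worst wins'."""
    # A missing signal (None) is treated as UNKNOWN rather than silently ignored.
    resolved = [Health.UNKNOWN if s is None else s for s in signals]
    if not resolved:
        return Health.UNKNOWN           # no inputs at all: never report OK
    return max(resolved)
```

For example, `overall_health([Health.OK, None])` yields `Health.UNKNOWN`, while `overall_health([None, Health.CRITICAL])` yields `Health.CRITICAL`.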

## UI Action Matrix *(mandatory when System Console UI is changed)*

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Control Tower Dashboard | `/system/dashboard` | Time window switcher | “Recently failed operations” items link to run detail | None | None | View operation runs | N/A | N/A | Yes (access) | Read-only KPIs and offender summaries; no sensitive payloads |
| Workspaces Index | `/system/directory/workspaces` | None | Click workspace name to open details | None | None | Clear filters | N/A | N/A | Yes (access) | Supports sort/filter by health/activity/tenant count |
| Workspace Detail | `/system/directory/workspaces/{workspace}` | “View tenants”, “View runs (filtered)” | Tenant list items link to tenant detail; runs link to canonical run detail | None | None | View runs (filtered) | N/A | N/A | Yes (access) | “Open in /admin” is URL-only; no session bridging |
| Tenants Index | `/system/directory/tenants` | None | Click tenant name to open details | None | None | Clear filters | N/A | N/A | Yes (access) | Shows health signals as badges and counts |
| Tenant Detail | `/system/directory/tenants/{tenant}` | Runbook shortcuts (if entitled) | Recent operations list links to canonical run detail | Optional: “Run health check” / “Run sync” (max 2 visible; can be “coming soon”) | None | View operation runs | N/A | N/A | Yes | Runbook actions require confirmation and capability gating |
| Operation Runs | `/system/ops/runs` | Filters | Row click or “View run” link to canonical run detail | “Retry” / “Cancel” (manage only; availability depends on run type/support) | “Retry selected”, “Cancel selected” (manage only; constrained) | Clear filters | N/A | N/A | Yes | Actions require explicit confirmation, may require reason, and are fully auditable |
| Run Detail (Canonical) | `/system/ops/runs/{run}` | “Related tenant/workspace”, “Similar failures”, “Go to runbooks” | Links to filtered views | “Retry” / “Cancel” (manage only; constrained) | None | N/A | N/A | N/A | Yes | Context/error panels are sanitized by default in v1 (raw drilldowns are not available in v1) |
| Failures View | `/system/ops/failures` | Filters | Links to canonical run detail and tenant/workspace | “Retry” (manage only; retryable only) | “Retry selected” (manage only; retryable only) | Clear filters | N/A | N/A | Yes | Prefiltered to failed runs; 1–2 click routing |
| Stuck Runs View | `/system/ops/stuck` | Filters | Links to canonical run detail | “Cancel”, “Mark investigated” (manage only; cancel only if supported) | “Cancel selected” (manage only; constrained) | Clear filters | N/A | N/A | Yes | “Mark investigated” requires a note/reason |
| Access Logs | `/system/security/access-logs` | Filters | None | None | None | Clear filters | N/A | N/A | Yes | Minimal v1 security visibility |

**Audit log interpretation**: In this matrix, “Audit log?” means security/audit events are visible via the Access Logs surface (login successes/failures, break-glass activation, and operator triage actions). It does not imply per-page view logging for every `/system` page.

### Key Entities *(include if feature involves data)*

- **Operation Run**: An auditable record of an operational activity, including type, scope (platform/workspace/tenant), actor, start/end timestamps, status/outcome, and a sanitized summary.
- **Workspace**: A customer workspace container, used for grouping tenants and operational scope.
- **Tenant**: A customer tenant within a workspace, including provider connectivity and governance signal summaries.
- **Platform User**: An internal operator identity with capability-based authorization.
- **Access Log**: A record of platform access and authentication-related security events.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: Platform operators can identify the top failing tenant and open the related canonical run detail in ≤ 2 clicks from the Control Tower.
- **SC-002**: The Control Tower and directory pages load in p95 < 1.0s for typical production volumes.
- **SC-003**: In a structured review of the `/system` UI, no customer-sensitive payloads (policy content, secrets, tokens, PII) are visible by default.
- **SC-004**: 100% of operator triage actions (retry/cancel/mark investigated) are permission-gated and leave a complete audit trail.
- **SC-005**: Non-platform users cannot discover `/system` routes via direct URL guessing (console behaves as not found).