TenantAtlas/specs/115-baseline-operability-alerts/spec.md

# Feature Specification: Baseline Operability & Alert Integration (R1.1–R1.4 Extension)

**Feature Branch**: `115-baseline-operability-alerts`
**Created**: 2026-02-28
**Status**: Ready (Design complete; implementation pending)
**Input**: User description: "115 — Baseline Operability & Alert Integration (R1.1–R1.4 Extension)"

## Spec Scope Fields *(mandatory)*

- **Scope**: workspace (management) + tenant (monitoring)
- **Primary Routes**:
  - Workspace (admin): Baselines management (Baseline Profiles) and Workspace Settings
  - Tenant-context (admin): Baseline Compare monitoring and Baseline Compare “run now” surface
- **Data Ownership**:
  - Workspace-owned: Baseline profiles, baseline-to-tenant assignments, workspace settings, alert rules
  - Tenant-scoped (within a workspace): Findings produced by baseline compare; operation runs for tenant compares
- **RBAC**:
  - Workspace Baselines (view/manage): workspace members must be granted the workspace baselines view/manage capabilities (from the canonical capability registry)
  - Workspace Settings (view/manage): workspace members must be granted the workspace settings view/manage capabilities (from the canonical capability registry)
  - Alerts (view/manage): workspace members must be granted the alerts view/manage capabilities (from the canonical capability registry)
  - Tenant monitoring surfaces require tenant access (tenant view) in addition to workspace membership

For canonical-view specs: not applicable (this is not a canonical-view feature).

## Clarifications

### Session 2026-02-28

- Q: What should `baseline.severity_mapping` map from? → A: Baseline drift `change_type` only (keys: `missing_policy`, `different_version`, `unexpected_policy`).
- Q: What is the canonical “fully successful compare” gate for auto-close? → A: Outcome `succeeded` AND Ops-UX canonical `OperationRun.summary_counts` gate: `summary_counts.processed == summary_counts.total` AND `summary_counts.failed == 0`.
- Q: When a previously resolved baseline finding reappears, what status should it transition to? → A: `reopened`.
- Q: For `baseline_compare_failed` alerts, what cooldown behavior applies? → A: Use the existing dispatcher cooldown (no baseline-specific cooldown setting).
- Q: Should `baseline.auto_close_enabled` exist as a kill-switch? → A: Yes; keep it with default `true`.

### Definitions

- **Baseline finding**: A drift finding produced by comparing a tenant against a baseline.
- **Fingerprint**: A stable identifier for “the same underlying issue” across runs, used for idempotency and alert deduplication.
- **Fully successful compare**: A compare run that succeeded and is complete (no failed items and all expected items processed).
  - Completeness is proven via Ops-UX canonical `OperationRun.summary_counts` counters: `summary_counts.processed == summary_counts.total` and `summary_counts.failed == 0`.
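
The completeness gate defined above can be sketched as a single predicate. This is a minimal illustration, not the project's implementation: the `OperationRun` dataclass is a hypothetical stand-in for the real record, and treating a missing counter as "not proven complete" is an assumption rather than something this spec states.

```python
from dataclasses import dataclass, field


@dataclass
class OperationRun:
    """Hypothetical stand-in for the real OperationRun record."""
    outcome: str
    summary_counts: dict = field(default_factory=dict)


def is_fully_successful(run: OperationRun) -> bool:
    """Return True only when the run succeeded and the counters prove completeness."""
    counts = run.summary_counts
    # Assumption: a missing counter means completeness cannot be proven.
    if any(key not in counts for key in ("processed", "total", "failed")):
        return False
    return (
        run.outcome == "succeeded"
        and counts["failed"] == 0
        and counts["processed"] == counts["total"]
    )
```

Failing closed on missing counters keeps the gate conservative: auto-close never runs on a compare whose completeness is unknown.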

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Safe auto-close removes stale baseline drift (Priority: P1)

As an MSP operator, I want baseline drift findings to automatically resolve when drift is no longer present, so the findings list remains actionable and doesn’t accumulate noise.

**Why this priority**: This is the main “operability” gap; without auto-close, drift remediation cannot be reliably observed, and alerting becomes noisy.

**Independent Test**: Can be fully tested by running a baseline compare that produces a “seen set” of fingerprints and verifying that previously-open baseline findings not present in the seen set are resolved only when the compare is fully successful.

**Acceptance Scenarios**:

1. **Given** an open baseline finding with `source = baseline.compare`, **When** a baseline compare completes fully successfully and that finding’s fingerprint is not present in the current compare result, **Then** the finding becomes `resolved` with reason `no_longer_drifting`.
2. **Given** an open baseline finding with `source = baseline.compare`, **When** a baseline compare run ends with outcome `failed` or `partially_succeeded`, or is otherwise incomplete, **Then** no baseline findings are auto-resolved.
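
The two scenarios above amount to a guarded set difference between open findings and the run's seen set. The sketch below is illustrative only: the `Finding` dataclass, its field names, and the `OPEN_STATUSES` set are assumptions, and the `fully_successful` flag stands for the result of the completeness gate rather than computing it.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed set of actionable open states; the real status list may differ.
OPEN_STATUSES = {"new", "open", "reopened"}


@dataclass
class Finding:
    """Hypothetical in-memory finding; field names are illustrative."""
    fingerprint: str
    source: str
    status: str
    resolved_reason: Optional[str] = None


def auto_close(
    findings: Iterable[Finding],
    seen_fingerprints: Iterable[str],
    fully_successful: bool,
    auto_close_enabled: bool = True,
) -> int:
    """Resolve open baseline findings absent from the seen set; return the count."""
    # Guardrail: never resolve anything after a failed, partial, or incomplete run,
    # or when the workspace kill-switch is off.
    if not (fully_successful and auto_close_enabled):
        return 0
    seen = set(seen_fingerprints)
    closed = 0
    for finding in findings:
        if finding.source != "baseline.compare":
            continue  # only findings owned by baseline compare are eligible
        if finding.status not in OPEN_STATUSES:
            continue  # already resolved or otherwise closed
        if finding.fingerprint in seen:
            continue  # drift is still present
        finding.status = "resolved"
        finding.resolved_reason = "no_longer_drifting"
        closed += 1
    return closed
```

Filtering on `source` keeps the operation from touching findings produced by other detectors that happen to share the same table.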

---

### User Story 2 - Baseline alerts are precise and deduplicated (Priority: P1)

As an on-call operator, I want alerts for baseline drift and baseline compare failures to trigger only when there is new actionable work (new or reopened findings) or when a compare fails, so I don’t get spammed on every run.

**Why this priority**: MSP sellability depends on trust in alerts; repeated “same problem” alerts make alerting unusable.

**Independent Test**: Can be fully tested by creating findings with controlled timestamps and statuses (new/reopened/open) and verifying that only new/reopened findings generate baseline drift alert events, while repeated compares do not.

**Acceptance Scenarios**:

1. **Given** a new high-severity baseline drift finding (deduped by fingerprint) created after the alert window start, **When** alerts are evaluated, **Then** a `baseline_high_drift` event is produced exactly once.
2. **Given** a resolved baseline drift finding that later reappears and transitions to `reopened`, **When** alerts are evaluated, **Then** a `baseline_high_drift` event is produced again exactly once.
3. **Given** a baseline compare run that completes with outcome `failed` or `partially_succeeded`, **When** alerts are evaluated, **Then** a `baseline_compare_failed` event is produced (deduped by run identity + cooldown).
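
One way to satisfy scenarios 1 and 2 is to key each drift event on the fingerprint plus the moment the current open cycle began, so a reopen yields a fresh key while repeated compares do not. The sketch below assumes this key shape, the `Finding` fields, and an in-memory `already_sent` set; none of these are defined by this spec, and severity filtering is omitted for brevity.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Finding:
    """Hypothetical finding record; field names are illustrative."""
    fingerprint: str
    status: str  # assumed values: "new", "open", "reopened", "resolved"
    created_at: datetime
    reopened_at: Optional[datetime] = None


def drift_events(findings, window_start: datetime, already_sent: set) -> list:
    """Produce baseline_high_drift events for findings newly created or reopened in the window."""
    events = []
    for finding in findings:
        if finding.status == "new":
            opened_at = finding.created_at
        elif finding.status == "reopened":
            opened_at = finding.reopened_at
        else:
            continue  # plain "open" or "resolved" findings are not new actionable work
        if opened_at is None or opened_at < window_start:
            continue
        # Assumed key shape: fingerprint plus the start of the current open cycle.
        key = f"baseline_high_drift:{finding.fingerprint}:{opened_at.isoformat()}"
        if key in already_sent:
            continue
        already_sent.add(key)
        events.append({
            "type": "baseline_high_drift",
            "dedup_key": key,
            "fingerprint": finding.fingerprint,
        })
    return events
```

In a real system the `already_sent` set would be whatever persistent dedup store the existing alert dispatcher provides.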

---

### User Story 3 - Workspace-controlled severity mapping and alert threshold (Priority: P2)

As a workspace admin, I want to configure how baseline drift categories map to severity, and optionally the minimum severity that triggers baseline drift alerts, so the system matches the MSP’s operational standards.

**Why this priority**: Settings are required for enterprise adoption; hardcoded severity and alert thresholds don’t fit different environments.

**Independent Test**: Can be fully tested by setting workspace overrides and verifying that newly created baseline findings inherit the configured severity and that alert generation respects the configured minimum severity.

**Acceptance Scenarios**:

1. **Given** a workspace-level baseline severity mapping override, **When** a new baseline drift finding is created, **Then** it uses the mapped severity (and rejects invalid severity values).
2. **Given** a workspace-level baseline alert minimum severity override, **When** baseline findings are evaluated for alerts, **Then** only findings meeting or exceeding that threshold emit `baseline_high_drift` events.
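
The validation and threshold behavior in these scenarios can be sketched as follows. The allowed keys, the severity ordering, and the `high` default threshold come from this spec; the `DEFAULT_MAPPING` values are hypothetical placeholders, since the spec does not state the system defaults, and the function names are illustrative.

```python
SEVERITIES = ("low", "medium", "high", "critical")  # canonical ordering, lowest first
CHANGE_TYPES = {"missing_policy", "different_version", "unexpected_policy"}

# Hypothetical system defaults; the spec does not state the default mapping.
DEFAULT_MAPPING = {
    "missing_policy": "high",
    "different_version": "medium",
    "unexpected_policy": "low",
}


def validate_mapping(overrides: dict) -> None:
    """Raise ValueError for unknown change types or invalid severities."""
    unknown = set(overrides) - CHANGE_TYPES
    if unknown:
        raise ValueError(f"unknown change_type keys: {sorted(unknown)}")
    invalid = [value for value in overrides.values() if value not in SEVERITIES]
    if invalid:
        raise ValueError(f"invalid severity values: {invalid}")


def effective_mapping(overrides: dict) -> dict:
    """System defaults overlaid with validated workspace overrides."""
    validate_mapping(overrides)
    return {**DEFAULT_MAPPING, **overrides}


def meets_threshold(severity: str, minimum: str = "high") -> bool:
    """Inclusive comparison: a severity equal to the minimum qualifies."""
    return SEVERITIES.index(severity) >= SEVERITIES.index(minimum)
```

Validating before merging means a rejected payload never reaches the effective values, which matches the requirement that a failed save leaves them unchanged.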

### Edge Cases

- Compare is not “fully successful” (e.g., `summary_counts.failed > 0`, or incomplete processing where `summary_counts.processed != summary_counts.total`): auto-close MUST NOT occur. The same holds when compare preconditions prevent the run from being created, because no run exists to trigger auto-close.
- Compare does not evaluate all assigned items (e.g., missing baseline snapshot or assignment changes mid-run): auto-close MUST NOT resolve findings for items not evaluated.
- A baseline finding was resolved previously and reappears later: it MUST transition to `reopened` and be eligible for alerting exactly once per reopen.
- Workspace settings payload is malformed (unknown drift categories or invalid severity values): save MUST be rejected and effective values MUST remain unchanged.
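
The reappearance edge case above can be expressed as a small status transition applied whenever a fingerprint is seen in a compare. This is a sketch under assumptions: the status names other than `resolved` and `reopened` are illustrative, and leaving an already-open finding unchanged is an assumed behavior, not one this spec defines.

```python
from typing import Optional


def next_status(current: Optional[str]) -> str:
    """Status for a fingerprint that appears in the current compare result."""
    if current is None:
        return "new"  # first time this fingerprint is seen
    if current == "resolved":
        return "reopened"  # required state for a reappearing finding
    return current  # assumed: an already-open finding keeps its status


def is_new_actionable_work(previous: Optional[str], current: str) -> bool:
    """True only on the transition into "new" or "reopened", so alerts fire once."""
    return current in {"new", "reopened"} and previous != current
```

For example, a resolved finding that reappears yields `reopened` and counts as new actionable work, while the same finding on the following compare stays `reopened` and does not.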

## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any Microsoft Graph calls, any write/change behavior,
or any long-running/queued/scheduled work, the spec MUST describe contract registry updates, safety gates
(preview/confirmation/audit), tenant isolation, run observability (`OperationRun` type/identity/visibility), and tests.
If security-relevant DB-only actions intentionally skip `OperationRun`, the spec MUST describe `AuditLog` entries.

**Constitution alignment (OPS-UX):** If this feature creates/reuses an `OperationRun`, the spec MUST:
- explicitly state compliance with the Ops-UX 3-surface feedback contract (toast intent-only, progress surfaces, terminal DB notification),
- state that `OperationRun.status` / `OperationRun.outcome` transitions are service-owned (only via `OperationRunService`),
- describe how `summary_counts` keys/values comply with `OperationSummaryKeys::all()` and numeric-only rules,
- clarify scheduled/system-run behavior (initiator null → no terminal DB notification; audit is via Monitoring),
- list which regression guard tests are added/updated to keep these rules enforceable in CI.

**Constitution alignment (RBAC-UX):** If this feature introduces or changes authorization behavior, the spec MUST:
- state which authorization plane(s) are involved (tenant/admin `/admin` + tenant-context `/admin/t/{tenant}/...` vs platform `/system`),
- ensure any cross-plane access is deny-as-not-found (404),
- explicitly define 404 vs 403 semantics:
  - non-member / not entitled to workspace scope OR tenant scope → 404 (deny-as-not-found)
  - member but missing capability → 403
- describe how authorization is enforced server-side (Gates/Policies) for every mutation/operation-start/credential change,
- reference the canonical capability registry (no raw capability strings; no role-string checks in feature code),
- ensure global search is tenant-scoped and non-member-safe (no hints; inaccessible results treated as 404 semantics),
- ensure destructive-like actions require confirmation (`->requiresConfirmation()`),
- include at least one positive and one negative authorization test, and note any RBAC regression tests added/updated.

**Constitution alignment (OPS-EX-AUTH-001):** OIDC/SAML login handshakes may perform synchronous outbound HTTP (e.g., token exchange)
on `/auth/*` endpoints without an `OperationRun`. This MUST NOT be used for Monitoring/Operations pages.

**Constitution alignment (BADGE-001):** If this feature changes status-like badges (status/outcome/severity/risk/availability/boolean),
the spec MUST describe how badge semantics stay centralized (no ad-hoc mappings) and which tests cover any new/changed values.

**Constitution alignment (Filament Action Surfaces):** If this feature adds or modifies any Filament Resource / RelationManager / Page,
the spec MUST include a “UI Action Matrix” (see below) and explicitly state whether the Action Surface Contract is satisfied.
If the contract is not satisfied, the spec MUST include an explicit exemption with rationale.

**Constitution alignment (UX-001: Layout & Information Architecture):** If this feature adds or modifies any Filament screen,
the spec MUST describe compliance with UX-001: Create/Edit uses Main/Aside layout (3-col grid), all fields inside Sections/Cards
(no naked inputs), View pages use Infolists (not disabled edit forms), status badges use BADGE-001, empty states have a specific
title + explanation + exactly 1 CTA, and tables provide search/sort/filters for core dimensions.
If UX-001 is not fully satisfied, the spec MUST include an explicit exemption with documented rationale.

### Functional Requirements

- **FR-001 (Finding source contract)**: Findings created by baseline compare MUST be identifiable as baseline compare findings via a stable source identifier (`source = baseline.compare`).

- **FR-002 (Fully successful guardrail)**: Auto-close MUST run only after a fully successful baseline compare run. “Fully successful” means all of the following:
  - The compare run outcome is `succeeded`.
  - The compare run emits Ops-UX canonical completeness counters in `OperationRun.summary_counts`, and they indicate:
    - `summary_counts.failed == 0`
    - `summary_counts.processed == summary_counts.total`
  - Compare preconditions (e.g., missing active baseline snapshot, missing assignment) are enforced before enqueue and MUST prevent a compare run from being created; therefore auto-close cannot run when preconditions fail.
  - **Implementation note (required invariant)**: precondition failures are returned as stable reason codes and MUST result in **no `OperationRun` being created**.

- **FR-003 (Safe auto-close behavior)**: After a fully successful compare, the system MUST resolve open baseline compare findings that are not present in the current run’s seen set.
- **FR-004 (No partial resolution)**: The system MUST NOT resolve findings for any items that were not evaluated in the run.
- **FR-005 (Resolution reason)**: Auto-resolved findings MUST record resolution reason `no_longer_drifting`.

- **FR-006 (Reopen semantics)**: If a previously resolved baseline compare finding reappears in a later compare, it MUST transition to `reopened` and be treated as “new actionable work” for alerting.

- **FR-007 (Alert event type: baseline drift)**: The system MUST support an alert event type `baseline_high_drift` for baseline compare findings.
- **FR-008 (Alert producer: baseline drift)**: Baseline drift alert events MUST be produced only for baseline compare findings that are actionable (open states) AND are either newly created or newly reopened within the evaluation window.
- **FR-009 (Baseline drift deduplication)**: Baseline drift alert events MUST be deduplicated by a stable key derived from the finding fingerprint. The same open finding MUST NOT emit repeated events on subsequent compares.

- **FR-010 (Alert event type: compare failed)**: The system MUST support an alert event type `baseline_compare_failed`.
- **FR-011 (Alert producer: compare failed)**: A `baseline_compare_failed` event MUST be produced when a baseline compare run completes with outcome `failed` or `partially_succeeded`.
- **FR-012 (Compare-failed dedup + cooldown)**: Compare-failed events MUST be deduplicated per run identity and MUST respect existing cooldown/quiet-hours behavior.
  - This feature MUST NOT introduce a baseline-specific cooldown interval; it reuses the existing dispatcher cooldown behavior.

- **FR-013 (Canonical run types)**: Baseline capture and baseline compare MUST use centrally defined canonical run types (`baseline_capture`, `baseline_compare`) and MUST NOT rely on ad-hoc string literals.

- **FR-014 (Workspace settings: severity mapping)**: The system MUST support a workspace setting `baseline.severity_mapping` that maps baseline drift categories to severity.
- **FR-015 (Workspace settings: validation)**: The severity mapping MUST:
  - accept only the baseline drift `change_type` keys `missing_policy`, `different_version`, and `unexpected_policy` (no other keys),
  - reject invalid severity values,
  - and expose “effective value” behavior (system defaults + workspace overrides).
- **FR-016 (Workspace settings: alert threshold)**: The system MUST support a workspace setting `baseline.alert_min_severity` with allowed values `low`, `medium`, `high`, and `critical`, defaulting to `high`.
  - Severity threshold comparison MUST use the canonical severity ordering `low < medium < high < critical`. The comparison is inclusive: a finding whose severity equals the threshold qualifies.
- **FR-017 (Workspace settings: auto-close toggle)**: The system MUST support a workspace setting `baseline.auto_close_enabled` defaulting to true.
  - When set to `false`, auto-close MUST be skipped even if the compare is fully successful.

- **FR-018 (Information architecture / ownership)**: Baseline Profile CRUD MUST remain workspace-owned. It MUST NOT appear tenant-scoped, MUST NOT show tenant scope banners, and MUST NOT be reachable from tenant-only navigation.
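
FR-010 through FR-013 can be illustrated with a small producer for compare-failed events. This is a sketch, not the project's alert API: the function name, the event dictionary shape, and the dedup key format are assumptions, and cooldown and quiet hours are deliberately left to the existing dispatcher as FR-012 requires.

```python
from typing import Optional

# Canonical run type and the outcomes that count as a failed compare, per this spec.
BASELINE_COMPARE = "baseline_compare"
FAILED_OUTCOMES = {"failed", "partially_succeeded"}


def compare_failed_event(run_id: str, run_type: str, outcome: str) -> Optional[dict]:
    """Return a baseline_compare_failed event for a failed compare run, else None."""
    if run_type != BASELINE_COMPARE or outcome not in FAILED_OUTCOMES:
        return None
    return {
        "type": "baseline_compare_failed",
        # Assumed key format: one event per run identity.
        "dedup_key": f"baseline_compare_failed:{run_id}",
        "run_id": run_id,
    }
```

Keying on run identity means re-evaluating the same failed run never emits a second event, while a later failed run gets its own key and is then subject to the dispatcher's cooldown.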

#### Assumptions & Dependencies

- Baseline compare already produces stable finding fingerprints and a per-run Ops-UX `summary_counts` payload that can express completeness (`processed` vs `total`) and failures (`failed`).
- Findings support lifecycle transitions including resolve with a reason and reopen semantics for a recurring fingerprint.
- Alert dispatch already supports deduplication, cooldown, and quiet hours; this feature reuses that behavior for new baseline-specific event types.

#### Constitution Alignment Notes (non-functional but mandatory)

- This feature adds no new Microsoft Graph calls.
- Baseline compare and alert evaluation are long-running operations; any new auto-close and alert integration MUST preserve tenant isolation and run observability.

- **Ops-UX (3-surface feedback)**: baseline compare/capture and alerts evaluation must continue to provide:
  - an intent-only toast on start,
  - progress surfaces (Operations pages),
  - and terminal DB notifications where applicable.

- **Operation run ownership**: Operation status/outcome transitions are owned by the operations subsystem and must not be mutated directly by UI code.
- **Summary counts contract**: Any summary counters produced/updated by this feature MUST use the canonical summary key registry and numeric-only values.
- **Scheduled/system runs**: Runs initiated without a human initiator MUST NOT produce terminal DB notifications; monitoring remains via Operations/Alerts.

- **RBAC-UX**: Authorization planes involved:
  - Workspace management plane (admin, workspace-owned baselines + workspace settings)
  - Tenant-context plane (baseline compare monitoring)
  - Cross-plane access MUST be deny-as-not-found (404).
  - Non-member or not entitled to the workspace scope or tenant scope → 404 (deny-as-not-found)
  - Member but missing capability for the surface → 403

- **BADGE-001**: Any new baseline severity mapping must remain centralized (single mapping source) and covered by tests.
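
The 404 vs 403 semantics in the RBAC-UX note above reduce to an ordered check: scope entitlement first, capability second. The function below is an illustrative sketch with hypothetical parameter names; real enforcement happens server-side through Gates/Policies and the canonical capability registry.

```python
def authorization_status(
    is_workspace_member: bool,
    has_capability: bool,
    tenant_scoped: bool = False,
    has_tenant_access: bool = False,
) -> int:
    """Return the HTTP status implied by the RBAC-UX rules for a surface."""
    # Not entitled to the workspace scope: deny-as-not-found.
    if not is_workspace_member:
        return 404
    # Tenant-context surfaces also require tenant access; absence is also a 404.
    if tenant_scoped and not has_tenant_access:
        return 404
    # Entitled to the scope but missing the capability for the surface.
    if not has_capability:
        return 403
    return 200
```

Checking scope before capability ensures that a non-member never learns whether a resource exists, even if a capability check would also have failed.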

## UI Action Matrix *(mandatory when Filament is changed)*

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Resource | Admin → Governance → Baselines (workspace management) | Create baseline profile (capability-gated) | View action | Edit (capability-gated), Delete/Archive (if present) | None | Create baseline profile | Capture / Compare shortcuts (if present), Edit | Save + Cancel | Yes | Workspace-owned; MUST NOT show tenant scope banner; must not appear in tenant nav. |
| Page | Admin → Settings → Workspace settings | Save (capability-gated) | N/A | N/A | N/A | N/A | N/A | Save + Cancel (or equivalent) | Yes | Adds baseline settings fields; validation must reject malformed mapping. |
| Page | Admin → Tenant context → Governance → Baseline Compare | Compare now (capability-gated) | Link to findings / operation run details | N/A | N/A | N/A | N/A | N/A | Yes | Tenant-context monitoring surface; must not expose workspace management actions. |

### Key Entities *(include if feature involves data)*

- **Baseline Compare Finding**: A finding produced by a baseline compare run, identified by `source = baseline.compare` and a stable fingerprint.
- **Baseline Compare Run**: A run that evaluates tenant configuration against a baseline profile and produces a compare summary that can indicate completeness.
- **Alert Event**: A deduplicated, rule-dispatchable representation of actionable baseline drift or baseline compare failure.
- **Workspace Baseline Settings**: Workspace-specific overrides for severity mapping, alert threshold, and auto-close enablement.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001 (Noise reduction)**: In a controlled test scenario where drift disappears, 100% of baseline drift findings created by baseline compare auto-resolve after the first fully successful compare.
- **SC-002 (Safety)**: In scenarios where the compare outcome is `failed` or `partially_succeeded`, or the run is incomplete, 0 baseline findings are auto-resolved.
- **SC-003 (Alert dedupe)**: The same open baseline drift finding does not generate more than 1 `baseline_high_drift` alert event per open/reopen cycle.
- **SC-004 (Timeliness)**: Baseline compare failures generate a `baseline_compare_failed` alert event within the next alert evaluation cycle.
- **SC-005 (Configurability)**: A workspace admin can change baseline severity mapping and minimum alert severity in under 2 minutes, and newly generated findings reflect the change.