# Feature Specification: Baseline Operability & Alert Integration (R1.1–R1.4 Extension)

**Feature Branch**: `115-baseline-operability-alerts`

**Created**: 2026-02-28

**Status**: Ready (Design complete; implementation pending)

**Input**: User description: "115 — Baseline Operability & Alert Integration (R1.1–R1.4 Extension)"

## Spec Scope Fields *(mandatory)*

- **Scope**: workspace (management) + tenant (monitoring)
- **Primary Routes**:
  - Workspace (admin): Baselines management (Baseline Profiles) and Workspace Settings
  - Tenant-context (admin): Baseline Compare monitoring and Baseline Compare “run now” surface
- **Data Ownership**:
  - Workspace-owned: Baseline profiles, baseline-to-tenant assignments, workspace settings, alert rules
  - Tenant-scoped (within a workspace): Findings produced by baseline compare; operation runs for tenant compares
- **RBAC**:
  - Workspace Baselines (view/manage): workspace members must be granted the workspace baselines view/manage capabilities (from the canonical capability registry)
  - Workspace Settings (view/manage): workspace members must be granted the workspace settings view/manage capabilities (from the canonical capability registry)
  - Alerts (view/manage): workspace members must be granted the alerts view/manage capabilities (from the canonical capability registry)
  - Tenant monitoring surfaces require tenant access (tenant view) in addition to workspace membership

For canonical-view specs: not applicable (this is not a canonical-view feature).

## Clarifications

### Session 2026-02-28

- Q: What should `baseline.severity_mapping` map from? → A: Baseline drift `change_type` only (keys: `missing_policy`, `different_version`, `unexpected_policy`).
- Q: What is the canonical “fully successful compare” gate for auto-close? → A: Outcome `succeeded` AND Ops-UX canonical `OperationRun.summary_counts` gate: `summary_counts.processed == summary_counts.total` AND `summary_counts.failed == 0`.
- Q: When a previously resolved baseline finding reappears, what status should it transition to? → A: `reopened`.
- Q: For `baseline_compare_failed` alerts, what cooldown behavior applies? → A: Use the existing dispatcher cooldown (no baseline-specific cooldown setting).
- Q: Should `baseline.auto_close_enabled` exist as a kill-switch? → A: Yes; keep it with default `true`.

### Definitions

- **Baseline finding**: A drift finding produced by comparing a tenant against a baseline.
- **Fingerprint**: A stable identifier for “the same underlying issue” across runs, used for idempotency and alert deduplication.
- **Fully successful compare**: A compare run that succeeded and is complete (no failed items and all expected items processed).
  - Completeness is proven via Ops-UX canonical `OperationRun.summary_counts` counters: `summary_counts.processed == summary_counts.total` and `summary_counts.failed == 0`.

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Safe auto-close removes stale baseline drift (Priority: P1)

As an MSP operator, I want baseline drift findings to automatically resolve when drift is no longer present, so the findings list remains actionable and doesn’t accumulate noise.

**Why this priority**: This is the main “operability” gap; without auto-close, drift remediation cannot be reliably observed, and alerting becomes noisy.

**Independent Test**: Can be fully tested by running a baseline compare that produces a “seen set” of fingerprints and verifying that previously-open baseline findings not present in the seen set are resolved only when the compare is fully successful.

**Acceptance Scenarios**:

1. **Given** an open baseline finding with `source = baseline.compare`, **When** a baseline compare completes fully successfully and that finding’s fingerprint is not present in the current compare result, **Then** the finding becomes `resolved` with reason `no_longer_drifting`.
2. **Given** an open baseline finding with `source = baseline.compare`, **When** a baseline compare is `partially_succeeded` or failed (or incomplete), **Then** no baseline findings are auto-resolved.

---

### User Story 2 - Baseline alerts are precise and deduplicated (Priority: P1)

As an on-call operator, I want alerts for baseline drift and baseline compare failures to trigger only when there is new actionable work (new or reopened findings) or when a compare fails, so I don’t get spammed on every run.

**Why this priority**: MSP sellability depends on trust in alerts; repeated “same problem” alerts make alerting unusable.

**Independent Test**: Can be fully tested by creating findings with controlled timestamps and statuses (new/reopened/open) and verifying that only new/reopened findings generate baseline drift alert events, while repeated compares do not.

**Acceptance Scenarios**:

1. **Given** a new high-severity baseline drift finding (deduped by fingerprint) created after the alert window start, **When** alerts are evaluated, **Then** a `baseline_high_drift` event is produced exactly once.
2. **Given** a resolved baseline drift finding that later reappears and transitions to `reopened`, **When** alerts are evaluated, **Then** a `baseline_high_drift` event is produced again exactly once.
3. **Given** a baseline compare run that completes with outcome failed or `partially_succeeded`, **When** alerts are evaluated, **Then** a `baseline_compare_failed` event is produced (deduped by run identity + cooldown).

---

### User Story 3 - Workspace-controlled severity mapping and alert threshold (Priority: P2)

As a workspace admin, I want to configure how baseline drift categories map to severity, and optionally the minimum severity that triggers baseline drift alerts, so the system matches the MSP’s operational standards.

**Why this priority**: Settings are required for enterprise adoption; hardcoded severity and alert thresholds don’t fit different environments.

**Independent Test**: Can be fully tested by setting workspace overrides and verifying that newly created baseline findings inherit the configured severity and that alert generation respects the configured minimum severity.

**Acceptance Scenarios**:

1. **Given** a workspace-level baseline severity mapping override, **When** a new baseline drift finding is created, **Then** it uses the mapped severity (and rejects invalid severity values).
2. **Given** a workspace-level baseline alert minimum severity override, **When** baseline findings are evaluated for alerts, **Then** only findings meeting or exceeding that threshold emit `baseline_high_drift` events.

### Edge Cases

- Compare completes but is not “fully successful” (e.g., `summary_counts.failed > 0` or incomplete processing where `summary_counts.processed != summary_counts.total`), or compare preconditions prevent the run from being created: auto-close MUST NOT occur.
- Compare does not evaluate all assigned items (e.g., missing baseline snapshot or assignment changes mid-run): auto-close MUST NOT resolve findings for items not evaluated.
- A baseline finding was resolved previously and reappears later: it must transition to an actionable open state (e.g., `reopened`) and be eligible for alerting once.
- Workspace settings payload is malformed (unknown drift categories or invalid severity values): save MUST be rejected and effective values MUST remain unchanged.

## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any Microsoft Graph calls, any write/change behavior, or any long-running/queued/scheduled work, the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, run observability (`OperationRun` type/identity/visibility), and tests. If security-relevant DB-only actions intentionally skip `OperationRun`, the spec MUST describe `AuditLog` entries.

**Constitution alignment (OPS-UX):** If this feature creates/reuses an `OperationRun`, the spec MUST:

- explicitly state compliance with the Ops-UX 3-surface feedback contract (toast intent-only, progress surfaces, terminal DB notification),
- state that `OperationRun.status` / `OperationRun.outcome` transitions are service-owned (only via `OperationRunService`),
- describe how `summary_counts` keys/values comply with `OperationSummaryKeys::all()` and numeric-only rules,
- clarify scheduled/system-run behavior (initiator null → no terminal DB notification; audit is via Monitoring),
- list which regression guard tests are added/updated to keep these rules enforceable in CI.

**Constitution alignment (RBAC-UX):** If this feature introduces or changes authorization behavior, the spec MUST:

- state which authorization plane(s) are involved (tenant/admin `/admin` + tenant-context `/admin/t/{tenant}/...` vs platform `/system`),
- ensure any cross-plane access is deny-as-not-found (404),
- explicitly define 404 vs 403 semantics:
  - non-member / not entitled to workspace scope OR tenant scope → 404 (deny-as-not-found)
  - member but missing capability → 403
- describe how authorization is enforced server-side (Gates/Policies) for every mutation/operation-start/credential change,
- reference the canonical capability registry (no raw capability strings; no role-string checks in feature code),
- ensure global search is tenant-scoped and non-member-safe (no hints; inaccessible results treated as 404 semantics),
- ensure destructive-like actions require confirmation (`->requiresConfirmation()`),
- include at least one positive and one negative authorization test, and note any RBAC regression tests added/updated.

**Constitution alignment (OPS-EX-AUTH-001):** OIDC/SAML login handshakes may perform synchronous outbound HTTP (e.g., token exchange) on `/auth/*` endpoints without an `OperationRun`. This MUST NOT be used for Monitoring/Operations pages.

**Constitution alignment (BADGE-001):** If this feature changes status-like badges (status/outcome/severity/risk/availability/boolean), the spec MUST describe how badge semantics stay centralized (no ad-hoc mappings) and which tests cover any new/changed values.

**Constitution alignment (Filament Action Surfaces):** If this feature adds or modifies any Filament Resource / RelationManager / Page, the spec MUST include a “UI Action Matrix” (see below) and explicitly state whether the Action Surface Contract is satisfied. If the contract is not satisfied, the spec MUST include an explicit exemption with rationale.

**Constitution alignment (UX-001 — Layout & Information Architecture):** If this feature adds or modifies any Filament screen, the spec MUST describe compliance with UX-001: Create/Edit uses Main/Aside layout (3-col grid), all fields inside Sections/Cards (no naked inputs), View pages use Infolists (not disabled edit forms), status badges use BADGE-001, empty states have a specific title + explanation + exactly 1 CTA, and tables provide search/sort/filters for core dimensions. If UX-001 is not fully satisfied, the spec MUST include an explicit exemption with documented rationale.

### Functional Requirements

- **FR-001 (Finding source contract)**: Findings created by baseline compare MUST be identifiable as baseline compare findings via a stable source identifier (`source = baseline.compare`).
- **FR-002 (Fully successful guardrail)**: Auto-close MUST run only after a fully successful baseline compare run. “Fully successful” means all of the following:
  - The compare run outcome is `succeeded`.
  - The compare run emits Ops-UX canonical completeness counters in `OperationRun.summary_counts`, and they indicate:
    - `summary_counts.failed == 0`
    - `summary_counts.processed == summary_counts.total`
  - Compare preconditions (e.g., missing active baseline snapshot, missing assignment) are enforced before enqueue and MUST prevent a compare run from being created; therefore auto-close cannot run when preconditions fail.
    - **Implementation note (required invariant)**: precondition failures are returned as stable reason codes and MUST result in **no `OperationRun` being created**.
- **FR-003 (Safe auto-close behavior)**: After a fully successful compare, the system MUST resolve open baseline compare findings that are not present in the current run’s seen set.
- **FR-004 (No partial resolution)**: The system MUST NOT resolve findings for any items that were not evaluated in the run.
- **FR-005 (Resolution reason)**: Auto-resolved findings MUST record resolution reason `no_longer_drifting`.
- **FR-006 (Reopen semantics)**: If a previously resolved baseline compare finding reappears in a later compare, it MUST transition to an actionable open state (e.g., `reopened`) and be treated as “new actionable work” for alerting.
  - The required open state for reappearance is `reopened`.
- **FR-007 (Alert event type: baseline drift)**: The system MUST support an alert event type `baseline_high_drift` for baseline compare findings.
- **FR-008 (Alert producer: baseline drift)**: Baseline drift alert events MUST be produced only for baseline compare findings that are actionable (open states) AND are either newly created or newly reopened within the evaluation window.
- **FR-009 (Baseline drift deduplication)**: Baseline drift alert events MUST be deduplicated by a stable key derived from the finding fingerprint. The same open finding MUST NOT emit repeated events on subsequent compares.
- **FR-010 (Alert event type: compare failed)**: The system MUST support an alert event type `baseline_compare_failed`.
- **FR-011 (Alert producer: compare failed)**: A `baseline_compare_failed` event MUST be produced when a baseline compare run completes with outcome failed or `partially_succeeded`.
- **FR-012 (Compare-failed dedup + cooldown)**: Compare-failed events MUST be deduplicated per run identity and MUST respect existing cooldown/quiet-hours behavior.
  - This feature MUST NOT introduce a baseline-specific cooldown interval; it reuses the existing dispatcher cooldown behavior.
- **FR-013 (Canonical run types)**: Baseline capture and baseline compare MUST use centrally defined canonical run types (`baseline_capture`, `baseline_compare`) and MUST NOT rely on ad-hoc string literals.
- **FR-014 (Workspace settings: severity mapping)**: The system MUST support a workspace setting `baseline.severity_mapping` that maps baseline drift categories to severity.
- **FR-015 (Workspace settings: validation)**: The severity mapping MUST:
  - accept only the baseline drift `change_type` keys `missing_policy`, `different_version`, and `unexpected_policy` (no other keys),
  - reject invalid severity values,
  - and expose “effective value” behavior (system defaults + workspace overrides).
- **FR-016 (Workspace settings: alert threshold)**: The system MUST support a workspace setting `baseline.alert_min_severity` with allowed values low/medium/high/critical and default high.
  - Severity threshold comparison MUST use the canonical severity ordering: `low < medium < high < critical` (inclusive).
- **FR-017 (Workspace settings: auto-close toggle)**: The system MUST support a workspace setting `baseline.auto_close_enabled` defaulting to true.
  - When set to `false`, auto-close MUST be skipped even if the compare is fully successful.
- **FR-018 (Information architecture / ownership)**: Baseline Profile CRUD MUST remain workspace-owned. It MUST NOT appear tenant-scoped, must not show tenant scope banners, and must not be reachable from tenant-only navigation.

#### Assumptions & Dependencies

- Baseline compare already produces stable finding fingerprints and a per-run Ops-UX `summary_counts` payload that can express completeness (`processed` vs `total`) and failures (`failed`).
- Findings support lifecycle transitions including resolve with a reason and reopen semantics for a recurring fingerprint.
- Alert dispatch already supports deduplication, cooldown, and quiet hours; this feature reuses that behavior for new baseline-specific event types.

#### Constitution Alignment Notes (non-functional but mandatory)

- This feature adds no new Microsoft Graph calls.
- Baseline compare and alert evaluation are long-running operations; any new auto-close and alert integration MUST preserve tenant isolation and run observability.
- **Ops-UX (3-surface feedback)**: baseline compare/capture and alerts evaluation must continue to provide:
  - an intent-only toast on start,
  - progress surfaces (Operations pages),
  - and terminal DB notifications where applicable.
- **Operation run ownership**: Operation status/outcome transitions are owned by the operations subsystem and must not be mutated directly by UI code.
- **Summary counts contract**: Any summary counters produced/updated by this feature MUST use the canonical summary key registry and numeric-only values.
- **Scheduled/system runs**: Runs initiated without a human initiator MUST NOT produce terminal DB notifications; monitoring remains via Operations/Alerts.
- **RBAC-UX**: Authorization planes involved:
  - Workspace management plane (admin, workspace-owned baselines + workspace settings)
  - Tenant-context plane (baseline compare monitoring)

  Cross-plane access MUST be deny-as-not-found (404).
  - Non-member or not entitled to the workspace scope or tenant scope → 404 (deny-as-not-found)
  - Member but missing capability for the surface → 403
- **BADGE-001**: Any new baseline severity mapping must remain centralized (single mapping source) and covered by tests.

## UI Action Matrix *(mandatory when Filament is changed)*

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Resource | Admin → Governance → Baselines (workspace management) | Create baseline profile (capability-gated) | View action | Edit (capability-gated), Delete/Archive (if present) | None | Create baseline profile | Capture / Compare shortcuts (if present), Edit | Save + Cancel | Yes | Workspace-owned; MUST NOT show tenant scope banner; must not appear in tenant nav. |
| Page | Admin → Settings → Workspace settings | Save (capability-gated) | N/A | N/A | N/A | N/A | N/A | Save + Cancel (or equivalent) | Yes | Adds baseline settings fields; validation must reject malformed mapping. |
| Page | Admin → Tenant context → Governance → Baseline Compare | Compare now (capability-gated) | Link to findings / operation run details | N/A | N/A | N/A | N/A | N/A | Yes | Tenant-context monitoring surface; must not expose workspace management actions. |

### Key Entities *(include if feature involves data)*

- **Baseline Compare Finding**: A finding produced by a baseline compare run, identified by `source = baseline.compare` and a stable fingerprint.
- **Baseline Compare Run**: A run that evaluates tenant configuration against a baseline profile and produces a compare summary that can indicate completeness.
- **Alert Event**: A deduplicated, rule-dispatchable representation of actionable baseline drift or baseline compare failure.
- **Workspace Baseline Settings**: Workspace-specific overrides for severity mapping, alert threshold, and auto-close enablement.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001 (Noise reduction)**: In a controlled test scenario where drift disappears, 100% of baseline drift findings created by baseline compare auto-resolve after the first fully successful compare.
- **SC-002 (Safety)**: In scenarios where compare is failed/`partially_succeeded`/incomplete, 0 baseline findings are auto-resolved.
- **SC-003 (Alert dedupe)**: The same open baseline drift finding does not generate more than 1 `baseline_high_drift` alert event per open/reopen cycle.
- **SC-004 (Timeliness)**: Baseline compare failures generate a `baseline_compare_failed` alert event within the next alert evaluation cycle.
- **SC-005 (Configurability)**: A workspace admin can change baseline severity mapping and minimum alert severity in under 2 minutes, and newly generated findings reflect the change.
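
**Illustrative sketch (auto-close gate):** The “fully successful” guardrail (FR-002) and the auto-close rules (FR-003, FR-005, FR-017) can be read as a small decision procedure. The Python below is a language-neutral sketch only, not the project's implementation. `Run`, `Finding`, `OPEN_STATUSES`, `is_fully_successful`, and `auto_close` are hypothetical names introduced for illustration, and the set of open statuses is an assumption. The sketch also assumes that a fully successful run evaluated every assigned item, so it does not model FR-004 separately.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    """Hypothetical stand-in for OperationRun: only the fields the gate reads."""
    outcome: str
    summary_counts: dict = field(default_factory=dict)


@dataclass
class Finding:
    """Hypothetical stand-in for a finding record."""
    fingerprint: str
    source: str
    status: str
    resolved_reason: Optional[str] = None


# Assumption: which statuses count as "open" is not enumerated in the spec.
OPEN_STATUSES = {"new", "reopened"}


def is_fully_successful(run: Run) -> bool:
    """FR-002: outcome succeeded, failed == 0, processed == total."""
    counts = run.summary_counts
    # Missing counters mean completeness is not proven, so the gate stays closed.
    if not all(key in counts for key in ("processed", "total", "failed")):
        return False
    return (
        run.outcome == "succeeded"
        and counts["failed"] == 0
        and counts["processed"] == counts["total"]
    )


def auto_close(run: Run, findings: list, seen: set, auto_close_enabled: bool = True) -> list:
    """Resolve open baseline compare findings absent from the seen set.

    Returns the fingerprints that were resolved; returns [] when the
    kill-switch is off (FR-017) or the run is not fully successful.
    """
    if not auto_close_enabled or not is_fully_successful(run):
        return []
    closed = []
    for finding in findings:
        if (
            finding.source == "baseline.compare"
            and finding.status in OPEN_STATUSES
            and finding.fingerprint not in seen
        ):
            finding.status = "resolved"
            finding.resolved_reason = "no_longer_drifting"  # FR-005
            closed.append(finding.fingerprint)
    return closed
```

Treating missing counters as “not complete” keeps the gate conservative: a run that cannot prove completeness never triggers resolution, which matches the intent of SC-002.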
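
**Illustrative sketch (severity settings):** The validation and threshold rules in FR-015 and FR-016 can be sketched the same way. Again this is hedged, illustrative Python rather than the project's code. `DEFAULT_MAPPING` is a made-up placeholder, because the spec does not state the system default mapping, and the function names are hypothetical.

```python
# Canonical ordering from FR-016: low < medium < high < critical.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]

# Allowed keys from FR-015 (baseline drift change_type values).
ALLOWED_CHANGE_TYPES = {"missing_policy", "different_version", "unexpected_policy"}

# Assumption: placeholder defaults; the real system defaults are not given in the spec.
DEFAULT_MAPPING = {
    "missing_policy": "high",
    "different_version": "medium",
    "unexpected_policy": "low",
}


def validate_mapping(override: dict) -> None:
    """Raise ValueError for unknown change types or invalid severities."""
    unknown = set(override) - ALLOWED_CHANGE_TYPES
    if unknown:
        raise ValueError(f"unknown change types: {sorted(map(str, unknown))}")
    invalid = {k: v for k, v in override.items() if v not in SEVERITY_ORDER}
    if invalid:
        raise ValueError(f"invalid severity values: {invalid}")


def effective_mapping(override: dict, defaults: dict = DEFAULT_MAPPING) -> dict:
    """System defaults overlaid with validated workspace overrides."""
    validate_mapping(override)  # a rejected payload never changes effective values
    return {**defaults, **override}


def meets_threshold(severity: str, minimum: str = "high") -> bool:
    """Inclusive comparison against baseline.alert_min_severity (default high)."""
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(minimum)
```

For example, `effective_mapping({"different_version": "critical"})` keeps the placeholder defaults for the other two keys and overrides only `different_version`. With the default threshold, `meets_threshold("high")` is true and `meets_threshold("medium")` is false.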