Feature Specification: Baseline Operability & Alert Integration (R1.1–R1.4 Extension)
Feature Branch: 115-baseline-operability-alerts
Created: 2026-02-28
Status: Ready (Design complete; implementation pending)
Input: User description: "115 — Baseline Operability & Alert Integration (R1.1–R1.4 Extension)"
Spec Scope Fields (mandatory)
- Scope: workspace (management) + tenant (monitoring)
- Primary Routes:
  - Workspace (admin): Baselines management (Baseline Profiles) and Workspace Settings
  - Tenant-context (admin): Baseline Compare monitoring and Baseline Compare “run now” surface
- Data Ownership:
  - Workspace-owned: Baseline profiles, baseline-to-tenant assignments, workspace settings, alert rules
  - Tenant-scoped (within a workspace): Findings produced by baseline compare; operation runs for tenant compares
- RBAC:
  - Workspace Baselines (view/manage): workspace members must be granted the workspace baselines view/manage capabilities (from the canonical capability registry)
  - Workspace Settings (view/manage): workspace members must be granted the workspace settings view/manage capabilities (from the canonical capability registry)
  - Alerts (view/manage): workspace members must be granted the alerts view/manage capabilities (from the canonical capability registry)
  - Tenant monitoring surfaces require tenant access (tenant view) in addition to workspace membership
For canonical-view specs: not applicable (this is not a canonical-view feature).
Clarifications
Session 2026-02-28
- Q: What should `baseline.severity_mapping` map from? → A: Baseline drift `change_type` only (keys: `missing_policy`, `different_version`, `unexpected_policy`).
- Q: What is the canonical “fully successful compare” gate for auto-close? → A: Outcome `succeeded` AND Ops-UX canonical `OperationRun.summary_counts` gate: `summary_counts.processed == summary_counts.total` AND `summary_counts.failed == 0`.
- Q: When a previously resolved baseline finding reappears, what status should it transition to? → A: `reopened`.
- Q: For `baseline_compare_failed` alerts, what cooldown behavior applies? → A: Use the existing dispatcher cooldown (no baseline-specific cooldown setting).
- Q: Should `baseline.auto_close_enabled` exist as a kill-switch? → A: Yes; keep it with default `true`.
Definitions
- Baseline finding: A drift finding produced by comparing a tenant against a baseline.
- Fingerprint: A stable identifier for “the same underlying issue” across runs, used for idempotency and alert deduplication.
- Fully successful compare: A compare run that succeeded and is complete (no failed items and all expected items processed).
  - Completeness is proven via Ops-UX canonical `OperationRun.summary_counts` counters: `summary_counts.processed == summary_counts.total` and `summary_counts.failed == 0`.
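The “fully successful compare” definition can be expressed as a single predicate. The following is a minimal sketch, not the project's implementation: the function name `is_fully_successful` is hypothetical, and treating missing counters as “not proven” (fail closed) is an assumption that the spec does not state explicitly.

```python
from typing import Mapping


def is_fully_successful(outcome: str, summary_counts: Mapping[str, int]) -> bool:
    """Return True only when the run succeeded and the counters prove completeness."""
    # Assumption: if any counter is missing, completeness is not proven,
    # so the gate fails closed.
    required = ("total", "processed", "failed")
    if any(key not in summary_counts for key in required):
        return False
    return (
        outcome == "succeeded"
        and summary_counts["processed"] == summary_counts["total"]
        and summary_counts["failed"] == 0
    )
```

A run with outcome `partially_succeeded`, a non-zero `failed` counter, or `processed != total` fails this gate.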
User Scenarios & Testing (mandatory)
User Story 1 - Safe auto-close removes stale baseline drift (Priority: P1)
As an MSP operator, I want baseline drift findings to automatically resolve when drift is no longer present, so the findings list remains actionable and doesn’t accumulate noise.
Why this priority: This is the main “operability” gap; without auto-close, drift remediation cannot be reliably observed, and alerting becomes noisy.
Independent Test: Can be fully tested by running a baseline compare that produces a “seen set” of fingerprints and verifying that previously-open baseline findings not present in the seen set are resolved only when the compare is fully successful.
Acceptance Scenarios:
- Given an open baseline finding with `source = baseline.compare`, When a baseline compare completes fully successfully and that finding’s fingerprint is not present in the current compare result, Then the finding becomes `resolved` with reason `no_longer_drifting`.
- Given an open baseline finding with `source = baseline.compare`, When a baseline compare is `partially_succeeded` or failed (or incomplete), Then no baseline findings are auto-resolved.
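The two scenarios above can be sketched as a small in-memory routine. This is an illustration under stated assumptions: the `Finding` dataclass, the `auto_close` function, and the set of open statuses (`new`, `reopened`) are hypothetical names chosen for the example, and the caller is assumed to have already evaluated the “fully successful” gate.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Assumption: these are the statuses that count as "open" for auto-close.
OPEN_STATUSES = {"new", "reopened"}


@dataclass
class Finding:
    fingerprint: str
    source: str
    status: str
    resolved_reason: Optional[str] = None


def auto_close(
    findings: Iterable[Finding],
    seen: Set[str],
    fully_successful: bool,
    auto_close_enabled: bool = True,
) -> List[str]:
    """Resolve open baseline findings absent from the seen set; return their fingerprints."""
    # Guardrail: never resolve anything unless the run was fully successful
    # and the workspace kill-switch is on.
    if not (fully_successful and auto_close_enabled):
        return []
    closed: List[str] = []
    for finding in findings:
        if finding.source != "baseline.compare":
            continue  # only baseline compare findings are in scope
        if finding.status not in OPEN_STATUSES:
            continue  # already resolved or otherwise not open
        if finding.fingerprint in seen:
            continue  # drift is still present in this run
        finding.status = "resolved"
        finding.resolved_reason = "no_longer_drifting"
        closed.append(finding.fingerprint)
    return closed
```

Passing `fully_successful=False` returns an empty list and leaves every finding untouched, which matches the second scenario.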
User Story 2 - Baseline alerts are precise and deduplicated (Priority: P1)
As an on-call operator, I want alerts for baseline drift and baseline compare failures to trigger only when there is new actionable work (new or reopened findings) or when a compare fails, so I don’t get spammed on every run.
Why this priority: MSP sellability depends on trust in alerts; repeated “same problem” alerts make alerting unusable.
Independent Test: Can be fully tested by creating findings with controlled timestamps and statuses (new/reopened/open) and verifying that only new/reopened findings generate baseline drift alert events, while repeated compares do not.
Acceptance Scenarios:
- Given a new high-severity baseline drift finding (deduped by fingerprint) created after the alert window start, When alerts are evaluated, Then a `baseline_high_drift` event is produced exactly once.
- Given a resolved baseline drift finding that later reappears and transitions to `reopened`, When alerts are evaluated, Then a `baseline_high_drift` event is produced again exactly once.
- Given a baseline compare run that completes with outcome failed or `partially_succeeded`, When alerts are evaluated, Then a `baseline_compare_failed` event is produced (deduped by run identity + cooldown).
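The first two scenarios (one event per new finding, one more per reopen) can be sketched as follows. The `Finding` shape, the `drift_events` function, and the in-memory `already_emitted` set are hypothetical. Including the cycle timestamp in the dedupe key is an assumption made here so that a reopen produces a fresh event; the spec only requires that the key be derived from the fingerprint. Severity filtering is omitted.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Finding:
    fingerprint: str
    status: str
    created_at: datetime
    reopened_at: Optional[datetime] = None


def drift_events(
    findings: Iterable[Finding],
    window_start: datetime,
    already_emitted: Set[str],
) -> List[dict]:
    """Emit one baseline_high_drift event per new or reopened finding in the window."""
    events: List[dict] = []
    for finding in findings:
        if finding.status == "new":
            cycle_start = finding.created_at
        elif finding.status == "reopened":
            cycle_start = finding.reopened_at
        else:
            continue  # not new actionable work
        if cycle_start is None or cycle_start < window_start:
            continue  # outside the evaluation window
        # Assumed key format: fingerprint plus the start of the open/reopen cycle.
        key = f"baseline_high_drift:{finding.fingerprint}:{cycle_start.isoformat()}"
        if key in already_emitted:
            continue  # repeated evaluation of the same cycle stays silent
        already_emitted.add(key)
        events.append(
            {"type": "baseline_high_drift", "dedupe_key": key, "fingerprint": finding.fingerprint}
        )
    return events
```

Evaluating the same findings twice yields events only on the first pass; a later `reopened` transition with a new timestamp yields exactly one more.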
User Story 3 - Workspace-controlled severity mapping and alert threshold (Priority: P2)
As a workspace admin, I want to configure how baseline drift categories map to severity, and optionally the minimum severity that triggers baseline drift alerts, so the system matches the MSP’s operational standards.
Why this priority: Settings are required for enterprise adoption; hardcoded severity and alert thresholds don’t fit different environments.
Independent Test: Can be fully tested by setting workspace overrides and verifying that newly created baseline findings inherit the configured severity and that alert generation respects the configured minimum severity.
Acceptance Scenarios:
- Given a workspace-level baseline severity mapping override, When a new baseline drift finding is created, Then it uses the mapped severity (and rejects invalid severity values).
- Given a workspace-level baseline alert minimum severity override, When baseline findings are evaluated for alerts, Then only findings meeting or exceeding that threshold emit `baseline_high_drift` events.
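“Meeting or exceeding” the minimum severity is an inclusive comparison over the severity ordering low < medium < high < critical. A minimal sketch, with `meets_threshold` as a hypothetical helper name and `high` as the default minimum:

```python
SEVERITY_ORDER = ("low", "medium", "high", "critical")


def meets_threshold(severity: str, minimum: str = "high") -> bool:
    """Inclusive comparison: a finding at exactly the minimum severity qualifies."""
    rank = {name: index for index, name in enumerate(SEVERITY_ORDER)}
    if severity not in rank or minimum not in rank:
        raise ValueError(f"unknown severity: {severity!r} / {minimum!r}")
    return rank[severity] >= rank[minimum]
```

With the default minimum, `high` and `critical` findings qualify while `low` and `medium` do not; lowering the override to `medium` also admits `medium` findings.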
Edge Cases
- Compare completes but is not “fully successful” (e.g., `summary_counts.failed > 0`, incomplete processing where `summary_counts.processed != summary_counts.total`, or compare preconditions prevent the run from being created): auto-close MUST NOT occur.
- Compare does not evaluate all assigned items (e.g., missing baseline snapshot or assignment changes mid-run): auto-close MUST NOT resolve findings for items not evaluated.
- A baseline finding was resolved previously and reappears later: it must transition to an actionable open state (e.g., `reopened`) and be eligible for alerting once.
- Workspace settings payload is malformed (unknown drift categories or invalid severity values): save MUST be rejected and effective values MUST remain unchanged.
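The last edge case (reject a malformed payload and leave effective values unchanged) can be sketched as validate-then-write. Everything named here is illustrative: `validate_mapping`, `save_override`, `effective_mapping`, and especially `SYSTEM_DEFAULTS`, whose values are invented for the example because the spec does not state the default mapping.

```python
from typing import Dict, Mapping

ALLOWED_CHANGE_TYPES = {"missing_policy", "different_version", "unexpected_policy"}
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}

# Hypothetical defaults; the real system defaults are not given in this spec.
SYSTEM_DEFAULTS: Dict[str, str] = {
    "missing_policy": "high",
    "different_version": "medium",
    "unexpected_policy": "low",
}


def validate_mapping(payload: Mapping[str, str]) -> None:
    """Raise ValueError for unknown drift categories or invalid severities."""
    if not isinstance(payload, Mapping):
        raise ValueError("severity mapping must be a key/value mapping")
    for change_type, severity in payload.items():
        if change_type not in ALLOWED_CHANGE_TYPES:
            raise ValueError(f"unknown change_type: {change_type!r}")
        if severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"invalid severity: {severity!r}")


def save_override(settings: Dict[str, dict], payload: Mapping[str, str]) -> None:
    """Validate first, then write, so a rejected payload changes nothing."""
    validate_mapping(payload)
    settings["baseline.severity_mapping"] = dict(payload)


def effective_mapping(settings: Mapping[str, dict]) -> Dict[str, str]:
    """System defaults overlaid with the workspace override, if any."""
    return {**SYSTEM_DEFAULTS, **settings.get("baseline.severity_mapping", {})}
```

Because validation runs before the write, a payload with an unknown key or severity raises and the stored override, and therefore the effective mapping, stays as it was.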
Requirements (mandatory)
Constitution alignment (required): If this feature introduces any Microsoft Graph calls, any write/change behavior,
or any long-running/queued/scheduled work, the spec MUST describe contract registry updates, safety gates
(preview/confirmation/audit), tenant isolation, run observability (OperationRun type/identity/visibility), and tests.
If security-relevant DB-only actions intentionally skip OperationRun, the spec MUST describe AuditLog entries.
Constitution alignment (OPS-UX): If this feature creates/reuses an OperationRun, the spec MUST:
- explicitly state compliance with the Ops-UX 3-surface feedback contract (toast intent-only, progress surfaces, terminal DB notification),
- state that `OperationRun.status` / `OperationRun.outcome` transitions are service-owned (only via `OperationRunService`),
- describe how `summary_counts` keys/values comply with `OperationSummaryKeys::all()` and numeric-only rules,
- clarify scheduled/system-run behavior (initiator null → no terminal DB notification; audit is via Monitoring),
- list which regression guard tests are added/updated to keep these rules enforceable in CI.
Constitution alignment (RBAC-UX): If this feature introduces or changes authorization behavior, the spec MUST:
- state which authorization plane(s) are involved (tenant/admin `/admin` + tenant-context `/admin/t/{tenant}/...` vs platform `/system`),
- ensure any cross-plane access is deny-as-not-found (404),
- explicitly define 404 vs 403 semantics:
  - non-member / not entitled to workspace scope OR tenant scope → 404 (deny-as-not-found)
  - member but missing capability → 403
- describe how authorization is enforced server-side (Gates/Policies) for every mutation/operation-start/credential change,
- reference the canonical capability registry (no raw capability strings; no role-string checks in feature code),
- ensure global search is tenant-scoped and non-member-safe (no hints; inaccessible results treated as 404 semantics),
- ensure destructive-like actions require confirmation (`->requiresConfirmation()`),
- include at least one positive and one negative authorization test, and note any RBAC regression tests added/updated.
Constitution alignment (OPS-EX-AUTH-001): OIDC/SAML login handshakes may perform synchronous outbound HTTP (e.g., token exchange)
on /auth/* endpoints without an OperationRun. This MUST NOT be used for Monitoring/Operations pages.
Constitution alignment (BADGE-001): If this feature changes status-like badges (status/outcome/severity/risk/availability/boolean), the spec MUST describe how badge semantics stay centralized (no ad-hoc mappings) and which tests cover any new/changed values.
Constitution alignment (Filament Action Surfaces): If this feature adds or modifies any Filament Resource / RelationManager / Page, the spec MUST include a “UI Action Matrix” (see below) and explicitly state whether the Action Surface Contract is satisfied. If the contract is not satisfied, the spec MUST include an explicit exemption with rationale.
Constitution alignment (UX-001 - Layout & Information Architecture): If this feature adds or modifies any Filament screen, the spec MUST describe compliance with UX-001: Create/Edit uses Main/Aside layout (3-col grid), all fields inside Sections/Cards (no naked inputs), View pages use Infolists (not disabled edit forms), status badges use BADGE-001, empty states have a specific title + explanation + exactly 1 CTA, and tables provide search/sort/filters for core dimensions. If UX-001 is not fully satisfied, the spec MUST include an explicit exemption with documented rationale.
Functional Requirements
- FR-001 (Finding source contract): Findings created by baseline compare MUST be identifiable as baseline compare findings via a stable source identifier (`source = baseline.compare`).
- FR-002 (Fully successful guardrail): Auto-close MUST run only after a fully successful baseline compare run. “Fully successful” means all of the following:
  - The compare run outcome is `succeeded`.
  - The compare run emits Ops-UX canonical completeness counters in `OperationRun.summary_counts`, and they indicate:
    - `summary_counts.failed == 0`
    - `summary_counts.processed == summary_counts.total`
  - Compare preconditions (e.g., missing active baseline snapshot, missing assignment) are enforced before enqueue and MUST prevent a compare run from being created; therefore auto-close cannot run when preconditions fail.
    - Implementation note (required invariant): precondition failures are returned as stable reason codes and MUST result in no `OperationRun` being created.
- FR-003 (Safe auto-close behavior): After a fully successful compare, the system MUST resolve open baseline compare findings that are not present in the current run’s seen set.
- FR-004 (No partial resolution): The system MUST NOT resolve findings for any items that were not evaluated in the run.
- FR-005 (Resolution reason): Auto-resolved findings MUST record resolution reason `no_longer_drifting`.
- FR-006 (Reopen semantics): If a previously resolved baseline compare finding reappears in a later compare, it MUST transition to an actionable open state (e.g., `reopened`) and be treated as “new actionable work” for alerting.
  - The required open state for reappearance is `reopened`.
- FR-007 (Alert event type: baseline drift): The system MUST support an alert event type `baseline_high_drift` for baseline compare findings.
- FR-008 (Alert producer: baseline drift): Baseline drift alert events MUST be produced only for baseline compare findings that are actionable (open states) AND are either newly created or newly reopened within the evaluation window.
- FR-009 (Baseline drift deduplication): Baseline drift alert events MUST be deduplicated by a stable key derived from the finding fingerprint. The same open finding MUST NOT emit repeated events on subsequent compares.
- FR-010 (Alert event type: compare failed): The system MUST support an alert event type `baseline_compare_failed`.
- FR-011 (Alert producer: compare failed): A `baseline_compare_failed` event MUST be produced when a baseline compare run completes with outcome failed or `partially_succeeded`.
- FR-012 (Compare-failed dedup + cooldown): Compare-failed events MUST be deduplicated per run identity and MUST respect existing cooldown/quiet-hours behavior.
  - This feature MUST NOT introduce a baseline-specific cooldown interval; it reuses the existing dispatcher cooldown behavior.
- FR-013 (Canonical run types): Baseline capture and baseline compare MUST use centrally defined canonical run types (`baseline_capture`, `baseline_compare`) and MUST NOT rely on ad-hoc string literals.
- FR-014 (Workspace settings: severity mapping): The system MUST support a workspace setting `baseline.severity_mapping` that maps baseline drift categories to severity.
- FR-015 (Workspace settings: validation): The severity mapping MUST:
  - accept only the baseline drift `change_type` keys `missing_policy`, `different_version`, and `unexpected_policy` (no other keys),
  - reject invalid severity values,
  - and expose “effective value” behavior (system defaults + workspace overrides).
- FR-016 (Workspace settings: alert threshold): The system MUST support a workspace setting `baseline.alert_min_severity` with allowed values low/medium/high/critical and default high.
  - Severity threshold comparison MUST use the canonical severity ordering: `low < medium < high < critical` (inclusive).
- FR-017 (Workspace settings: auto-close toggle): The system MUST support a workspace setting `baseline.auto_close_enabled` defaulting to true.
  - When set to `false`, auto-close MUST be skipped even if the compare is fully successful.
- FR-018 (Information architecture / ownership): Baseline Profile CRUD MUST remain workspace-owned. It MUST NOT appear tenant-scoped, must not show tenant scope banners, and must not be reachable from tenant-only navigation.
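The reopen semantics of FR-006 amount to a small state transition keyed by fingerprint. The sketch below uses a plain dict as a stand-in for the findings store; the function name `apply_compare_results` and the initial status `new` for first-time findings are assumptions for illustration.

```python
from typing import Dict, Iterable


def apply_compare_results(findings: Dict[str, str], seen: Iterable[str]) -> Dict[str, str]:
    """Apply one run's seen fingerprints; return fingerprint -> transition applied."""
    transitions: Dict[str, str] = {}
    for fingerprint in seen:
        status = findings.get(fingerprint)
        if status is None:
            # First appearance. Assumption: the initial status is "new".
            findings[fingerprint] = "new"
            transitions[fingerprint] = "created"
        elif status == "resolved":
            # Reappearance of a resolved finding must land in "reopened".
            findings[fingerprint] = "reopened"
            transitions[fingerprint] = "reopened"
        # Findings that are already open are left untouched, so a repeated
        # compare does not count as new actionable work.
    return transitions
```

Only the fingerprints returned in `transitions` would be candidates for a `baseline_high_drift` event; running the same compare again returns an empty mapping.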
Assumptions & Dependencies
- Baseline compare already produces stable finding fingerprints and a per-run Ops-UX `summary_counts` payload that can express completeness (`processed` vs `total`) and failures (`failed`).
- Findings support lifecycle transitions including resolve with a reason and reopen semantics for a recurring fingerprint.
- Alert dispatch already supports deduplication, cooldown, and quiet hours; this feature reuses that behavior for new baseline-specific event types.
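Because cooldown and quiet hours stay with the existing dispatcher, the compare-failed producer only needs to decide whether a run qualifies and to attach a per-run dedupe key. A minimal sketch under those assumptions; `compare_failed_event`, the key format, and the in-memory `emitted` set are hypothetical:

```python
from typing import Optional, Set

FAILURE_OUTCOMES = {"failed", "partially_succeeded"}


def compare_failed_event(
    run_id: str,
    run_type: str,
    outcome: str,
    emitted: Set[str],
) -> Optional[dict]:
    """Return a baseline_compare_failed event for a qualifying run, at most once per run."""
    if run_type != "baseline_compare" or outcome not in FAILURE_OUTCOMES:
        return None
    # Assumed key format: one key per run identity.
    key = f"baseline_compare_failed:{run_id}"
    if key in emitted:
        return None  # this run has already produced its event
    emitted.add(key)
    # Cooldown and quiet hours are intentionally not handled here; they are
    # left to the existing dispatcher.
    return {"type": "baseline_compare_failed", "dedupe_key": key, "run_id": run_id}
```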
Constitution Alignment Notes (non-functional but mandatory)
- This feature adds no new Microsoft Graph calls.
- Baseline compare and alert evaluation are long-running operations; any new auto-close and alert integration MUST preserve tenant isolation and run observability.
- Ops-UX (3-surface feedback): baseline compare/capture and alerts evaluation must continue to provide:
  - an intent-only toast on start,
  - progress surfaces (Operations pages),
  - and terminal DB notifications where applicable.
- Operation run ownership: Operation status/outcome transitions are owned by the operations subsystem and must not be mutated directly by UI code.
- Summary counts contract: Any summary counters produced/updated by this feature MUST use the canonical summary key registry and numeric-only values.
- Scheduled/system runs: Runs initiated without a human initiator MUST not produce terminal DB notifications; monitoring remains via Operations/Alerts.
- RBAC-UX: Authorization planes involved:
  - Workspace management plane (admin, workspace-owned baselines + workspace settings)
  - Tenant-context plane (baseline compare monitoring)
  - Cross-plane access MUST be deny-as-not-found (404).
  - Non-member or not entitled to the workspace scope or tenant scope → 404 (deny-as-not-found)
  - Member but missing capability for the surface → 403
- BADGE-001: Any new baseline severity mapping must remain centralized (single mapping source) and covered by tests.
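The 404 vs 403 semantics in the RBAC-UX note depend on the order of checks: scope entitlement is tested before capability, so a non-member never learns whether a resource exists. A minimal sketch with a hypothetical `authorize` function that returns an HTTP status code:

```python
def authorize(
    is_workspace_member: bool,
    has_capability: bool,
    tenant_context: bool = False,
    has_tenant_access: bool = False,
) -> int:
    """Return the HTTP status a request should receive."""
    # Scope checks come first: anyone outside the workspace (or tenant, on
    # tenant-context surfaces) is denied as not found.
    if not is_workspace_member:
        return 404
    if tenant_context and not has_tenant_access:
        return 404
    # Inside the scope, a missing capability is a plain forbidden.
    if not has_capability:
        return 403
    return 200
```

A workspace member without tenant access gets 404 on a tenant-context surface even if they hold the capability, while a member with tenant access but no capability gets 403.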
UI Action Matrix (mandatory when Filament is changed)
| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Resource | Admin → Governance → Baselines (workspace management) | Create baseline profile (capability-gated) | View action | Edit (capability-gated), Delete/Archive (if present) | None | Create baseline profile | Capture / Compare shortcuts (if present), Edit | Save + Cancel | Yes | Workspace-owned; MUST NOT show tenant scope banner; must not appear in tenant nav. |
| Page | Admin → Settings → Workspace settings | Save (capability-gated) | N/A | N/A | N/A | N/A | N/A | Save + Cancel (or equivalent) | Yes | Adds baseline settings fields; validation must reject malformed mapping. |
| Page | Admin → Tenant context → Governance → Baseline Compare | Compare now (capability-gated) | Link to findings / operation run details | N/A | N/A | N/A | N/A | N/A | Yes | Tenant-context monitoring surface; must not expose workspace management actions. |
Key Entities (include if feature involves data)
- Baseline Compare Finding: A finding produced by a baseline compare run, identified by `source = baseline.compare` and a stable fingerprint.
- Baseline Compare Run: A run that evaluates tenant configuration against a baseline profile and produces a compare summary that can indicate completeness.
- Alert Event: A deduplicated, rule-dispatchable representation of actionable baseline drift or baseline compare failure.
- Workspace Baseline Settings: Workspace-specific overrides for severity mapping, alert threshold, and auto-close enablement.
Success Criteria (mandatory)
Measurable Outcomes
- SC-001 (Noise reduction): In a controlled test scenario where drift disappears, 100% of baseline drift findings created by baseline compare auto-resolve after the first fully successful compare.
- SC-002 (Safety): In scenarios where compare is failed/`partially_succeeded`/incomplete, 0 baseline findings are auto-resolved.
- SC-003 (Alert dedupe): The same open baseline drift finding does not generate more than 1 `baseline_high_drift` alert event per open/reopen cycle.
- SC-004 (Timeliness): Baseline compare failures generate a `baseline_compare_failed` alert event within the next alert evaluation cycle.
- SC-005 (Configurability): A workspace admin can change baseline severity mapping and minimum alert severity in under 2 minutes, and newly generated findings reflect the change.