Ahmed Darrazi 61c4432de1 feat(115): baseline operability + alerts

2026-03-01 03:23:39 +01:00

Feature Specification: Baseline Operability & Alert Integration (R1.1–R1.4 Extension)

Feature Branch: 115-baseline-operability-alerts
Created: 2026-02-28
Status: Ready (Design complete; implementation pending)
Input: User description: "115 — Baseline Operability & Alert Integration (R1.1–R1.4 Extension)"

Spec Scope Fields (mandatory)

Scope: workspace (management) + tenant (monitoring)
Primary Routes:
- Workspace (admin): Baselines management (Baseline Profiles) and Workspace Settings
- Tenant-context (admin): Baseline Compare monitoring and Baseline Compare “run now” surface
Data Ownership:
- Workspace-owned: Baseline profiles, baseline-to-tenant assignments, workspace settings, alert rules
- Tenant-scoped (within a workspace): Findings produced by baseline compare; operation runs for tenant compares
RBAC:
- Workspace Baselines (view/manage): workspace members must be granted the workspace baselines view/manage capabilities (from the canonical capability registry)
- Workspace Settings (view/manage): workspace members must be granted the workspace settings view/manage capabilities (from the canonical capability registry)
- Alerts (view/manage): workspace members must be granted the alerts view/manage capabilities (from the canonical capability registry)
- Tenant monitoring surfaces require tenant access (tenant view) in addition to workspace membership

For canonical-view specs: not applicable (this is not a canonical-view feature).

Clarifications

Session 2026-02-28

Q: What should baseline.severity_mapping map from? → A: Baseline drift change_type only (keys: missing_policy, different_version, unexpected_policy).
Q: What is the canonical “fully successful compare” gate for auto-close? → A: Outcome succeeded AND Ops-UX canonical OperationRun.summary_counts gate: summary_counts.processed == summary_counts.total AND summary_counts.failed == 0.
Q: When a previously resolved baseline finding reappears, what status should it transition to? → A: reopened.
Q: For baseline_compare_failed alerts, what cooldown behavior applies? → A: Use the existing dispatcher cooldown (no baseline-specific cooldown setting).
Q: Should baseline.auto_close_enabled exist as a kill-switch? → A: Yes; keep it with default true.

Definitions

Baseline finding: A drift finding produced by comparing a tenant against a baseline.
Fingerprint: A stable identifier for “the same underlying issue” across runs, used for idempotency and alert deduplication.
Fully successful compare: A compare run that succeeded and is complete (no failed items and all expected items processed).
- Completeness is proven via Ops-UX canonical OperationRun.summary_counts counters: summary_counts.processed == summary_counts.total and summary_counts.failed == 0.
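The gate above can be written as a single predicate. The following Python sketch is illustrative only: the function name `is_fully_successful` and the plain-mapping shape of `summary_counts` are assumptions, not the project's actual API. Only the `succeeded` outcome and the three counter keys come from this spec.

```python
from typing import Mapping


def is_fully_successful(outcome: str, summary_counts: Mapping[str, int]) -> bool:
    """Return True only when the run succeeded and the counters prove completeness."""
    # Assumption: missing counters mean completeness is not proven, so fail closed.
    required = ("total", "processed", "failed")
    if any(key not in summary_counts for key in required):
        return False
    return (
        outcome == "succeeded"
        and summary_counts["processed"] == summary_counts["total"]
        and summary_counts["failed"] == 0
    )
```

Failing closed on missing counters is a design choice of this sketch: a run that cannot prove completeness is treated the same as an incomplete run.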

User Scenarios & Testing (mandatory)

User Story 1 - Safe auto-close removes stale baseline drift (Priority: P1)

As an MSP operator, I want baseline drift findings to automatically resolve when drift is no longer present, so the findings list remains actionable and doesn’t accumulate noise.

Why this priority: This is the main “operability” gap; without auto-close, drift remediation cannot be reliably observed, and alerting becomes noisy.

Independent Test: Can be fully tested by running a baseline compare that produces a “seen set” of fingerprints and verifying that previously-open baseline findings not present in the seen set are resolved only when the compare is fully successful.

Acceptance Scenarios:

Given an open baseline finding with source = baseline.compare, When a baseline compare completes fully successfully and that finding’s fingerprint is not present in the current compare result, Then the finding becomes resolved with reason no_longer_drifting.
Given an open baseline finding with source = baseline.compare, When a baseline compare is partially_succeeded or failed (or incomplete), Then no baseline findings are auto-resolved.
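The two scenarios above reduce to a set difference guarded by the success gate. A minimal Python sketch, assuming a hypothetical in-memory `Finding` record and the open statuses `new`, `open`, and `reopened` (the real persistence model and status list are not specified here):

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Assumption: these are the actionable open states.
OPEN_STATUSES = {"new", "open", "reopened"}


@dataclass
class Finding:
    fingerprint: str
    source: str
    status: str
    resolved_reason: Optional[str] = None


def auto_close(findings: Iterable[Finding], seen: Set[str], fully_successful: bool) -> List[Finding]:
    """Resolve open baseline findings absent from the seen set; return those resolved."""
    if not fully_successful:
        return []  # failed, partial, or incomplete runs never resolve anything
    resolved = []
    for finding in findings:
        if finding.source != "baseline.compare" or finding.status not in OPEN_STATUSES:
            continue  # only open baseline compare findings are candidates
        if finding.fingerprint in seen:
            continue  # drift is still present
        finding.status = "resolved"
        finding.resolved_reason = "no_longer_drifting"
        resolved.append(finding)
    return resolved
```

Passing the gate result in as a boolean keeps the resolution step independent of how completeness is computed.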

User Story 2 - Baseline alerts are precise and deduplicated (Priority: P1)

As an on-call operator, I want alerts for baseline drift and baseline compare failures to trigger only when there is new actionable work (new or reopened findings) or when a compare fails, so I don’t get spammed on every run.

Why this priority: MSP sellability depends on trust in alerts; repeated “same problem” alerts make alerting unusable.

Independent Test: Can be fully tested by creating findings with controlled timestamps and statuses (new/reopened/open) and verifying that only new/reopened findings generate baseline drift alert events, while repeated compares do not.

Acceptance Scenarios:

Given a new high-severity baseline drift finding (deduped by fingerprint) created after the alert window start, When alerts are evaluated, Then a baseline_high_drift event is produced exactly once.
Given a resolved baseline drift finding that later reappears and transitions to reopened, When alerts are evaluated, Then a baseline_high_drift event is produced again exactly once.
Given a baseline compare run that completes with outcome failed or partially_succeeded, When alerts are evaluated, Then a baseline_compare_failed event is produced (deduped by run identity + cooldown).
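The drift scenarios above can be sketched as a producer that emits one event per open/reopen cycle. This Python example is a sketch under stated assumptions: the `Finding` fields, the `emitted_keys` set standing in for whatever dedup store the dispatcher uses, and the key format (fingerprint plus cycle timestamp) are all hypothetical. The spec only requires that the key be derived from the fingerprint.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Set


@dataclass
class Finding:
    fingerprint: str
    status: str  # "new", "open", "reopened", or "resolved"
    created_at: datetime
    reopened_at: Optional[datetime] = None


def drift_events(findings: Iterable[Finding], window_start: datetime, emitted_keys: Set[str]) -> List[dict]:
    """Emit baseline_high_drift only for findings created or reopened in the window."""
    events = []
    for f in findings:
        # Assumption: "in the window" means at or after window_start.
        if f.status == "new" and f.created_at >= window_start:
            cycle = f.created_at
        elif f.status == "reopened" and f.reopened_at is not None and f.reopened_at >= window_start:
            cycle = f.reopened_at
        else:
            continue  # unchanged open findings and resolved findings stay silent
        # Hypothetical key: the cycle timestamp lets a reopen alert again exactly once.
        key = f"baseline_high_drift:{f.fingerprint}:{cycle.isoformat()}"
        if key in emitted_keys:
            continue
        emitted_keys.add(key)
        events.append({"type": "baseline_high_drift", "fingerprint": f.fingerprint, "dedupe_key": key})
    return events
```

Evaluating the same findings twice against the same `emitted_keys` set yields events on the first call and none on the second.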

User Story 3 - Workspace-controlled severity mapping and alert threshold (Priority: P2)

As a workspace admin, I want to configure how baseline drift categories map to severity, and optionally the minimum severity that triggers baseline drift alerts, so the system matches the MSP’s operational standards.

Why this priority: Settings are required for enterprise adoption; hardcoded severity and alert thresholds don’t fit different environments.

Independent Test: Can be fully tested by setting workspace overrides and verifying that newly created baseline findings inherit the configured severity and that alert generation respects the configured minimum severity.

Acceptance Scenarios:

Given a workspace-level baseline severity mapping override, When a new baseline drift finding is created, Then it uses the mapped severity (and rejects invalid severity values).
Given a workspace-level baseline alert minimum severity override, When baseline findings are evaluated for alerts, Then only findings meeting or exceeding that threshold emit baseline_high_drift events.
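A short Python sketch of how the mapping override and the threshold might interact. The default mapping values below are hypothetical (this spec does not state the system defaults), and the function names are illustrative.

```python
from typing import Dict

# Canonical ordering as stated in this spec: low < medium < high < critical.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]

# Hypothetical system defaults; the real defaults are not given in this spec.
DEFAULT_MAPPING: Dict[str, str] = {
    "missing_policy": "high",
    "different_version": "medium",
    "unexpected_policy": "low",
}


def severity_for(change_type: str, overrides: Dict[str, str]) -> str:
    """Effective severity: system defaults with workspace overrides applied on top."""
    effective = {**DEFAULT_MAPPING, **overrides}
    return effective[change_type]


def meets_threshold(severity: str, minimum: str = "high") -> bool:
    """Inclusive comparison: a severity equal to the minimum qualifies."""
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(minimum)
```

For example, with an override of `{"different_version": "critical"}`, a `different_version` finding would clear the default `high` threshold, while the hypothetical default `medium` would not.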

Edge Cases

Compare is not “fully successful” (e.g., summary_counts.failed > 0, or incomplete processing where summary_counts.processed != summary_counts.total), or no run is created because compare preconditions fail: auto-close MUST NOT occur.
Compare does not evaluate all assigned items (e.g., missing baseline snapshot or assignment changes mid-run): auto-close MUST NOT resolve findings for items not evaluated.
A baseline finding was resolved previously and reappears later: it MUST transition to reopened (an actionable open state) and be eligible for alerting exactly once for that reopen.
Workspace settings payload is malformed (unknown drift categories or invalid severity values): save MUST be rejected and effective values MUST remain unchanged.
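The reappearance edge case above amounts to a small status transition keyed on the finding's current status. A minimal Python sketch, assuming `None` represents a fingerprint with no existing finding and that `new` is the initial status (both are assumptions of this example):

```python
from typing import Optional


def next_status(current: Optional[str]) -> str:
    """Status for a fingerprint that is present in the current compare result."""
    if current is None:
        return "new"  # first time this fingerprint is seen
    if current == "resolved":
        return "reopened"  # reappearance of a previously resolved finding
    return current  # already open: no transition, so no new alert
```

Because an already-open finding keeps its status, repeated compares that see the same drift do not create new actionable work.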

Requirements (mandatory)

Constitution alignment (required): If this feature introduces any Microsoft Graph calls, any write/change behavior, or any long-running/queued/scheduled work, the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, run observability (OperationRun type/identity/visibility), and tests. If security-relevant DB-only actions intentionally skip OperationRun, the spec MUST describe AuditLog entries.

Constitution alignment (OPS-UX): If this feature creates/reuses an OperationRun, the spec MUST:

explicitly state compliance with the Ops-UX 3-surface feedback contract (toast intent-only, progress surfaces, terminal DB notification),
state that OperationRun.status / OperationRun.outcome transitions are service-owned (only via OperationRunService),
describe how summary_counts keys/values comply with OperationSummaryKeys::all() and numeric-only rules,
clarify scheduled/system-run behavior (initiator null → no terminal DB notification; audit is via Monitoring),
list which regression guard tests are added/updated to keep these rules enforceable in CI.

Constitution alignment (RBAC-UX): If this feature introduces or changes authorization behavior, the spec MUST:

state which authorization plane(s) are involved (tenant/admin plane under /admin, including tenant-context routes under /admin/t/{tenant}/..., vs. platform plane under /system),
ensure any cross-plane access is deny-as-not-found (404),
explicitly define 404 vs 403 semantics:
- non-member / not entitled to workspace scope OR tenant scope → 404 (deny-as-not-found)
- member but missing capability → 403
describe how authorization is enforced server-side (Gates/Policies) for every mutation/operation-start/credential change,
reference the canonical capability registry (no raw capability strings; no role-string checks in feature code),
ensure global search is tenant-scoped and non-member-safe (no hints; inaccessible results treated as 404 semantics),
ensure destructive-like actions require confirmation (->requiresConfirmation()),
include at least one positive and one negative authorization test, and note any RBAC regression tests added/updated.

Constitution alignment (OPS-EX-AUTH-001): OIDC/SAML login handshakes may perform synchronous outbound HTTP (e.g., token exchange) on /auth/* endpoints without an OperationRun. This MUST NOT be used for Monitoring/Operations pages.

Constitution alignment (BADGE-001): If this feature changes status-like badges (status/outcome/severity/risk/availability/boolean), the spec MUST describe how badge semantics stay centralized (no ad-hoc mappings) and which tests cover any new/changed values.

Constitution alignment (Filament Action Surfaces): If this feature adds or modifies any Filament Resource / RelationManager / Page, the spec MUST include a “UI Action Matrix” (see below) and explicitly state whether the Action Surface Contract is satisfied. If the contract is not satisfied, the spec MUST include an explicit exemption with rationale.

Constitution alignment (UX-001 - Layout & Information Architecture): If this feature adds or modifies any Filament screen, the spec MUST describe compliance with UX-001: Create/Edit uses Main/Aside layout (3-col grid), all fields inside Sections/Cards (no naked inputs), View pages use Infolists (not disabled edit forms), status badges use BADGE-001, empty states have a specific title + explanation + exactly 1 CTA, and tables provide search/sort/filters for core dimensions. If UX-001 is not fully satisfied, the spec MUST include an explicit exemption with documented rationale.

Functional Requirements

FR-001 (Finding source contract): Findings created by baseline compare MUST be identifiable as baseline compare findings via a stable source identifier (source = baseline.compare).
FR-002 (Fully successful guardrail): Auto-close MUST run only after a fully successful baseline compare run. “Fully successful” means all of the following:
- The compare run outcome is succeeded.
- The compare run emits Ops-UX canonical completeness counters in OperationRun.summary_counts, and they indicate:
  - summary_counts.failed == 0
  - summary_counts.processed == summary_counts.total
- Compare preconditions (e.g., missing active baseline snapshot, missing assignment) are enforced before enqueue and MUST prevent a compare run from being created; therefore auto-close cannot run when preconditions fail.
- Implementation note (required invariant): precondition failures are returned as stable reason codes and MUST result in no OperationRun being created.
FR-003 (Safe auto-close behavior): After a fully successful compare, the system MUST resolve open baseline compare findings that are not present in the current run’s seen set.
FR-004 (No partial resolution): The system MUST NOT resolve findings for any items that were not evaluated in the run.
FR-005 (Resolution reason): Auto-resolved findings MUST record resolution reason no_longer_drifting.
FR-006 (Reopen semantics): If a previously resolved baseline compare finding reappears in a later compare, it MUST transition to an actionable open state (e.g., reopened) and be treated as “new actionable work” for alerting.
- The required open state for reappearance is reopened.
FR-007 (Alert event type: baseline drift): The system MUST support an alert event type baseline_high_drift for baseline compare findings.
FR-008 (Alert producer: baseline drift): Baseline drift alert events MUST be produced only for baseline compare findings that are actionable (open states) AND are either newly created or newly reopened within the evaluation window.
FR-009 (Baseline drift deduplication): Baseline drift alert events MUST be deduplicated by a stable key derived from the finding fingerprint. The same open finding MUST NOT emit repeated events on subsequent compares.
FR-010 (Alert event type: compare failed): The system MUST support an alert event type baseline_compare_failed.
FR-011 (Alert producer: compare failed): A baseline_compare_failed event MUST be produced when a baseline compare run completes with outcome failed or partially_succeeded.
FR-012 (Compare-failed dedup + cooldown): Compare-failed events MUST be deduplicated per run identity and MUST respect existing cooldown/quiet-hours behavior.
- This feature MUST NOT introduce a baseline-specific cooldown interval; it reuses the existing dispatcher cooldown behavior.
FR-013 (Canonical run types): Baseline capture and baseline compare MUST use centrally defined canonical run types (baseline_capture, baseline_compare) and MUST NOT rely on ad-hoc string literals.
FR-014 (Workspace settings: severity mapping): The system MUST support a workspace setting baseline.severity_mapping that maps baseline drift categories to severity.
FR-015 (Workspace settings: validation): The severity mapping MUST:
- accept only the baseline drift change_type keys missing_policy, different_version, and unexpected_policy (no other keys),
- reject invalid severity values,
- and expose “effective value” behavior: the effective mapping is the system defaults with workspace overrides applied on top.
FR-016 (Workspace settings: alert threshold): The system MUST support a workspace setting baseline.alert_min_severity with allowed values low/medium/high/critical and default high.
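- Severity threshold comparison MUST use the canonical severity ordering low < medium < high < critical, and the threshold is inclusive: a finding qualifies when its severity is equal to or higher than the configured minimum.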
FR-017 (Workspace settings: auto-close toggle): The system MUST support a workspace setting baseline.auto_close_enabled defaulting to true.
- When set to false, auto-close MUST be skipped even if the compare is fully successful.
FR-018 (Information architecture / ownership): Baseline Profile CRUD MUST remain workspace-owned. It MUST NOT appear tenant-scoped, MUST NOT show tenant scope banners, and MUST NOT be reachable from tenant-only navigation.
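The validation rules in FR-015 can be sketched as follows. This Python example is illustrative: the function names and error messages are hypothetical, and it assumes the mapping accepts the same four severity values that FR-016 lists for baseline.alert_min_severity.

```python
from typing import Any, Dict, List

ALLOWED_KEYS = {"missing_policy", "different_version", "unexpected_policy"}
# Assumption: the mapping accepts the same severities as baseline.alert_min_severity.
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def validate_severity_mapping(payload: Any) -> List[str]:
    """Return validation errors; an empty list means the payload may be saved."""
    if not isinstance(payload, dict):
        return ["baseline.severity_mapping must be an object"]
    errors = []
    for key, value in payload.items():
        if key not in ALLOWED_KEYS:
            errors.append(f"unknown change_type: {key!r}")
        elif not isinstance(value, str) or value not in ALLOWED_SEVERITIES:
            errors.append(f"invalid severity for {key}: {value!r}")
    return errors


def save_mapping(stored: Dict[str, str], payload: Any) -> Dict[str, str]:
    """Return the new override, or the unchanged stored one if validation fails."""
    if validate_severity_mapping(payload):
        return stored  # rejected: effective values remain unchanged
    return dict(payload)
```

Rejecting the whole payload on any error, rather than saving the valid subset, matches the requirement that a malformed save leaves effective values unchanged.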

Assumptions & Dependencies

Baseline compare already produces stable finding fingerprints and a per-run Ops-UX summary_counts payload that can express completeness (processed vs total) and failures (failed).
Findings support lifecycle transitions including resolve with a reason and reopen semantics for a recurring fingerprint.
Alert dispatch already supports deduplication, cooldown, and quiet hours; this feature reuses that behavior for new baseline-specific event types.
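Because cooldown and quiet hours are delegated to the existing dispatcher, the compare-failed producer only needs to decide whether a run qualifies and to attach a per-run dedup key. A minimal Python sketch, where the function name, the key format, and the `emitted_keys` set (a stand-in for the dispatcher's dedup store) are assumptions:

```python
from typing import Optional, Set

FAILED_OUTCOMES = {"failed", "partially_succeeded"}


def compare_failed_event(run_id: str, run_type: str, outcome: str, emitted_keys: Set[str]) -> Optional[dict]:
    """Return a baseline_compare_failed event for a qualifying run, at most once per run."""
    if run_type != "baseline_compare" or outcome not in FAILED_OUTCOMES:
        return None
    key = f"baseline_compare_failed:{run_id}"  # hypothetical per-run dedup key
    if key in emitted_keys:
        return None  # this run has already produced its event
    emitted_keys.add(key)
    # Cooldown and quiet hours are intentionally not handled here.
    return {"type": "baseline_compare_failed", "run_id": run_id, "dedupe_key": key}
```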

Constitution Alignment Notes (non-functional but mandatory)

This feature adds no new Microsoft Graph calls.
Baseline compare and alert evaluation are long-running operations; any new auto-close and alert integration MUST preserve tenant isolation and run observability.
Ops-UX (3-surface feedback): baseline compare/capture and alerts evaluation must continue to provide:
- an intent-only toast on start,
- progress surfaces (Operations pages),
- and terminal DB notifications where applicable.
Operation run ownership: Operation status/outcome transitions are owned by the operations subsystem and must not be mutated directly by UI code.
Summary counts contract: Any summary counters produced/updated by this feature MUST use the canonical summary key registry and numeric-only values.
Scheduled/system runs: Runs initiated without a human initiator MUST NOT produce terminal DB notifications; monitoring remains via Operations/Alerts.
RBAC-UX: Authorization planes involved:
- Workspace management plane (admin, workspace-owned baselines + workspace settings)
- Tenant-context plane (baseline compare monitoring)
Cross-plane access MUST be deny-as-not-found (404).
- Non-member or not entitled to the workspace scope or tenant scope → 404 (deny-as-not-found)
- Member but missing capability for the surface → 403
BADGE-001: Any new baseline severity mapping must remain centralized (single mapping source) and covered by tests.
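The 404 vs 403 semantics in the RBAC-UX note above depend on the order of checks: scope entitlement is tested before capability, so a non-member never learns that a surface exists. A Python sketch of that ordering, with a hypothetical `authorize` function and boolean inputs standing in for the real Gates/Policies:

```python
def authorize(is_workspace_member: bool, has_tenant_access: bool,
              has_capability: bool, tenant_scoped: bool) -> int:
    """Return the HTTP status that the described semantics would produce."""
    if not is_workspace_member:
        return 404  # not entitled to the workspace scope: deny-as-not-found
    if tenant_scoped and not has_tenant_access:
        return 404  # tenant surfaces also require tenant access
    if not has_capability:
        return 403  # member, but missing the capability for this surface
    return 200
```

A workspace member without tenant access therefore gets 404 on a tenant-context surface even if they hold the capability, while the same member without the capability gets 403 on a workspace surface.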

UI Action Matrix (mandatory when Filament is changed)

Surface	Location	Header Actions	Inspect Affordance (List/Table)	Row Actions (max 2 visible)	Bulk Actions (grouped)	Empty-State CTA(s)	View Header Actions	Create/Edit Save+Cancel	Audit log?	Notes / Exemptions
Resource	Admin → Governance → Baselines (workspace management)	Create baseline profile (capability-gated)	View action	Edit (capability-gated), Delete/Archive (if present)	None	Create baseline profile	Capture / Compare shortcuts (if present), Edit	Save + Cancel	Yes	Workspace-owned; MUST NOT show tenant scope banner; must not appear in tenant nav.
Page	Admin → Settings → Workspace settings	Save (capability-gated)	N/A	N/A	N/A	N/A	N/A	Save + Cancel (or equivalent)	Yes	Adds baseline settings fields; validation must reject malformed mapping.
Page	Admin → Tenant context → Governance → Baseline Compare	Compare now (capability-gated)	Link to findings / operation run details	N/A	N/A	N/A	N/A	N/A	Yes	Tenant-context monitoring surface; must not expose workspace management actions.

Key Entities (include if feature involves data)

Baseline Compare Finding: A finding produced by a baseline compare run, identified by source = baseline.compare and a stable fingerprint.
Baseline Compare Run: A run that evaluates tenant configuration against a baseline profile and produces a compare summary that can indicate completeness.
Alert Event: A deduplicated, rule-dispatchable representation of actionable baseline drift or baseline compare failure.
Workspace Baseline Settings: Workspace-specific overrides for severity mapping, alert threshold, and auto-close enablement.

Success Criteria (mandatory)

Measurable Outcomes

SC-001 (Noise reduction): In a controlled test scenario where drift disappears, 100% of baseline drift findings created by baseline compare auto-resolve after the first fully successful compare.
SC-002 (Safety): In scenarios where compare is failed/partially_succeeded/incomplete, 0 baseline findings are auto-resolved.
SC-003 (Alert dedupe): The same open baseline drift finding does not generate more than 1 baseline_high_drift alert event per open/reopen cycle.
SC-004 (Timeliness): Baseline compare failures generate a baseline_compare_failed alert event within the next alert evaluation cycle.
SC-005 (Configurability): A workspace admin can change baseline severity mapping and minimum alert severity in under 2 minutes, and newly generated findings reflect the change.
