Implements Spec 115 (Baseline Operability & Alert Integration).

Key changes
- Baseline compare: safe auto-close of stale baseline findings (gated on successful/complete compares)
- Baseline alerts: `baseline_high_drift` + `baseline_compare_failed` with dedupe/cooldown semantics
- Workspace settings: baseline severity mapping + minimum severity threshold + auto-close toggle
- Baseline Compare UX: shared stats layer + landing/widget consistency

Notes
- Livewire v4 / Filament v5 compatible.
- Destructive-like actions require confirmation (no new destructive actions added here).

Tests
- `vendor/bin/sail artisan test --compact tests/Feature/Baselines/ tests/Feature/Alerts/`

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #140

# Feature Specification: Baseline Operability & Alert Integration (R1.1–R1.4 Extension)

**Feature Branch**: `115-baseline-operability-alerts`
**Created**: 2026-02-28
**Status**: Ready (Design complete; implementation pending)
**Input**: User description: "115 — Baseline Operability & Alert Integration (R1.1–R1.4 Extension)"

## Spec Scope Fields *(mandatory)*

- **Scope**: workspace (management) + tenant (monitoring)
- **Primary Routes**:
  - Workspace (admin): Baselines management (Baseline Profiles) and Workspace Settings
  - Tenant-context (admin): Baseline Compare monitoring and Baseline Compare “run now” surface
- **Data Ownership**:
  - Workspace-owned: Baseline profiles, baseline-to-tenant assignments, workspace settings, alert rules
  - Tenant-scoped (within a workspace): Findings produced by baseline compare; operation runs for tenant compares
- **RBAC**:
  - Workspace Baselines (view/manage): workspace members must be granted the workspace baselines view/manage capabilities (from the canonical capability registry)
  - Workspace Settings (view/manage): workspace members must be granted the workspace settings view/manage capabilities (from the canonical capability registry)
  - Alerts (view/manage): workspace members must be granted the alerts view/manage capabilities (from the canonical capability registry)
  - Tenant monitoring surfaces require tenant access (tenant view) in addition to workspace membership

For canonical-view specs: not applicable (this is not a canonical-view feature).

## Clarifications

### Session 2026-02-28

- Q: What should `baseline.severity_mapping` map from? → A: Baseline drift `change_type` only (keys: `missing_policy`, `different_version`, `unexpected_policy`).
- Q: What is the canonical “fully successful compare” gate for auto-close? → A: Outcome `succeeded` AND Ops-UX canonical `OperationRun.summary_counts` gate: `summary_counts.processed == summary_counts.total` AND `summary_counts.failed == 0`.
- Q: When a previously resolved baseline finding reappears, what status should it transition to? → A: `reopened`.
- Q: For `baseline_compare_failed` alerts, what cooldown behavior applies? → A: Use the existing dispatcher cooldown (no baseline-specific cooldown setting).
- Q: Should `baseline.auto_close_enabled` exist as a kill-switch? → A: Yes; keep it with default `true`.

### Definitions

- **Baseline finding**: A drift finding produced by comparing a tenant against a baseline.
- **Fingerprint**: A stable identifier for “the same underlying issue” across runs, used for idempotency and alert deduplication.
- **Fully successful compare**: A compare run that succeeded and is complete (no failed items and all expected items processed).
  - Completeness is proven via Ops-UX canonical `OperationRun.summary_counts` counters: `summary_counts.processed == summary_counts.total` and `summary_counts.failed == 0`.

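The "fully successful compare" gate defined above can be sketched as a single predicate. This is a minimal, language-neutral illustration in Python, not the project's implementation: the function name `is_fully_successful` is hypothetical, and treating a run with missing counters as not fully successful is an assumption the spec does not state.

```python
from typing import Mapping


def is_fully_successful(outcome: str, summary_counts: Mapping[str, int]) -> bool:
    """Return True only when the run succeeded and its counters prove completeness."""
    # Assumption: a run that lacks any of the three counters cannot prove
    # completeness, so it is treated as not fully successful.
    if not all(key in summary_counts for key in ("total", "processed", "failed")):
        return False
    return (
        outcome == "succeeded"
        and summary_counts["processed"] == summary_counts["total"]
        and summary_counts["failed"] == 0
    )
```

A run with outcome `partially_succeeded` fails the gate even if its counters look complete, because both conditions are required.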
## User Scenarios & Testing *(mandatory)*

### User Story 1 - Safe auto-close removes stale baseline drift (Priority: P1)

As an MSP operator, I want baseline drift findings to automatically resolve when drift is no longer present, so the findings list remains actionable and doesn’t accumulate noise.

**Why this priority**: This is the main “operability” gap; without auto-close, drift remediation cannot be reliably observed, and alerting becomes noisy.

**Independent Test**: Can be fully tested by running a baseline compare that produces a “seen set” of fingerprints and verifying that previously-open baseline findings not present in the seen set are resolved only when the compare is fully successful.

**Acceptance Scenarios**:

1. **Given** an open baseline finding with `source = baseline.compare`, **When** a baseline compare completes fully successfully and that finding’s fingerprint is not present in the current compare result, **Then** the finding becomes `resolved` with reason `no_longer_drifting`.
2. **Given** an open baseline finding with `source = baseline.compare`, **When** a baseline compare is `partially_succeeded` or failed (or incomplete), **Then** no baseline findings are auto-resolved.

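The two acceptance scenarios above can be illustrated with a small seen-set sketch. It is a hedged Python illustration rather than the project's code: the `Finding` shape, the open-status names (`new`, `open`, `reopened`), and the function `auto_close_stale` are assumptions made for the example. It also assumes that a fully successful run evaluated every assigned item, so the seen set is authoritative.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Assumed names for actionable (open) statuses.
OPEN_STATUSES = {"new", "open", "reopened"}


@dataclass
class Finding:
    fingerprint: str
    source: str
    status: str
    resolved_reason: Optional[str] = None


def auto_close_stale(
    findings: Iterable[Finding],
    seen_fingerprints: Set[str],
    fully_successful: bool,
    auto_close_enabled: bool = True,
) -> List[str]:
    """Resolve open baseline findings absent from the seen set; return their fingerprints."""
    # Guardrail: never resolve anything unless the compare was fully successful
    # and the workspace kill-switch allows auto-close.
    if not (fully_successful and auto_close_enabled):
        return []
    resolved = []
    for finding in findings:
        if finding.source != "baseline.compare":
            continue  # other finding sources are out of scope
        if finding.status not in OPEN_STATUSES:
            continue  # already resolved or otherwise closed
        if finding.fingerprint in seen_fingerprints:
            continue  # drift is still present
        finding.status = "resolved"
        finding.resolved_reason = "no_longer_drifting"
        resolved.append(finding.fingerprint)
    return resolved
```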
---

### User Story 2 - Baseline alerts are precise and deduplicated (Priority: P1)

As an on-call operator, I want alerts for baseline drift and baseline compare failures to trigger only when there is new actionable work (new or reopened findings) or when a compare fails, so I don’t get spammed on every run.

**Why this priority**: MSP sellability depends on trust in alerts; repeated “same problem” alerts make alerting unusable.

**Independent Test**: Can be fully tested by creating findings with controlled timestamps and statuses (new/reopened/open) and verifying that only new/reopened findings generate baseline drift alert events, while repeated compares do not.

**Acceptance Scenarios**:

1. **Given** a new high-severity baseline drift finding (deduped by fingerprint) created after the alert window start, **When** alerts are evaluated, **Then** a `baseline_high_drift` event is produced exactly once.
2. **Given** a resolved baseline drift finding that later reappears and transitions to `reopened`, **When** alerts are evaluated, **Then** a `baseline_high_drift` event is produced again exactly once.
3. **Given** a baseline compare run that completes with outcome failed or `partially_succeeded`, **When** alerts are evaluated, **Then** a `baseline_compare_failed` event is produced (deduped by run identity + cooldown).

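As a rough illustration of the first two scenarios, the selection of drift alert events for one evaluation pass might look like the sketch below. The `Finding` fields (including `reopened_at`), the event dictionary shape, and the dedupe key format are all hypothetical. Severity filtering is left out here, and deduplication across evaluation passes is assumed to stay with the existing dispatcher.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Finding:
    fingerprint: str
    source: str
    status: str  # assumed values: "new", "open", "reopened", "resolved"
    created_at: datetime
    reopened_at: Optional[datetime] = None


def drift_alert_events(
    findings: Iterable[Finding], window_start: datetime
) -> List[Dict[str, str]]:
    """Emit one baseline_high_drift event per new or reopened finding in the window."""
    events = []
    emitted_keys = set()
    for finding in findings:
        if finding.source != "baseline.compare":
            continue
        if finding.status == "new":
            became_actionable_at = finding.created_at
        elif finding.status == "reopened":
            became_actionable_at = finding.reopened_at
        else:
            continue  # plain "open" findings were alerted on earlier
        if became_actionable_at is None or became_actionable_at < window_start:
            continue  # outside the evaluation window
        # Hypothetical key format; the spec only requires it to derive from the fingerprint.
        dedupe_key = f"baseline_high_drift:{finding.fingerprint}"
        if dedupe_key in emitted_keys:
            continue
        emitted_keys.add(dedupe_key)
        events.append({"event_type": "baseline_high_drift", "dedupe_key": dedupe_key})
    return events
```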
---

### User Story 3 - Workspace-controlled severity mapping and alert threshold (Priority: P2)

As a workspace admin, I want to configure how baseline drift categories map to severity, and optionally the minimum severity that triggers baseline drift alerts, so the system matches the MSP’s operational standards.

**Why this priority**: Settings are required for enterprise adoption; hardcoded severity and alert thresholds don’t fit different environments.

**Independent Test**: Can be fully tested by setting workspace overrides and verifying that newly created baseline findings inherit the configured severity and that alert generation respects the configured minimum severity.

**Acceptance Scenarios**:

1. **Given** a workspace-level baseline severity mapping override, **When** a new baseline drift finding is created, **Then** it uses the mapped severity (and rejects invalid severity values).
2. **Given** a workspace-level baseline alert minimum severity override, **When** baseline findings are evaluated for alerts, **Then** only findings meeting or exceeding that threshold emit `baseline_high_drift` events.

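A small sketch of how a mapped severity and the inclusive threshold check could fit together, using the ordering `low < medium < high < critical` and the default threshold `high` from this spec. The helper names and the `example_mapping` values are illustrative assumptions; the spec does not list the system default mapping.

```python
from typing import Mapping

# Canonical severity ordering, lowest to highest.
SEVERITY_ORDER = ("low", "medium", "high", "critical")


def severity_for(change_type: str, severity_mapping: Mapping[str, str]) -> str:
    """Look up the severity configured for a baseline drift change_type."""
    return severity_mapping[change_type]


def meets_threshold(severity: str, min_severity: str = "high") -> bool:
    """Inclusive comparison: a finding at exactly the threshold qualifies."""
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(min_severity)


# Illustrative workspace override only, not the system defaults.
example_mapping = {
    "missing_policy": "high",
    "different_version": "medium",
    "unexpected_policy": "low",
}
```

With this example mapping, a `missing_policy` finding would pass the default threshold while a `different_version` finding would not, unless the workspace lowers `baseline.alert_min_severity` to `medium`.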
### Edge Cases

- Compare completes but is not “fully successful” (e.g., `summary_counts.failed > 0`, incomplete processing where `summary_counts.processed != summary_counts.total`, or compare preconditions prevent the run from being created): auto-close MUST NOT occur.
- Compare does not evaluate all assigned items (e.g., missing baseline snapshot or assignment changes mid-run): auto-close MUST NOT resolve findings for items not evaluated.
- A baseline finding was resolved previously and reappears later: it MUST transition to the actionable open state `reopened` and be eligible for alerting once.
- Workspace settings payload is malformed (unknown drift categories or invalid severity values): save MUST be rejected and effective values MUST remain unchanged.

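The reappearance edge case can be pictured as a tiny state transition applied whenever a compare sees a fingerprint. This is only a sketch: the function `apply_sighting` is hypothetical, and using `new` as the status of a first-time finding is an assumption.

```python
from typing import Optional, Tuple


def apply_sighting(current_status: Optional[str]) -> Tuple[str, bool]:
    """Return (next_status, is_new_actionable_work) for a fingerprint seen in a compare."""
    if current_status is None:
        return "new", True  # first time this fingerprint is seen
    if current_status == "resolved":
        return "reopened", True  # drift came back after resolution
    # Already open in some form: keep the status and do not alert again.
    return current_status, False
```

Because a finding that is already `reopened` stays `reopened` on later compares with no new actionable work flagged, it is eligible for alerting once per reopen.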
## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any Microsoft Graph calls, any write/change behavior,
or any long-running/queued/scheduled work, the spec MUST describe contract registry updates, safety gates
(preview/confirmation/audit), tenant isolation, run observability (`OperationRun` type/identity/visibility), and tests.
If security-relevant DB-only actions intentionally skip `OperationRun`, the spec MUST describe `AuditLog` entries.

**Constitution alignment (OPS-UX):** If this feature creates/reuses an `OperationRun`, the spec MUST:
- explicitly state compliance with the Ops-UX 3-surface feedback contract (toast intent-only, progress surfaces, terminal DB notification),
- state that `OperationRun.status` / `OperationRun.outcome` transitions are service-owned (only via `OperationRunService`),
- describe how `summary_counts` keys/values comply with `OperationSummaryKeys::all()` and numeric-only rules,
- clarify scheduled/system-run behavior (initiator null → no terminal DB notification; audit is via Monitoring),
- list which regression guard tests are added/updated to keep these rules enforceable in CI.

**Constitution alignment (RBAC-UX):** If this feature introduces or changes authorization behavior, the spec MUST:
- state which authorization plane(s) are involved (tenant/admin `/admin` + tenant-context `/admin/t/{tenant}/...` vs platform `/system`),
- ensure any cross-plane access is deny-as-not-found (404),
- explicitly define 404 vs 403 semantics:
  - non-member / not entitled to workspace scope OR tenant scope → 404 (deny-as-not-found)
  - member but missing capability → 403
- describe how authorization is enforced server-side (Gates/Policies) for every mutation/operation-start/credential change,
- reference the canonical capability registry (no raw capability strings; no role-string checks in feature code),
- ensure global search is tenant-scoped and non-member-safe (no hints; inaccessible results treated as 404 semantics),
- ensure destructive-like actions require confirmation (`->requiresConfirmation()`),
- include at least one positive and one negative authorization test, and note any RBAC regression tests added/updated.

**Constitution alignment (OPS-EX-AUTH-001):** OIDC/SAML login handshakes may perform synchronous outbound HTTP (e.g., token exchange)
on `/auth/*` endpoints without an `OperationRun`. This MUST NOT be used for Monitoring/Operations pages.

**Constitution alignment (BADGE-001):** If this feature changes status-like badges (status/outcome/severity/risk/availability/boolean),
the spec MUST describe how badge semantics stay centralized (no ad-hoc mappings) and which tests cover any new/changed values.

**Constitution alignment (Filament Action Surfaces):** If this feature adds or modifies any Filament Resource / RelationManager / Page,
the spec MUST include a “UI Action Matrix” (see below) and explicitly state whether the Action Surface Contract is satisfied.
If the contract is not satisfied, the spec MUST include an explicit exemption with rationale.

**Constitution alignment (UX-001 — Layout & Information Architecture):** If this feature adds or modifies any Filament screen,
the spec MUST describe compliance with UX-001: Create/Edit uses Main/Aside layout (3-col grid), all fields inside Sections/Cards
(no naked inputs), View pages use Infolists (not disabled edit forms), status badges use BADGE-001, empty states have a specific
title + explanation + exactly 1 CTA, and tables provide search/sort/filters for core dimensions.
If UX-001 is not fully satisfied, the spec MUST include an explicit exemption with documented rationale.

### Functional Requirements

- **FR-001 (Finding source contract)**: Findings created by baseline compare MUST be identifiable as baseline compare findings via a stable source identifier (`source = baseline.compare`).

- **FR-002 (Fully successful guardrail)**: Auto-close MUST run only after a fully successful baseline compare run. “Fully successful” means all of the following:
  - The compare run outcome is `succeeded`.
  - The compare run emits Ops-UX canonical completeness counters in `OperationRun.summary_counts`, and they indicate:
    - `summary_counts.failed == 0`
    - `summary_counts.processed == summary_counts.total`
  - Compare preconditions (e.g., missing active baseline snapshot, missing assignment) are enforced before enqueue and MUST prevent a compare run from being created; therefore auto-close cannot run when preconditions fail.
    - **Implementation note (required invariant)**: precondition failures are returned as stable reason codes and MUST result in **no `OperationRun` being created**.

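The precondition invariant in FR-002 can be sketched as a start function that checks before it creates anything. Everything named here is hypothetical: the reason-code strings, the `StartResult` shape, the `queued` status, the order of checks, and the `runs` list that stands in for `OperationRun` storage.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StartResult:
    run: Optional[Dict[str, str]]
    reason_code: Optional[str]


def start_baseline_compare(
    has_assignment: bool,
    has_active_snapshot: bool,
    runs: List[Dict[str, str]],
) -> StartResult:
    """Check preconditions first; create a run record only if all of them pass."""
    # Hypothetical reason codes; the spec only requires that they are stable.
    if not has_assignment:
        return StartResult(None, "baseline.compare.no_assignment")
    if not has_active_snapshot:
        return StartResult(None, "baseline.compare.no_active_snapshot")
    run = {"type": "baseline_compare", "status": "queued"}
    runs.append(run)  # the only place a run is ever created
    return StartResult(run, None)
```

Since no run exists after a precondition failure, there is nothing for auto-close or the compare-failed alert producer to act on.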
- **FR-003 (Safe auto-close behavior)**: After a fully successful compare, the system MUST resolve open baseline compare findings that are not present in the current run’s seen set.
- **FR-004 (No partial resolution)**: The system MUST NOT resolve findings for any items that were not evaluated in the run.
- **FR-005 (Resolution reason)**: Auto-resolved findings MUST record resolution reason `no_longer_drifting`.

- **FR-006 (Reopen semantics)**: If a previously resolved baseline compare finding reappears in a later compare, it MUST transition to an actionable open state (e.g., `reopened`) and be treated as “new actionable work” for alerting.
  - The required open state for reappearance is `reopened`.

- **FR-007 (Alert event type: baseline drift)**: The system MUST support an alert event type `baseline_high_drift` for baseline compare findings.
- **FR-008 (Alert producer: baseline drift)**: Baseline drift alert events MUST be produced only for baseline compare findings that are actionable (open states) AND are either newly created or newly reopened within the evaluation window.
- **FR-009 (Baseline drift deduplication)**: Baseline drift alert events MUST be deduplicated by a stable key derived from the finding fingerprint. The same open finding MUST NOT emit repeated events on subsequent compares.

- **FR-010 (Alert event type: compare failed)**: The system MUST support an alert event type `baseline_compare_failed`.
- **FR-011 (Alert producer: compare failed)**: A `baseline_compare_failed` event MUST be produced when a baseline compare run completes with outcome failed or `partially_succeeded`.
- **FR-012 (Compare-failed dedup + cooldown)**: Compare-failed events MUST be deduplicated per run identity and MUST respect existing cooldown/quiet-hours behavior.
  - This feature MUST NOT introduce a baseline-specific cooldown interval; it reuses the existing dispatcher cooldown behavior.

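To make FR-011 and FR-012 concrete, here is a hedged sketch of a compare-failed producer that deduplicates per run identity. The run dictionary shape, the use of `id` as the run identity, and the key format are assumptions; cooldown and quiet hours are deliberately not modeled because the spec leaves them to the existing dispatcher.

```python
from typing import Dict, Iterable, List, Set

# Outcomes that count as a failed compare for alerting purposes (FR-011).
FAILURE_OUTCOMES = {"failed", "partially_succeeded"}


def compare_failed_events(
    runs: Iterable[Dict[str, str]], already_emitted: Set[str]
) -> List[Dict[str, str]]:
    """Emit at most one baseline_compare_failed event per failed compare run."""
    events = []
    for run in runs:
        if run["type"] != "baseline_compare" or run["outcome"] not in FAILURE_OUTCOMES:
            continue
        # Hypothetical key format: one key per run identity.
        dedupe_key = f"baseline_compare_failed:{run['id']}"
        if dedupe_key in already_emitted:
            continue  # this run was already reported
        already_emitted.add(dedupe_key)
        events.append({"event_type": "baseline_compare_failed", "dedupe_key": dedupe_key})
    return events
```

Evaluating the same runs a second time with the same `already_emitted` set yields no further events, which is the per-run deduplication the requirement asks for.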
- **FR-013 (Canonical run types)**: Baseline capture and baseline compare MUST use centrally defined canonical run types (`baseline_capture`, `baseline_compare`) and MUST NOT rely on ad-hoc string literals.

- **FR-014 (Workspace settings: severity mapping)**: The system MUST support a workspace setting `baseline.severity_mapping` that maps baseline drift categories to severity.
- **FR-015 (Workspace settings: validation)**: The severity mapping MUST:
  - accept only the baseline drift `change_type` keys `missing_policy`, `different_version`, and `unexpected_policy` (no other keys),
  - reject invalid severity values,
  - and expose “effective value” behavior (system defaults + workspace overrides).
- **FR-016 (Workspace settings: alert threshold)**: The system MUST support a workspace setting `baseline.alert_min_severity` with allowed values low/medium/high/critical and default high.
  - Severity threshold comparison MUST use the canonical severity ordering: `low < medium < high < critical` (inclusive).
- **FR-017 (Workspace settings: auto-close toggle)**: The system MUST support a workspace setting `baseline.auto_close_enabled` defaulting to true.
  - When set to `false`, auto-close MUST be skipped even if the compare is fully successful.

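The validation and “effective value” rules in FR-014 to FR-017 could be sketched as follows. This is an illustration under assumptions, not the settings implementation: the default severity mapping is a placeholder (the spec does not list it), per-key merging of mapping overrides onto defaults is assumed, and the mapping is assumed to accept the same severity values as `baseline.alert_min_severity`.

```python
from typing import Any, Dict

ALLOWED_CHANGE_TYPES = {"missing_policy", "different_version", "unexpected_policy"}
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}

DEFAULTS: Dict[str, Any] = {
    # Placeholder mapping: the spec does not list the system default severities.
    "baseline.severity_mapping": {
        "missing_policy": "high",
        "different_version": "medium",
        "unexpected_policy": "low",
    },
    "baseline.alert_min_severity": "high",  # default stated in FR-016
    "baseline.auto_close_enabled": True,  # default stated in FR-017
}


def validate_overrides(overrides: Dict[str, Any]) -> None:
    """Raise ValueError for malformed overrides so nothing is saved."""
    mapping = overrides.get("baseline.severity_mapping", {})
    if not isinstance(mapping, dict):
        raise ValueError("baseline.severity_mapping must be a mapping")
    unknown = set(mapping) - ALLOWED_CHANGE_TYPES
    if unknown:
        raise ValueError(f"unknown change_type keys: {sorted(unknown)}")
    for change_type, severity in mapping.items():
        if severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"invalid severity for {change_type}: {severity!r}")
    key = "baseline.alert_min_severity"
    if key in overrides and overrides[key] not in ALLOWED_SEVERITIES:
        raise ValueError(f"invalid {key}: {overrides[key]!r}")
    key = "baseline.auto_close_enabled"
    if key in overrides and not isinstance(overrides[key], bool):
        raise ValueError(f"{key} must be a boolean")


def effective_settings(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """System defaults with validated workspace overrides layered on top."""
    validate_overrides(overrides)
    effective = dict(DEFAULTS)
    effective["baseline.severity_mapping"] = {
        **DEFAULTS["baseline.severity_mapping"],
        **overrides.get("baseline.severity_mapping", {}),
    }
    for key in ("baseline.alert_min_severity", "baseline.auto_close_enabled"):
        if key in overrides:
            effective[key] = overrides[key]
    return effective
```

Because validation raises before any value is merged, a rejected payload leaves the previously effective values untouched.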
- **FR-018 (Information architecture / ownership)**: Baseline Profile CRUD MUST remain workspace-owned. It MUST NOT appear tenant-scoped, MUST NOT show tenant scope banners, and MUST NOT be reachable from tenant-only navigation.

#### Assumptions & Dependencies

- Baseline compare already produces stable finding fingerprints and a per-run Ops-UX `summary_counts` payload that can express completeness (`processed` vs `total`) and failures (`failed`).
- Findings support lifecycle transitions including resolve with a reason and reopen semantics for a recurring fingerprint.
- Alert dispatch already supports deduplication, cooldown, and quiet hours; this feature reuses that behavior for new baseline-specific event types.

#### Constitution Alignment Notes (non-functional but mandatory)

- This feature adds no new Microsoft Graph calls.
- Baseline compare and alert evaluation are long-running operations; any new auto-close and alert integration MUST preserve tenant isolation and run observability.

- **Ops-UX (3-surface feedback)**: baseline compare/capture and alerts evaluation must continue to provide:
  - an intent-only toast on start,
  - progress surfaces (Operations pages),
  - and terminal DB notifications where applicable.

- **Operation run ownership**: Operation status/outcome transitions are owned by the operations subsystem and must not be mutated directly by UI code.
- **Summary counts contract**: Any summary counters produced/updated by this feature MUST use the canonical summary key registry and numeric-only values.
- **Scheduled/system runs**: Runs initiated without a human initiator MUST NOT produce terminal DB notifications; monitoring remains via Operations/Alerts.

- **RBAC-UX**: Authorization planes involved:
  - Workspace management plane (admin, workspace-owned baselines + workspace settings)
  - Tenant-context plane (baseline compare monitoring)

  Cross-plane access MUST be deny-as-not-found (404).

  - Non-member or not entitled to the workspace scope or tenant scope → 404 (deny-as-not-found)
  - Member but missing capability for the surface → 403

- **BADGE-001**: Any new baseline severity mapping must remain centralized (single mapping source) and covered by tests.

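The 404 vs 403 semantics in the RBAC-UX note above reduce to an ordering of checks: entitlement first, capability second. The sketch below is a simplified illustration with hypothetical parameter names; returning `200` for an allowed request is an assumption made so the example has a complete result.

```python
def authorization_status(
    is_workspace_member: bool,
    has_tenant_access: bool,
    has_capability: bool,
    tenant_scoped_surface: bool,
) -> int:
    """Return the HTTP status implied by the deny-as-not-found rules."""
    # Not entitled to the workspace scope: hide the surface entirely.
    if not is_workspace_member:
        return 404
    # Tenant-context surfaces additionally require tenant access.
    if tenant_scoped_surface and not has_tenant_access:
        return 404
    # Entitled to the scope but lacking the capability: explicit forbidden.
    if not has_capability:
        return 403
    return 200
```

Checking entitlement before capability means a non-member never learns whether a surface exists, even if a capability check would also have failed.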
## UI Action Matrix *(mandatory when Filament is changed)*

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Resource | Admin → Governance → Baselines (workspace management) | Create baseline profile (capability-gated) | View action | Edit (capability-gated), Delete/Archive (if present) | None | Create baseline profile | Capture / Compare shortcuts (if present), Edit | Save + Cancel | Yes | Workspace-owned; MUST NOT show tenant scope banner; must not appear in tenant nav. |
| Page | Admin → Settings → Workspace settings | Save (capability-gated) | N/A | N/A | N/A | N/A | N/A | Save + Cancel (or equivalent) | Yes | Adds baseline settings fields; validation must reject malformed mapping. |
| Page | Admin → Tenant context → Governance → Baseline Compare | Compare now (capability-gated) | Link to findings / operation run details | N/A | N/A | N/A | N/A | N/A | Yes | Tenant-context monitoring surface; must not expose workspace management actions. |

### Key Entities *(include if feature involves data)*

- **Baseline Compare Finding**: A finding produced by a baseline compare run, identified by `source = baseline.compare` and a stable fingerprint.
- **Baseline Compare Run**: A run that evaluates tenant configuration against a baseline profile and produces a compare summary that can indicate completeness.
- **Alert Event**: A deduplicated, rule-dispatchable representation of actionable baseline drift or baseline compare failure.
- **Workspace Baseline Settings**: Workspace-specific overrides for severity mapping, alert threshold, and auto-close enablement.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001 (Noise reduction)**: In a controlled test scenario where drift disappears, 100% of baseline drift findings created by baseline compare auto-resolve after the first fully successful compare.
- **SC-002 (Safety)**: In scenarios where compare is failed/`partially_succeeded`/incomplete, 0 baseline findings are auto-resolved.
- **SC-003 (Alert dedupe)**: The same open baseline drift finding does not generate more than 1 `baseline_high_drift` alert event per open/reopen cycle.
- **SC-004 (Timeliness)**: Baseline compare failures generate a `baseline_compare_failed` alert event within the next alert evaluation cycle.
- **SC-005 (Configurability)**: A workspace admin can change baseline severity mapping and minimum alert severity in under 2 minutes, and newly generated findings reflect the change.