Implements spec `099-alerts-v1-teams-email`.

- Monitoring navigation: Alerts as a cluster under Monitoring; default landing is Alert deliveries.
- Tenant panel: Alerts points to `/admin/alerts` and the cluster navigation is hidden in tenant panel.
- Guard compliance: removes direct `Gate::` usage from Alert resources so `NoAdHocFilamentAuthPatternsTest` passes.

Verification:

- Full suite: `1348 passed, 7 skipped` (EXIT=0).

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #121

# Feature Specification: Alerts v1 (Teams + Email)

**Feature Branch**: `099-alerts-v1-teams-email`
**Created**: 2026-02-16
**Status**: Draft
**Input**: User description: "Alerts v1 (Microsoft Teams + Email)"

## Spec Scope Fields *(mandatory)*

- **Scope**: workspace
- **Primary Routes**:
  - Admin UI → Workspace → Monitoring → Alerts → Alert targets
  - Admin UI → Workspace → Monitoring → Alerts → Alert rules
  - Admin UI → Workspace → Monitoring → Alerts → Alert deliveries (read-only)
- **Data Ownership**:
  - Workspace-owned alert configuration (rules + destinations)
  - Tenant-owned alert delivery history (deliveries are always tenant-scoped)
  - Deliveries are surfaced via workspace-context canonical UI routes, but MUST only reveal deliveries for tenants the actor is entitled to
- **Authorization Planes**:
  - Admin panel `/admin` in **workspace-context** (workspace selected via session-based workspace context)
  - This feature does **not** introduce tenant-context UI routes (no `/admin/t/{tenant}/...` pages for Alerts v1)
- **RBAC**:
  - Workspace membership is required for any access (non-members are denied as not found / 404)
  - Viewing alert configuration/history requires `ALERTS_VIEW`
  - Creating/updating/enabling/disabling/deleting rules or destinations requires `ALERTS_MANAGE`
  - Members without `ALERTS_VIEW` receive 403 for view-only access attempts
  - Members without `ALERTS_MANAGE` receive 403 for mutation attempts; UI surfaces are disabled for them
  - Viewing deliveries additionally requires tenant entitlement for each delivery’s tenant (non-entitled tenants are filtered and treated as not found / 404 semantics)

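The RBAC rules above can be read as a small decision procedure. The following Python sketch is illustrative only: the `Actor` shape, `authorize`, and `visible_deliveries` are hypothetical names, not part of the feature's code, and capabilities are modeled for a single workspace to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Actor:
    # All three fields are assumptions made for this sketch.
    workspace_ids: frozenset        # workspaces the actor is a member of
    capabilities: frozenset         # capabilities held in the selected workspace
    entitled_tenant_ids: frozenset  # tenants the actor is entitled to see


def authorize(actor, workspace_id, required_capability):
    """Return the HTTP status the spec prescribes for an access attempt."""
    if workspace_id not in actor.workspace_ids:
        return 404  # non-member: deny as not found
    if required_capability not in actor.capabilities:
        return 403  # member, but missing the capability
    return 200


def visible_deliveries(actor, deliveries):
    """Drop deliveries whose tenant the actor is not entitled to (404 semantics)."""
    return [d for d in deliveries if d["tenant_id"] in actor.entitled_tenant_ids]
```

Checking membership before capability keeps non-members from learning whether a workspace exists, which is the point of the deny-as-not-found rule.
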
## Clarifications

### Session 2026-02-16

- Q: Should viewing the Alerts pages (Targets / Rules / Deliveries) require `ALERTS_VIEW`, or is workspace membership alone enough to view? → A: Viewing requires `ALERTS_VIEW` (members without it get 403).
- Q: When an event is suppressed due to cooldown/dedupe, should the system still create an entry in the delivery history (status = `suppressed`)? → A: Yes, create delivery history entries with `status=suppressed` (no send attempted).
- Q: Should `operation.compare_failed` fire only for the single canonical run type (baseline compare), or should v1 allow a per-rule run-type allowlist? → A: Fixed: only baseline compare failures (single canonical run type).
- Q: For quiet hours evaluation, what timezone should be used when a rule does not specify a timezone? → A: Fall back to the workspace timezone.
- Q: When a Teams/email delivery attempt fails, which retry policy should v1 use? → A: Retry with exponential backoff up to a max attempts limit; then mark `failed`.

### Assumptions & Dependencies

- Drift findings and operational run outcomes already exist in the system and can be evaluated for alert triggers.
- Events are attributable to a workspace and (where applicable) a tenant so rules can apply tenant scoping.
- SLA-due alerts only apply if the underlying finding data includes a due date; otherwise this trigger is a no-op.

## User Scenarios & Testing *(mandatory)*

<!--
  IMPORTANT: User stories should be PRIORITIZED as user journeys ordered by importance.
  Each user story/journey must be INDEPENDENTLY TESTABLE - meaning if you implement just ONE of them,
  you should still have a viable MVP (Minimum Viable Product) that delivers value.

  Assign priorities (P1, P2, P3, etc.) to each story, where P1 is the most critical.
  Think of each story as a standalone slice of functionality that can be:
  - Developed independently
  - Tested independently
  - Deployed independently
  - Demonstrated to users independently
-->

### User Story 1 - Configure alert destinations (Priority: P1)

As a workspace operator, I can define alert destinations (Microsoft Teams and/or email) that can later be reused by multiple alert rules.

**Why this priority**: Without destinations, alerts cannot be delivered; this is the smallest useful slice.

**Independent Test**: Create a destination, then confirm it is listed, viewable, and can be enabled/disabled.

**Acceptance Scenarios**:

1. **Given** I have `ALERTS_MANAGE`, **When** I create a Microsoft Teams destination with a name and webhook URL, **Then** the destination is saved and appears in the destinations list as enabled.
2. **Given** I have `ALERTS_MANAGE`, **When** I create an Email destination with one or more recipient addresses, **Then** the destination is saved and appears in the destinations list.
3. **Given** a destination exists, **When** I disable it, **Then** it is not used for future alert deliveries.

---

### User Story 2 - Configure alert routing rules (Priority: P2)

As a workspace manager, I can configure routing rules so that only relevant events (by type, severity, and tenant scope) generate alerts, and each rule can notify multiple destinations.

**Why this priority**: Rules provide control over noise, scope, and who gets notified.

**Independent Test**: Create a rule with at least one destination, then trigger one matching event and confirm exactly one delivery is queued per destination.

**Acceptance Scenarios**:

1. **Given** I have `ALERTS_MANAGE`, **When** I create a rule that matches a specific event type and minimum severity, **Then** the rule is saved and appears as enabled.
2. **Given** I configure a rule with tenant scope = allowlist, **When** an event from a non-allowlisted tenant occurs, **Then** no delivery is created for that rule.
3. **Given** a rule has multiple destinations assigned, **When** a matching event occurs, **Then** deliveries are created for each enabled destination.

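The acceptance scenarios for this story can be sketched as a single matching function that returns one planned delivery per enabled destination. This is a minimal Python illustration under stated assumptions: the `Rule` and `Destination` shapes, the `plan_deliveries` name, and the four-level severity scale are hypothetical (the spec itself only names High and Critical).

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed severity scale; only "high" and "critical" appear in the spec.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass
class Destination:
    id: int
    enabled: bool = True


@dataclass
class Rule:
    event_type: str
    min_severity: str
    tenant_allowlist: Optional[frozenset] = None  # None means "all tenants"
    destinations: list = field(default_factory=list)
    enabled: bool = True


def plan_deliveries(rule, event):
    """Return the destination ids that should each receive one delivery."""
    if not rule.enabled or event["type"] != rule.event_type:
        return []
    if SEVERITY_ORDER[event["severity"]] < SEVERITY_ORDER[rule.min_severity]:
        return []
    if rule.tenant_allowlist is not None and event["tenant_id"] not in rule.tenant_allowlist:
        return []
    # Disabled destinations are skipped, matching User Story 1, scenario 3.
    return [d.id for d in rule.destinations if d.enabled]
```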
---

### User Story 3 - Deliver alerts safely (dedupe, cooldown, quiet hours) and review history (Priority: P3)

As an operator, I receive timely notifications for important events without spam, and I can review what was sent (or failed) in a delivery history view.

**Why this priority**: Alert quality and traceability are essential for governance and incident response.

**Independent Test**: Trigger the same event twice within cooldown and confirm only one notification is sent; enable quiet hours and confirm delivery is deferred.

**Acceptance Scenarios**:

1. **Given** a rule has a cooldown configured, **When** the same event repeats within the cooldown window, **Then** later deliveries are suppressed for that rule.
2. **Given** quiet hours are enabled for a rule and the current time is within quiet hours (evaluated in the rule’s configured timezone or workspace timezone fallback), **When** a matching event occurs, **Then** a delivery is scheduled for the next allowed window rather than sent immediately.
3. **Given** I have `ALERTS_VIEW`, **When** I open the deliveries viewer, **Then** I can see delivery status and timestamps without exposing destination secrets.
4. **Given** an event is suppressed by cooldown, **When** I open the deliveries viewer, **Then** I can see a `suppressed` delivery entry that references the rule and destination (without exposing destination secrets).

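Scenarios 1 and 4 together imply that every matching event yields a history entry whose status is either `queued` or `suppressed`. The sketch below shows that decision in Python; `CooldownTracker` is a hypothetical in-memory stand-in for whatever store the implementation uses, and a real system would need an atomic check (for example a unique constraint or a lock) to stay correct with concurrent workers.

```python
from datetime import datetime, timedelta, timezone


class CooldownTracker:
    """Remembers when each (rule, fingerprint) pair last produced a notification."""

    def __init__(self):
        self._last_sent = {}

    def decide(self, rule_id, fingerprint, cooldown, now):
        """Return the delivery status to record for this occurrence."""
        key = (rule_id, fingerprint)
        last = self._last_sent.get(key)
        if last is not None and now - last < cooldown:
            return "suppressed"  # recorded in history, no send attempted
        self._last_sent[key] = now
        return "queued"


tracker = CooldownTracker()
start = datetime(2026, 2, 16, 12, 0, tzinfo=timezone.utc)
first = tracker.decide(1, "abc", timedelta(minutes=30), start)
repeat = tracker.decide(1, "abc", timedelta(minutes=30), start + timedelta(minutes=10))
later = tracker.decide(1, "abc", timedelta(minutes=30), start + timedelta(minutes=45))
```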
---

### Edge Cases

- Quiet hours windows that cross midnight must still defer correctly to the next allowed time.
- Multiple background workers triggering the same event concurrently must not cause duplicate sends.
- A destination that is misconfigured (invalid webhook URL or invalid email address list) must fail safely and record a sanitized failure reason (no secrets).
- The UI must not make outbound network requests while rendering pages (no external calls during page load).
- SLA-due alerts are a no-op if the underlying data does not provide a due date yet (no errors; no false alerts).

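The midnight-crossing case is the easiest of these to get wrong, so here is a small Python sketch of one way to compute the deferral time. The function name and parameters are hypothetical; timezones are passed as `tzinfo` objects, with the rule's timezone taking precedence and the workspace timezone as the fallback.

```python
from datetime import datetime, time, timedelta, timezone


def next_allowed_send(now_utc, quiet_start, quiet_end, rule_tz, workspace_tz):
    """Return `now_utc` if sending is allowed, else the end of the quiet window."""
    tz = rule_tz or workspace_tz  # rule timezone, else workspace fallback
    local = now_utc.astimezone(tz)
    t = local.time()
    if quiet_start <= quiet_end:
        # Same-day window, e.g. 12:00-14:00.
        in_quiet = quiet_start <= t < quiet_end
        end_date = local.date()
    else:
        # Window crosses midnight, e.g. 22:00-06:00.
        in_quiet = t >= quiet_start or t < quiet_end
        # Before midnight the window ends tomorrow; after midnight it ends today.
        end_date = local.date() + timedelta(days=1) if t >= quiet_start else local.date()
    if not in_quiet:
        return now_utc
    local_end = datetime.combine(end_date, quiet_end, tzinfo=tz)
    return local_end.astimezone(now_utc.tzinfo)
```

For example, with a 22:00 to 06:00 window and a workspace at UTC+1, an event at 22:30 UTC (23:30 local) would be deferred to 05:00 UTC the next day. A `zoneinfo.ZoneInfo` instance can be passed in place of a fixed offset when daylight saving time matters.
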
## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any Microsoft Graph calls, any write/change behavior,
or any long-running/queued/scheduled work, the spec MUST describe contract registry updates, safety gates
(preview/confirmation/audit), tenant isolation, run observability (`OperationRun` type/identity/visibility), and tests.
If security-relevant DB-only actions intentionally skip `OperationRun`, the spec MUST describe `AuditLog` entries.

**Constitution alignment (RBAC-UX):** If this feature introduces or changes authorization behavior, the spec MUST:

- state which authorization plane(s) are involved (tenant/admin `/admin` + tenant-context `/admin/t/{tenant}/...` vs platform `/system`),
- ensure any cross-plane access is deny-as-not-found (404),
- explicitly define 404 vs 403 semantics:
  - non-member / not entitled to workspace scope OR tenant scope → 404 (deny-as-not-found)
  - member but missing capability → 403
- describe how authorization is enforced server-side (Gates/Policies) for every mutation/operation-start/credential change,
- reference the canonical capability registry (no raw capability strings; no role-string checks in feature code),
- ensure global search is tenant-scoped and non-member-safe (no hints; inaccessible results treated as 404 semantics),
- ensure destructive-like actions require confirmation (`->requiresConfirmation()`),
- include at least one positive and one negative authorization test, and note any RBAC regression tests added/updated.

**Constitution alignment (OPS-EX-AUTH-001):** OIDC/SAML login handshakes may perform synchronous outbound HTTP (e.g., token exchange)
on `/auth/*` endpoints without an `OperationRun`. This MUST NOT be used for Monitoring/Operations pages.

**Constitution alignment (BADGE-001):** If this feature changes status-like badges (status/outcome/severity/risk/availability/boolean),
the spec MUST describe how badge semantics stay centralized (no ad-hoc mappings) and which tests cover any new/changed values.

**Constitution alignment (Filament Action Surfaces):** If this feature adds or modifies any Filament Resource / RelationManager / Page,
the spec MUST include a “UI Action Matrix” (see below) and explicitly state whether the Action Surface Contract is satisfied.
If the contract is not satisfied, the spec MUST include an explicit exemption with rationale.

### Functional Requirements

- **FR-001 (Channels)**: The system MUST support alert delivery via Microsoft Teams (workspace-configured webhook destination) and via email (one or more recipient addresses per destination).
- **FR-002 (Workspace scoping)**: The system MUST scope alert rules and destinations to a workspace; rules/destinations MUST NOT be shared across workspaces.
- **FR-003 (Routing rules)**: The system MUST allow rules to filter alert generation by:
  - event type
  - minimum severity
  - tenant scope (all tenants or allowlist)
- **FR-004 (Multiple destinations)**: The system MUST allow a rule to notify multiple destinations.
- **FR-005 (Event triggers v1)**: The system MUST support these trigger types:
  - High Drift: when a new drift finding first appears (i.e., a new drift finding is created) with severity High or Critical, and it is in an unacknowledged/new state
  - Compare Failed: when a baseline-compare operation run fails (fixed in v1; not configurable per rule)
  - SLA Due: when a finding passes its due date and remains unresolved (if due date data is available)
- **FR-006 (Idempotency / dedupe)**: The system MUST prevent duplicate notifications for repeated occurrences of the same event for a given rule, using a deterministic event fingerprint that contains no secrets.
- **FR-007 (Cooldown)**: The system MUST support a per-rule cooldown window during which repeated fingerprints are suppressed.
- **FR-007a (Suppression visibility)**: When a notification is suppressed by cooldown/dedupe, the system MUST persist an entry in delivery history with `status=suppressed`.
- **FR-008 (Quiet hours)**: The system MUST support optional quiet hours per rule; events during quiet hours MUST be deferred to the next allowed time window.
- **FR-009 (Quiet hours timezone)**: Quiet hours MUST be evaluated in the rule’s configured timezone; if not set, the system MUST fall back to the workspace timezone.
- **FR-010 (Delivery history)**: The system MUST retain a delivery history view showing, at minimum: status (queued/deferred/sent/failed/suppressed/canceled), timestamps, event type, severity, tenant association, and the rule + destination used.
- **FR-011 (Safe logging)**: The system MUST NOT persist destination secrets (webhook URLs, email recipient lists) in logs, error messages, or audit payloads.
- **FR-012 (Auditability)**: The system MUST write auditable events for creation, updates, enable/disable, and deletion of rules and destinations.
- **FR-013 (Operations observability)**: Background work that scans for due alerts and performs alert delivery MUST be observable as operations runs with outcome and timestamps, so operators can diagnose failures.
- **FR-014 (RBAC semantics)**: Authorization MUST follow these semantics:
  - non-member / not entitled to workspace scope → 404
  - member missing `ALERTS_VIEW` → 403 for viewing alert pages
  - member missing `ALERTS_MANAGE` → 403 for create/update/delete/enable/disable
- **FR-015 (DB-only rendering)**: Alert management and delivery history pages MUST render without any outbound network requests.
- **FR-016 (Retention)**: Delivery history MUST be retained for 90 days by default.
- **FR-017 (Delivery retries)**: On delivery failure (Teams/email), the system MUST retry with exponential backoff up to a bounded maximum attempt limit; once the limit is reached, the delivery MUST be marked `failed`.

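FR-017 leaves the backoff parameters open. As a rough illustration, the Python sketch below shows one bounded exponential policy; the function name, the 30-second base, the one-hour cap, and the choice to return a delivery to `queued` between attempts are all assumptions, not values taken from this spec.

```python
def after_failure(attempts_made, max_attempts, base_seconds=30, cap_seconds=3600):
    """Return (status, retry_delay_seconds) after a failed delivery attempt.

    `attempts_made` counts the attempt that just failed, starting at 1.
    """
    if attempts_made >= max_attempts:
        return ("failed", None)  # attempt limit reached: terminal state
    # Delay doubles with each failed attempt, bounded by the cap.
    delay = min(base_seconds * 2 ** (attempts_made - 1), cap_seconds)
    return ("queued", delay)
```

With `max_attempts=3`, the first failure schedules a retry after 30 seconds, the second after 60 seconds, and the third marks the delivery `failed`.
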
## UI Action Matrix *(mandatory when Filament is changed)*

If this feature adds/modifies any Filament Resource / RelationManager / Page, fill out the matrix below.

For each surface, list the exact action labels, whether they are destructive (confirmation? typed confirmation?),
RBAC gating (capability + enforcement helper), and whether the mutation writes an audit log.

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Alert Targets | Workspace → Monitoring → Alerts → Alert targets | Create target | Clickable row to Edit | Edit, More (Enable/Disable, Delete) | More (group rendered; no bulk mutations in v1) | Create target | None | Save, Cancel | Yes | Delete requires confirmation; secrets never shown/logged |
| Alert Rules | Workspace → Monitoring → Alerts → Alert rules | Create rule | Clickable row to Edit | Edit, More (Enable/Disable, Delete) | More (group rendered; no bulk mutations in v1) | Create rule | None | Save, Cancel | Yes | Enable/Disable and Delete are audited; both require confirmation |
| Alert Deliveries (read-only) | Workspace → Monitoring → Alerts → Alert deliveries | None | Clickable row to View | View | None | None | None | N/A | No | Read-only viewer; tenant entitlement filtering enforced |

### Key Entities *(include if feature involves data)*

- **Alert Destination**: A workspace-defined place to send notifications (Teams or email), which can be enabled/disabled.
- **Alert Rule**: A workspace-defined routing rule that decides which events should generate alerts and which destinations they should notify.
- **Alert Event**: A notable system occurrence (e.g., high drift, compare failure, SLA due) that may generate alerts.
- **Event Fingerprint**: A stable, deterministic identifier used to deduplicate repeated events per rule.
- **Alert Delivery**: A record of a planned or attempted notification send, including scheduling (quiet hours), status, and timestamps.

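The spec does not say which fields make up an Event Fingerprint. One plausible construction, shown here in Python purely as an assumption, hashes a canonical JSON encoding of the event type, tenant, and subject identifier. The function and field names are hypothetical; the only properties taken from the spec are that the result is deterministic and contains no secrets.

```python
import hashlib
import json


def event_fingerprint(event_type, tenant_id, subject_id):
    """Derive a stable identifier from non-secret event attributes."""
    # sort_keys and fixed separators make the encoding canonical,
    # so the same inputs always hash to the same value.
    payload = json.dumps(
        {"type": event_type, "tenant": tenant_id, "subject": subject_id},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because destination details such as webhook URLs never enter the payload, the fingerprint can be stored and logged without conflicting with FR-011.
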
## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001 (Setup time)**: A workspace manager can create a destination and a rule and enable it in under 5 minutes.
- **SC-002 (Delivery timeliness)**: Outside quiet hours, at least 95% of eligible alerts are delivered within 2 minutes of the triggering event.
- **SC-003 (Noise control)**: Within a configured cooldown window, the same fingerprint does not generate more than one notification per rule.
- **SC-004 (Security hygiene)**: No destination secrets appear in application logs or audit payloads during normal operation or error cases.
- **SC-005 (Audit traceability)**: 100% of rule/destination create/update/enable/disable/delete actions are traceable via an audit record.