TenantAtlas/specs/099-alerts-v1-teams-email/spec.md
ahmido 3ed275cef3 feat(alerts): Monitoring cluster + v1 resources (spec 099) (#121)
Implements spec `099-alerts-v1-teams-email`.

- Monitoring navigation: Alerts as a cluster under Monitoring; default landing is Alert deliveries.
- Tenant panel: Alerts points to `/admin/alerts` and the cluster navigation is hidden in tenant panel.
- Guard compliance: removes direct `Gate::` usage from Alert resources so `NoAdHocFilamentAuthPatternsTest` passes.

Verification:
- Full suite: `1348 passed, 7 skipped` (EXIT=0).

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #121
2026-02-18 15:20:43 +00:00


# Feature Specification: Alerts v1 (Teams + Email)
**Feature Branch**: `099-alerts-v1-teams-email`
**Created**: 2026-02-16
**Status**: Draft
**Input**: User description: "Alerts v1 (Microsoft Teams + Email)"
## Spec Scope Fields *(mandatory)*
- **Scope**: workspace
- **Primary Routes**:
  - Admin UI → Workspace → Monitoring → Alerts → Alert targets
  - Admin UI → Workspace → Monitoring → Alerts → Alert rules
  - Admin UI → Workspace → Monitoring → Alerts → Alert deliveries (read-only)
- **Data Ownership**:
  - Workspace-owned alert configuration (rules + destinations)
  - Tenant-owned alert delivery history (deliveries are always tenant-scoped)
  - Deliveries are surfaced via workspace-context canonical UI routes, but MUST only reveal deliveries for tenants the actor is entitled to
- **Authorization Planes**:
  - Admin panel `/admin` in **workspace-context** (workspace selected via session-based workspace context)
  - This feature does **not** introduce tenant-context UI routes (no `/admin/t/{tenant}/...` pages for Alerts v1)
- **RBAC**:
  - Workspace membership is required for any access (non-members are denied as not found / 404)
  - Viewing alert configuration/history requires `ALERTS_VIEW`
  - Creating/updating/enabling/disabling/deleting rules or destinations requires `ALERTS_MANAGE`
  - Members without `ALERTS_VIEW` receive 403 for view-only access attempts
  - Members without `ALERTS_MANAGE` receive 403 for mutation attempts; UI surfaces are disabled for them
  - Viewing deliveries additionally requires tenant entitlement for each delivery's tenant (deliveries for non-entitled tenants are filtered out and treated with not-found / 404 semantics)
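
The 404 vs 403 semantics above can be summarized as a small decision function. This is a minimal sketch in Python for illustration only, not the project's implementation: `Actor`, `Decision`, `authorize`, and `authorize_delivery_view` are hypothetical names, and checking the capability before the tenant entitlement is an assumption the spec does not fix.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = 200
    FORBIDDEN = 403
    NOT_FOUND = 404


@dataclass(frozen=True)
class Actor:
    workspace_ids: frozenset        # workspaces the actor is a member of
    capabilities: frozenset         # e.g. {"ALERTS_VIEW", "ALERTS_MANAGE"}
    entitled_tenant_ids: frozenset  # tenants the actor may see


def authorize(actor: Actor, workspace_id: int, capability: str) -> Decision:
    # Non-members are denied as not found, so workspace existence is not leaked.
    if workspace_id not in actor.workspace_ids:
        return Decision.NOT_FOUND
    # Members without the required capability get an explicit 403.
    if capability not in actor.capabilities:
        return Decision.FORBIDDEN
    return Decision.ALLOW


def authorize_delivery_view(actor: Actor, workspace_id: int, delivery_tenant_id: int) -> Decision:
    decision = authorize(actor, workspace_id, "ALERTS_VIEW")
    if decision is not Decision.ALLOW:
        return decision
    # Assumption: tenant entitlement is checked last; non-entitled tenants read as 404.
    if delivery_tenant_id not in actor.entitled_tenant_ids:
        return Decision.NOT_FOUND
    return Decision.ALLOW
```

In a list view the same entitlement check would act as a filter rather than an error, so non-entitled deliveries simply do not appear.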
## Clarifications
### Session 2026-02-16
- Q: Should viewing the Alerts pages (Targets / Rules / Deliveries) require `ALERTS_VIEW`, or is workspace membership alone enough to view? → A: Viewing requires `ALERTS_VIEW` (members without it get 403).
- Q: When an event is suppressed due to cooldown/dedupe, should the system still create an entry in the delivery history (status = `suppressed`)? → A: Yes, create delivery history entries with `status=suppressed` (no send attempted).
- Q: Should `operation.compare_failed` fire only for the single canonical run type (baseline compare), or should v1 allow a per-rule run-type allowlist? → A: Fixed: only baseline compare failures (single canonical run type).
- Q: For quiet hours evaluation, what timezone should be used when a rule does not specify a timezone? → A: Fall back to the workspace timezone.
- Q: When a Teams/email delivery attempt fails, which retry policy should v1 use? → A: Retry with exponential backoff up to a max attempts limit; then mark `failed`.
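
The retry answer above can be illustrated with a short sketch. It is written in Python for illustration only; the base delay, factor, cap, and the default of four attempts are assumptions (the spec only requires exponential backoff with a bounded attempt limit), and `backoff_delays` and `deliver_with_retries` are hypothetical names. A queue-based implementation would more likely re-queue the job with a delay than sleep in-process.

```python
import time
from typing import Callable, List, Tuple


def backoff_delays(max_attempts: int, base_seconds: float = 30.0,
                   factor: float = 2.0, cap_seconds: float = 3600.0) -> List[float]:
    """Delays to wait before attempts 2..max_attempts (all values are assumed)."""
    return [min(cap_seconds, base_seconds * factor ** n) for n in range(max_attempts - 1)]


def deliver_with_retries(send: Callable[[], bool], max_attempts: int = 4,
                         sleep: Callable[[float], None] = time.sleep) -> Tuple[str, int]:
    """Call `send` until it succeeds or the attempt limit is reached."""
    delays = backoff_delays(max_attempts)
    for attempt in range(1, max_attempts + 1):
        if send():
            return ("sent", attempt)
        if attempt < max_attempts:
            # Wait longer after each failure before trying again.
            sleep(delays[attempt - 1])
    # Limit reached: the delivery is marked failed and not retried further.
    return ("failed", max_attempts)
```

With these assumed defaults, four attempts wait 30, 60, and 120 seconds between tries before the delivery is marked `failed`.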
### Assumptions & Dependencies
- Drift findings and operational run outcomes already exist in the system and can be evaluated for alert triggers.
- Events are attributable to a workspace and (where applicable) a tenant so rules can apply tenant scoping.
- SLA-due alerts only apply if the underlying finding data includes a due date; otherwise this trigger is a no-op.
## User Scenarios & Testing *(mandatory)*
<!--
IMPORTANT: User stories should be PRIORITIZED as user journeys ordered by importance.
Each user story/journey must be INDEPENDENTLY TESTABLE - meaning if you implement just ONE of them,
you should still have a viable MVP (Minimum Viable Product) that delivers value.
Assign priorities (P1, P2, P3, etc.) to each story, where P1 is the most critical.
Think of each story as a standalone slice of functionality that can be:
- Developed independently
- Tested independently
- Deployed independently
- Demonstrated to users independently
-->
### User Story 1 - Configure alert destinations (Priority: P1)
As a workspace operator, I can define alert destinations (Microsoft Teams and/or email) that can later be reused by multiple alert rules.
**Why this priority**: Without destinations, alerts cannot be delivered; this is the smallest useful slice.
**Independent Test**: Create a destination, then confirm it is listed, viewable, and can be enabled/disabled.
**Acceptance Scenarios**:
1. **Given** I have `ALERTS_MANAGE`, **When** I create a Microsoft Teams destination with a name and webhook URL, **Then** the destination is saved and appears in the destinations list as enabled.
2. **Given** I have `ALERTS_MANAGE`, **When** I create an Email destination with one or more recipient addresses, **Then** the destination is saved and appears in the destinations list.
3. **Given** a destination exists, **When** I disable it, **Then** it is not used for future alert deliveries.
---
### User Story 2 - Configure alert routing rules (Priority: P2)
As a workspace manager, I can configure routing rules so that only relevant events (by type, severity, and tenant scope) generate alerts, and each rule can notify multiple destinations.
**Why this priority**: Rules provide control over noise, scope, and who gets notified.
**Independent Test**: Create a rule with at least one destination, then trigger one matching event and confirm exactly one delivery is queued per destination.
**Acceptance Scenarios**:
1. **Given** I have `ALERTS_MANAGE`, **When** I create a rule that matches a specific event type and minimum severity, **Then** the rule is saved and appears as enabled.
2. **Given** I configure a rule with tenant scope = allowlist, **When** an event from a non-allowlisted tenant occurs, **Then** no delivery is created for that rule.
3. **Given** a rule has multiple destinations assigned, **When** a matching event occurs, **Then** deliveries are created for each enabled destination.
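
The three acceptance scenarios above amount to a match check followed by a fan-out to enabled destinations. The following is a minimal Python sketch for illustration only: the severity ordering `low < medium < high < critical`, the class names, and the field names are assumptions, not taken from the implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

# Assumed ordering; the spec only names High and Critical explicitly.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Destination:
    id: int
    enabled: bool = True


@dataclass(frozen=True)
class Rule:
    event_type: str
    min_severity: str
    destinations: Tuple[Destination, ...]
    tenant_allowlist: Optional[FrozenSet[int]] = None  # None means "all tenants"
    enabled: bool = True


@dataclass(frozen=True)
class Event:
    event_type: str
    severity: str
    tenant_id: int


def rule_matches(rule: Rule, event: Event) -> bool:
    if not rule.enabled or rule.event_type != event.event_type:
        return False
    if SEVERITY_ORDER[event.severity] < SEVERITY_ORDER[rule.min_severity]:
        return False
    # Allowlist scope: events from tenants outside the list never match.
    return rule.tenant_allowlist is None or event.tenant_id in rule.tenant_allowlist


def destinations_to_notify(rule: Rule, event: Event) -> List[Destination]:
    """One delivery per enabled destination of a matching rule."""
    if not rule_matches(rule, event):
        return []
    return [d for d in rule.destinations if d.enabled]
```
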
---
### User Story 3 - Deliver alerts safely (dedupe, cooldown, quiet hours) and review history (Priority: P3)
As an operator, I receive timely notifications for important events without spam, and I can review what was sent (or failed) in a delivery history view.
**Why this priority**: Alert quality and traceability are essential for governance and incident response.
**Independent Test**: Trigger the same event twice within cooldown and confirm only one notification is sent; enable quiet hours and confirm delivery is deferred.
**Acceptance Scenarios**:
1. **Given** a rule has a cooldown configured, **When** the same event repeats within the cooldown window, **Then** later deliveries are suppressed for that rule.
2. **Given** quiet hours are enabled for a rule and the current time is within quiet hours (evaluated in the rule's configured timezone or, if none is set, the workspace timezone), **When** a matching event occurs, **Then** a delivery is scheduled for the next allowed window rather than sent immediately.
3. **Given** I have `ALERTS_VIEW`, **When** I open the deliveries viewer, **Then** I can see delivery status and timestamps without exposing destination secrets.
4. **Given** an event is suppressed by cooldown, **When** I open the deliveries viewer, **Then** I can see a `suppressed` delivery entry that references the rule and destination (without exposing destination secrets).
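
Scenarios 1 and 4 above (cooldown suppression that still leaves a `suppressed` history entry) can be sketched as follows. This is an in-memory Python illustration under stated assumptions: `CooldownTracker` and its fields are hypothetical, suppressed repeats do not extend the cooldown window, and the concurrency guarantees a real system needs (for example a database uniqueness constraint or lock) are out of scope here.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


class CooldownTracker:
    def __init__(self) -> None:
        # Last time a notification was actually planned, per (rule, fingerprint).
        self._last_sent: Dict[Tuple[int, str], datetime] = {}
        self.history: List[dict] = []

    def plan(self, rule_id: int, destination_ids: Iterable[int], fingerprint: str,
             now: datetime, cooldown: timedelta) -> str:
        key = (rule_id, fingerprint)
        last = self._last_sent.get(key)
        suppressed = last is not None and now - last < cooldown
        status = "suppressed" if suppressed else "queued"
        if not suppressed:
            # Assumption: only real sends start a new cooldown window.
            self._last_sent[key] = now
        for destination_id in destination_ids:
            # A history row is written either way; it references rule and
            # destination by id only, never the destination's secrets.
            self.history.append({
                "rule_id": rule_id,
                "destination_id": destination_id,
                "fingerprint": fingerprint,
                "status": status,
                "created_at": now,
            })
        return status
```
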
---
### Edge Cases
- Quiet hours windows that cross midnight must still defer correctly to the next allowed time.
- Multiple background workers triggering the same event concurrently must not cause duplicate sends.
- A destination that is misconfigured (invalid webhook URL or invalid email address list) must fail safely and record a sanitized failure reason (no secrets).
- The UI must not make outbound network requests while rendering pages (no external calls during page load).
- SLA-due alerts are a no-op if the underlying data does not provide a due date yet (no errors; no false alerts).
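
The midnight-crossing edge case is the one most easily implemented incorrectly, so a worked sketch may help. It is written in Python for illustration only: `next_allowed_time` is a hypothetical helper, treating the window as start-inclusive and end-exclusive is an assumption, and an equal start and end is assumed to mean "no quiet hours". The timezone arguments are `tzinfo` objects; in practice a named zone such as `zoneinfo.ZoneInfo("Europe/Berlin")` could be passed.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo
from typing import Optional


def next_allowed_time(now_utc: datetime, quiet_start: time, quiet_end: time,
                      rule_tz: Optional[tzinfo], workspace_tz: tzinfo) -> datetime:
    """Return `now_utc` if sending is allowed, else the end of the quiet window (UTC)."""
    tz = rule_tz or workspace_tz          # workspace timezone is the fallback
    local = now_utc.astimezone(tz)
    t = local.time()                      # naive wall-clock time in the chosen zone
    if quiet_start == quiet_end:
        return now_utc                    # assumption: empty window
    if quiet_start < quiet_end:
        # Same-day window, e.g. 12:00-14:00.
        in_quiet = quiet_start <= t < quiet_end
        end_date = local.date()
    else:
        # Window crosses midnight, e.g. 22:00-06:00.
        in_quiet = t >= quiet_start or t < quiet_end
        # Before midnight the window ends tomorrow; after midnight it ends today.
        end_date = local.date() + timedelta(days=1) if t >= quiet_start else local.date()
    if not in_quiet:
        return now_utc
    end_local = datetime.combine(end_date, quiet_end, tzinfo=tz)
    return end_local.astimezone(timezone.utc)
```

For a 22:00 to 06:00 window evaluated at UTC+1, an event at 22:30 UTC (23:30 local) is deferred to 05:00 UTC the next day, which is 06:00 local.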
## Requirements *(mandatory)*
**Constitution alignment (required):** If this feature introduces any Microsoft Graph calls, any write/change behavior,
or any long-running/queued/scheduled work, the spec MUST describe contract registry updates, safety gates
(preview/confirmation/audit), tenant isolation, run observability (`OperationRun` type/identity/visibility), and tests.
If security-relevant DB-only actions intentionally skip `OperationRun`, the spec MUST describe `AuditLog` entries.
**Constitution alignment (RBAC-UX):** If this feature introduces or changes authorization behavior, the spec MUST:
- state which authorization plane(s) are involved (tenant/admin `/admin` + tenant-context `/admin/t/{tenant}/...` vs platform `/system`),
- ensure any cross-plane access is deny-as-not-found (404),
- explicitly define 404 vs 403 semantics:
  - non-member / not entitled to workspace scope OR tenant scope → 404 (deny-as-not-found)
  - member but missing capability → 403
- describe how authorization is enforced server-side (Gates/Policies) for every mutation/operation-start/credential change,
- reference the canonical capability registry (no raw capability strings; no role-string checks in feature code),
- ensure global search is tenant-scoped and non-member-safe (no hints; inaccessible results treated as 404 semantics),
- ensure destructive-like actions require confirmation (`->requiresConfirmation()`),
- include at least one positive and one negative authorization test, and note any RBAC regression tests added/updated.
**Constitution alignment (OPS-EX-AUTH-001):** OIDC/SAML login handshakes may perform synchronous outbound HTTP (e.g., token exchange)
on `/auth/*` endpoints without an `OperationRun`. This MUST NOT be used for Monitoring/Operations pages.
**Constitution alignment (BADGE-001):** If this feature changes status-like badges (status/outcome/severity/risk/availability/boolean),
the spec MUST describe how badge semantics stay centralized (no ad-hoc mappings) and which tests cover any new/changed values.
**Constitution alignment (Filament Action Surfaces):** If this feature adds or modifies any Filament Resource / RelationManager / Page,
the spec MUST include a “UI Action Matrix” (see below) and explicitly state whether the Action Surface Contract is satisfied.
If the contract is not satisfied, the spec MUST include an explicit exemption with rationale.
### Functional Requirements
- **FR-001 (Channels)**: The system MUST support alert delivery via Microsoft Teams (workspace-configured webhook destination) and via email (one or more recipient addresses per destination).
- **FR-002 (Workspace scoping)**: The system MUST scope alert rules and destinations to a workspace; rules/destinations MUST NOT be shared across workspaces.
- **FR-003 (Routing rules)**: The system MUST allow rules to filter alert generation by:
  - event type
  - minimum severity
  - tenant scope (all tenants or allowlist)
- **FR-004 (Multiple destinations)**: The system MUST allow a rule to notify multiple destinations.
- **FR-005 (Event triggers v1)**: The system MUST support these trigger types:
  - High Drift: when a new drift finding first appears (i.e., a new drift finding is created) with severity High or Critical, and it is in an unacknowledged/new state
  - Compare Failed: when a baseline-compare operation run fails (fixed in v1; not configurable per rule)
  - SLA Due: when a finding passes its due date and remains unresolved (if due date data is available)
- **FR-006 (Idempotency / dedupe)**: The system MUST prevent duplicate notifications for repeated occurrences of the same event for a given rule, using a deterministic event fingerprint that contains no secrets.
- **FR-007 (Cooldown)**: The system MUST support a per-rule cooldown window during which repeated fingerprints are suppressed.
- **FR-007a (Suppression visibility)**: When a notification is suppressed by cooldown/dedupe, the system MUST persist an entry in delivery history with `status=suppressed`.
- **FR-008 (Quiet hours)**: The system MUST support optional quiet hours per rule; events during quiet hours MUST be deferred to the next allowed time window.
- **FR-009 (Quiet hours timezone)**: Quiet hours MUST be evaluated in the rule's configured timezone; if not set, the system MUST fall back to the workspace timezone.
- **FR-010 (Delivery history)**: The system MUST retain a delivery history view showing, at minimum: status (queued/deferred/sent/failed/suppressed/canceled), timestamps, event type, severity, tenant association, and the rule + destination used.
- **FR-011 (Safe logging)**: The system MUST NOT persist destination secrets (webhook URLs, email recipient lists) in logs, error messages, or audit payloads.
- **FR-012 (Auditability)**: The system MUST write auditable events for creation, updates, enable/disable, and deletion of rules and destinations.
- **FR-013 (Operations observability)**: Background work that scans for due alerts and performs alert delivery MUST be observable as operations runs with outcome and timestamps, so operators can diagnose failures.
- **FR-014 (RBAC semantics)**: Authorization MUST follow these semantics:
  - non-member / not entitled to workspace scope → 404
  - member missing `ALERTS_VIEW` → 403 for viewing alert pages
  - member missing `ALERTS_MANAGE` → 403 for create/update/delete/enable/disable
- **FR-015 (DB-only rendering)**: Alert management and delivery history pages MUST render without any outbound network requests.
- **FR-016 (Retention)**: Delivery history MUST be retained for 90 days by default.
- **FR-017 (Delivery retries)**: On delivery failure (Teams/email), the system MUST retry with exponential backoff up to a bounded maximum attempt limit; once the limit is reached, the delivery MUST be marked `failed`.
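
FR-006 asks for a deterministic fingerprint that contains no secrets. One way to sketch this, in Python for illustration only, is to hash a canonical serialization of stable identifiers. The choice of fields (`rule_id`, `event_type`, `tenant_id`, `subject_id`) and of SHA-256 are assumptions; the spec does not prescribe either.

```python
import hashlib
import json


def event_fingerprint(rule_id: int, event_type: str, tenant_id: int, subject_id: str) -> str:
    """Stable per-rule identifier for an event, built only from non-secret ids."""
    payload = json.dumps(
        {
            "rule_id": rule_id,
            "event_type": event_type,
            "tenant_id": tenant_id,
            "subject_id": subject_id,  # e.g. a finding id or operation run id (assumed)
        },
        sort_keys=True,            # canonical key order keeps the hash deterministic
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because webhook URLs and recipient lists are never inputs, the fingerprint can be stored and shown in delivery history without conflicting with FR-011.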
## UI Action Matrix *(mandatory when Filament is changed)*
If this feature adds/modifies any Filament Resource / RelationManager / Page, fill out the matrix below.
For each surface, list the exact action labels, whether they are destructive (confirmation? typed confirmation?),
RBAC gating (capability + enforcement helper), and whether the mutation writes an audit log.
| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Alert Targets | Workspace → Monitoring → Alerts → Alert targets | Create target | Clickable row to Edit | Edit, More (Enable/Disable, Delete) | More (group rendered; no bulk mutations in v1) | Create target | None | Save, Cancel | Yes | Delete requires confirmation; secrets never shown/logged |
| Alert Rules | Workspace → Monitoring → Alerts → Alert rules | Create rule | Clickable row to Edit | Edit, More (Enable/Disable, Delete) | More (group rendered; no bulk mutations in v1) | Create rule | None | Save, Cancel | Yes | Enable/Disable and Delete are audited; both require confirmation |
| Alert Deliveries (read-only) | Workspace → Monitoring → Alerts → Alert deliveries | None | Clickable row to View | View | None | None | None | N/A | No | Read-only viewer; tenant entitlement filtering enforced |
### Key Entities *(include if feature involves data)*
- **Alert Destination**: A workspace-defined place to send notifications (Teams or email), which can be enabled/disabled.
- **Alert Rule**: A workspace-defined routing rule that decides which events should generate alerts and which destinations they should notify.
- **Alert Event**: A notable system occurrence (e.g., high drift, compare failure, SLA due) that may generate alerts.
- **Event Fingerprint**: A stable, deterministic identifier used to deduplicate repeated events per rule.
- **Alert Delivery**: A record of a planned or attempted notification send, including scheduling (quiet hours), status, and timestamps.
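
As a rough illustration of the Alert Delivery entity, the sketch below models the status values listed in FR-010 and the fields a history row would need. It is Python for illustration only; the field names and types are assumptions rather than the actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class DeliveryStatus(str, Enum):
    # Status values as listed in FR-010.
    QUEUED = "queued"
    DEFERRED = "deferred"
    SENT = "sent"
    FAILED = "failed"
    SUPPRESSED = "suppressed"
    CANCELED = "canceled"


@dataclass
class AlertDelivery:
    workspace_id: int
    tenant_id: int            # deliveries are always tenant-scoped
    rule_id: int
    destination_id: int       # reference only; no webhook URL or recipients stored here
    event_type: str
    severity: str
    fingerprint: str
    status: DeliveryStatus
    created_at: datetime
    send_after: Optional[datetime] = None  # set when deferred by quiet hours
    sent_at: Optional[datetime] = None
```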
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001 (Setup time)**: A workspace manager can create a destination and a rule and enable it in under 5 minutes.
- **SC-002 (Delivery timeliness)**: Outside quiet hours, at least 95% of eligible alerts are delivered within 2 minutes of the triggering event.
- **SC-003 (Noise control)**: Within a configured cooldown window, the same fingerprint does not generate more than one notification per rule.
- **SC-004 (Security hygiene)**: No destination secrets appear in application logs or audit payloads during normal operation or error cases.
- **SC-005 (Audit traceability)**: 100% of rule/destination create/update/enable/disable/delete actions are traceable via an audit record.
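
SC-004 depends on failure reasons being sanitized before they are stored or logged. A minimal Python sketch of such a sanitizer follows; it is illustrative only. The regular expressions are deliberately simple assumptions and `sanitize_failure_reason` is a hypothetical helper; a production implementation would preferably avoid putting secrets into error text in the first place.

```python
import re

# Simple patterns (assumed): any http(s) URL and anything shaped like an email address.
_URL_RE = re.compile(r"https?://\S+")
_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanitize_failure_reason(message: str) -> str:
    """Redact webhook URLs and recipient addresses from an error message."""
    message = _URL_RE.sub("[redacted-url]", message)
    return _EMAIL_RE.sub("[redacted-email]", message)
```

For example, an HTTP client error that echoes the webhook URL would be stored with `[redacted-url]` in its place, which keeps the status code and error class useful for diagnosis.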