# Feature Specification: Alerts v1 (Teams + Email)

**Feature Branch**: `099-alerts-v1-teams-email`
**Created**: 2026-02-16
**Status**: Draft
**Input**: User description: "Alerts v1 (Microsoft Teams + Email)"

## Spec Scope Fields *(mandatory)*

- **Scope**: workspace
- **Primary Routes**:
  - Admin UI → Workspace → Monitoring → Alerts → Alert targets
  - Admin UI → Workspace → Monitoring → Alerts → Alert rules
  - Admin UI → Workspace → Monitoring → Alerts → Alert deliveries (read-only)
- **Data Ownership**:
  - Workspace-owned alert configuration (rules + destinations)
  - Tenant-owned alert delivery history (deliveries are always tenant-scoped)
  - Deliveries are surfaced via workspace-context canonical UI routes, but MUST only reveal deliveries for tenants the actor is entitled to
- **Authorization Planes**:
  - Admin panel `/admin` in **workspace-context** (workspace selected via session-based workspace context)
  - This feature does **not** introduce tenant-context UI routes (no `/admin/t/{tenant}/...` pages for Alerts v1)
- **RBAC**:
  - Workspace membership is required for any access (non-members are denied as not found / 404)
  - Viewing alert configuration/history requires `ALERTS_VIEW`
  - Creating/updating/enabling/disabling/deleting rules or destinations requires `ALERTS_MANAGE`
  - Members without `ALERTS_VIEW` receive 403 for view-only access attempts
  - Members without `ALERTS_MANAGE` receive 403 for mutation attempts; UI surfaces are disabled for them
  - Viewing deliveries additionally requires tenant entitlement for each delivery’s tenant (non-entitled tenants are filtered and treated as not found / 404 semantics)

## Clarifications

### Session 2026-02-16

- Q: Should viewing the Alerts pages (Targets / Rules / Deliveries) require `ALERTS_VIEW`, or is workspace membership alone enough to view? → A: Viewing requires `ALERTS_VIEW` (members without it get 403).
- Q: When an event is suppressed due to cooldown/dedupe, should the system still create an entry in the delivery history (status = `suppressed`)? → A: Yes, create delivery history entries with `status=suppressed` (no send attempted).
- Q: Should `operation.compare_failed` fire only for the single canonical run type (baseline compare), or should v1 allow a per-rule run-type allowlist? → A: Fixed: only baseline compare failures (single canonical run type).
- Q: For quiet hours evaluation, what timezone should be used when a rule does not specify a timezone? → A: Fall back to the workspace timezone.
- Q: When a Teams/email delivery attempt fails, which retry policy should v1 use? → A: Retry with exponential backoff up to a max attempts limit; then mark `failed`.

### Assumptions & Dependencies

- Drift findings and operational run outcomes already exist in the system and can be evaluated for alert triggers.
- Events are attributable to a workspace and (where applicable) a tenant so rules can apply tenant scoping.
- SLA-due alerts only apply if the underlying finding data includes a due date; otherwise this trigger is a no-op.

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Configure alert destinations (Priority: P1)

As a workspace operator, I can define alert destinations (Microsoft Teams and/or email) that can later be reused by multiple alert rules.

**Why this priority**: Without destinations, alerts cannot be delivered; this is the smallest useful slice.

**Independent Test**: Create a destination, then confirm it is listed, viewable, and can be enabled/disabled.

**Acceptance Scenarios**:

1. **Given** I have `ALERTS_MANAGE`, **When** I create a Microsoft Teams destination with a name and webhook URL, **Then** the destination is saved and appears in the destinations list as enabled.
2. **Given** I have `ALERTS_MANAGE`, **When** I create an Email destination with one or more recipient addresses, **Then** the destination is saved and appears in the destinations list.
3. **Given** a destination exists, **When** I disable it, **Then** it is not used for future alert deliveries.

---

### User Story 2 - Configure alert routing rules (Priority: P2)

As a workspace manager, I can configure routing rules so that only relevant events (by type, severity, and tenant scope) generate alerts, and each rule can notify multiple destinations.

**Why this priority**: Rules provide control over noise, scope, and who gets notified.

**Independent Test**: Create a rule with at least one destination, then trigger one matching event and confirm exactly one delivery is queued per destination.

**Acceptance Scenarios**:

1. **Given** I have `ALERTS_MANAGE`, **When** I create a rule that matches a specific event type and minimum severity, **Then** the rule is saved and appears as enabled.
2. **Given** I configure a rule with tenant scope = allowlist, **When** an event from a non-allowlisted tenant occurs, **Then** no delivery is created for that rule.
3. **Given** a rule has multiple destinations assigned, **When** a matching event occurs, **Then** deliveries are created for each enabled destination.

---

### User Story 3 - Deliver alerts safely (dedupe, cooldown, quiet hours) and review history (Priority: P3)

As an operator, I receive timely notifications for important events without spam, and I can review what was sent (or failed) in a delivery history view.

**Why this priority**: Alert quality and traceability are essential for governance and incident response.

**Independent Test**: Trigger the same event twice within cooldown and confirm only one notification is sent; enable quiet hours and confirm delivery is deferred.

**Acceptance Scenarios**:

1. **Given** a rule has a cooldown configured, **When** the same event repeats within the cooldown window, **Then** later deliveries are suppressed for that rule.
2. **Given** quiet hours are enabled for a rule and the current time is within quiet hours (evaluated in the rule’s configured timezone or workspace timezone fallback), **When** a matching event occurs, **Then** a delivery is scheduled for the next allowed window rather than sent immediately.
3. **Given** I have `ALERTS_VIEW`, **When** I open the deliveries viewer, **Then** I can see delivery status and timestamps without exposing destination secrets.
4. **Given** an event is suppressed by cooldown, **When** I open the deliveries viewer, **Then** I can see a `suppressed` delivery entry that references the rule and destination (without exposing destination secrets).

---

### Edge Cases

- Quiet hours windows that cross midnight must still defer correctly to the next allowed time.
- Multiple background workers triggering the same event concurrently must not cause duplicate sends.
- A destination that is misconfigured (invalid webhook URL or invalid email address list) must fail safely and record a sanitized failure reason (no secrets).
- The UI must not make outbound network requests while rendering pages (no external calls during page load).
- SLA-due alerts are a no-op if the underlying data does not provide a due date yet (no errors; no false alerts).

## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any Microsoft Graph calls, any write/change behavior, or any long-running/queued/scheduled work, the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, run observability (`OperationRun` type/identity/visibility), and tests. If security-relevant DB-only actions intentionally skip `OperationRun`, the spec MUST describe `AuditLog` entries.
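
The quiet-hours behavior in User Story 3 and the midnight-crossing edge case can be illustrated with a small sketch. This is not the implementation: Python is used only for illustration, and the names `effective_timezone`, `in_quiet_hours`, and `next_allowed_send` are hypothetical. It assumes `now` has already been converted to the effective timezone and that the window is half-open (start inclusive, end exclusive).

```python
from datetime import datetime, time, timedelta
from typing import Optional


def effective_timezone(rule_tz: Optional[str], workspace_tz: str) -> str:
    """Rule timezone if configured, otherwise the workspace timezone."""
    return rule_tz or workspace_tz


def in_quiet_hours(local_time: time, start: time, end: time) -> bool:
    """True if local_time falls inside the quiet window [start, end)."""
    if start == end:
        # Assumption: an empty window means quiet hours are effectively off.
        return False
    if start < end:
        # Same-day window, e.g. 13:00-14:00.
        return start <= local_time < end
    # Window crosses midnight, e.g. 22:00-06:00.
    return local_time >= start or local_time < end


def next_allowed_send(now: datetime, start: time, end: time) -> datetime:
    """Return now if sending is allowed, else the end of the current window."""
    if not in_quiet_hours(now.time(), start, end):
        return now
    candidate = datetime.combine(now.date(), end, tzinfo=now.tzinfo)
    if candidate <= now:
        # The window started before midnight and ends on the next day.
        candidate += timedelta(days=1)
    return candidate
```

With a 22:00-06:00 window, an event at 23:30 is deferred to 06:00 the next day, while an event at 02:00 is deferred to 06:00 the same day.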

**Constitution alignment (RBAC-UX):** If this feature introduces or changes authorization behavior, the spec MUST:

- state which authorization plane(s) are involved (tenant/admin `/admin` + tenant-context `/admin/t/{tenant}/...` vs platform `/system`),
- ensure any cross-plane access is deny-as-not-found (404),
- explicitly define 404 vs 403 semantics:
  - non-member / not entitled to workspace scope OR tenant scope → 404 (deny-as-not-found)
  - member but missing capability → 403
- describe how authorization is enforced server-side (Gates/Policies) for every mutation/operation-start/credential change,
- reference the canonical capability registry (no raw capability strings; no role-string checks in feature code),
- ensure global search is tenant-scoped and non-member-safe (no hints; inaccessible results treated as 404 semantics),
- ensure destructive-like actions require confirmation (`->requiresConfirmation()`),
- include at least one positive and one negative authorization test, and note any RBAC regression tests added/updated.

**Constitution alignment (OPS-EX-AUTH-001):** OIDC/SAML login handshakes may perform synchronous outbound HTTP (e.g., token exchange) on `/auth/*` endpoints without an `OperationRun`. This MUST NOT be used for Monitoring/Operations pages.

**Constitution alignment (BADGE-001):** If this feature changes status-like badges (status/outcome/severity/risk/availability/boolean), the spec MUST describe how badge semantics stay centralized (no ad-hoc mappings) and which tests cover any new/changed values.

**Constitution alignment (Filament Action Surfaces):** If this feature adds or modifies any Filament Resource / RelationManager / Page, the spec MUST include a “UI Action Matrix” (see below) and explicitly state whether the Action Surface Contract is satisfied. If the contract is not satisfied, the spec MUST include an explicit exemption with rationale.
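
As an illustration of the 404 vs 403 semantics in the RBAC-UX alignment above, the decision order might be sketched as follows. This is a language-neutral sketch in Python, not the Gate/Policy code the spec calls for. The `Actor` type and `authorize` function are hypothetical. Checking tenant entitlement before capability is an assumption, chosen so that a missing capability never reveals that a non-entitled tenant's delivery exists.

```python
from dataclasses import dataclass
from typing import Optional

# Capability names from this spec; a real implementation would take them
# from the canonical capability registry rather than raw strings.
ALERTS_VIEW = "ALERTS_VIEW"
ALERTS_MANAGE = "ALERTS_MANAGE"


@dataclass(frozen=True)
class Actor:
    """Hypothetical view of the current user within the selected workspace."""
    is_workspace_member: bool
    capabilities: frozenset = frozenset()
    entitled_tenant_ids: frozenset = frozenset()


def authorize(actor: Actor, capability: str, tenant_id: Optional[str] = None) -> int:
    """Return the HTTP status the request should resolve to."""
    if not actor.is_workspace_member:
        return 404  # non-member: deny-as-not-found
    if tenant_id is not None and tenant_id not in actor.entitled_tenant_ids:
        return 404  # not entitled to the tenant: deny-as-not-found
    if capability not in actor.capabilities:
        return 403  # member, but missing the capability
    return 200
```

A member holding only `ALERTS_VIEW` would get 200 when viewing, 403 when mutating, and 404 when opening a delivery that belongs to a tenant they are not entitled to.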

### Functional Requirements

- **FR-001 (Channels)**: The system MUST support alert delivery via Microsoft Teams (workspace-configured webhook destination) and via email (one or more recipient addresses per destination).
- **FR-002 (Workspace scoping)**: The system MUST scope alert rules and destinations to a workspace; rules/destinations MUST NOT be shared across workspaces.
- **FR-003 (Routing rules)**: The system MUST allow rules to filter alert generation by:
  - event type
  - minimum severity
  - tenant scope (all tenants or allowlist)
- **FR-004 (Multiple destinations)**: The system MUST allow a rule to notify multiple destinations.
- **FR-005 (Event triggers v1)**: The system MUST support these trigger types:
  - High Drift: when a new drift finding first appears (i.e., a new drift finding is created) with severity High or Critical, and it is in an unacknowledged/new state
  - Compare Failed: when a baseline-compare operation run fails (fixed in v1; not configurable per rule)
  - SLA Due: when a finding passes its due date and remains unresolved (if due date data is available)
- **FR-006 (Idempotency / dedupe)**: The system MUST prevent duplicate notifications for repeated occurrences of the same event for a given rule, using a deterministic event fingerprint that contains no secrets.
- **FR-007 (Cooldown)**: The system MUST support a per-rule cooldown window during which repeated fingerprints are suppressed.
- **FR-007a (Suppression visibility)**: When a notification is suppressed by cooldown/dedupe, the system MUST persist an entry in delivery history with `status=suppressed`.
- **FR-008 (Quiet hours)**: The system MUST support optional quiet hours per rule; events during quiet hours MUST be deferred to the next allowed time window.
- **FR-009 (Quiet hours timezone)**: Quiet hours MUST be evaluated in the rule’s configured timezone; if not set, the system MUST fall back to the workspace timezone.
- **FR-010 (Delivery history)**: The system MUST retain a delivery history view showing, at minimum: status (queued/deferred/sent/failed/suppressed/canceled), timestamps, event type, severity, tenant association, and the rule + destination used.
- **FR-011 (Safe logging)**: The system MUST NOT persist destination secrets (webhook URLs, email recipient lists) in logs, error messages, or audit payloads.
- **FR-012 (Auditability)**: The system MUST write auditable events for creation, updates, enable/disable, and deletion of rules and destinations.
- **FR-013 (Operations observability)**: Background work that scans for due alerts and performs alert delivery MUST be observable as operations runs with outcome and timestamps, so operators can diagnose failures.
- **FR-014 (RBAC semantics)**: Authorization MUST follow these semantics:
  - non-member / not entitled to workspace scope → 404
  - member missing `ALERTS_VIEW` → 403 for viewing alert pages
  - member missing `ALERTS_MANAGE` → 403 for create/update/delete/enable/disable
- **FR-015 (DB-only rendering)**: Alert management and delivery history pages MUST render without any outbound network requests.
- **FR-016 (Retention)**: Delivery history MUST be retained for 90 days by default.
- **FR-017 (Delivery retries)**: On delivery failure (Teams/email), the system MUST retry with exponential backoff up to a bounded maximum attempt limit; once the limit is reached, the delivery MUST be marked `failed`.

## UI Action Matrix *(mandatory when Filament is changed)*

If this feature adds/modifies any Filament Resource / RelationManager / Page, fill out the matrix below.

For each surface, list the exact action labels, whether they are destructive (confirmation? typed confirmation?), RBAC gating (capability + enforcement helper), and whether the mutation writes an audit log.

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Alert Targets | Workspace → Monitoring → Alerts → Alert targets | Create target | Clickable row to Edit | Edit, More (Enable/Disable, Delete) | More (group rendered; no bulk mutations in v1) | Create target | None | Save, Cancel | Yes | Delete requires confirmation; secrets never shown/logged |
| Alert Rules | Workspace → Monitoring → Alerts → Alert rules | Create rule | Clickable row to Edit | Edit, More (Enable/Disable, Delete) | More (group rendered; no bulk mutations in v1) | Create rule | None | Save, Cancel | Yes | Enable/Disable and Delete are audited; both require confirmation |
| Alert Deliveries (read-only) | Workspace → Monitoring → Alerts → Alert deliveries | None | Clickable row to View | View | None | None | None | N/A | No | Read-only viewer; tenant entitlement filtering enforced |

### Key Entities *(include if feature involves data)*

- **Alert Destination**: A workspace-defined place to send notifications (Teams or email), which can be enabled/disabled.
- **Alert Rule**: A workspace-defined routing rule that decides which events should generate alerts and which destinations they should notify.
- **Alert Event**: A notable system occurrence (e.g., high drift, compare failure, SLA due) that may generate alerts.
- **Event Fingerprint**: A stable, deterministic identifier used to deduplicate repeated events per rule.
- **Alert Delivery**: A record of a planned or attempted notification send, including scheduling (quiet hours), status, and timestamps.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001 (Setup time)**: A workspace manager can create a destination and a rule and enable it in under 5 minutes.
- **SC-002 (Delivery timeliness)**: Outside quiet hours, at least 95% of eligible alerts are delivered within 2 minutes of the triggering event.
- **SC-003 (Noise control)**: Within a configured cooldown window, the same fingerprint does not generate more than one notification per rule.
- **SC-004 (Security hygiene)**: No destination secrets appear in application logs or audit payloads during normal operation or error cases.
- **SC-005 (Audit traceability)**: 100% of rule/destination create/update/enable/disable/delete actions are traceable via an audit record.
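
To make SC-003 and the fingerprint/cooldown requirements (FR-006, FR-007, FR-007a) concrete, here is a minimal sketch in Python, used only for illustration. The fingerprint inputs (`event_type`, `tenant_id`, `subject_id`) and the `CooldownGate` class are assumptions, not part of this spec. The in-memory dictionary stands in for persistent storage; a real implementation would need a database-level guard (for example a unique constraint or lock) to stay correct when multiple workers run concurrently.

```python
import hashlib
import json
from datetime import datetime, timedelta


def event_fingerprint(event_type: str, tenant_id: str, subject_id: str) -> str:
    """Deterministic fingerprint built only from non-secret identifiers."""
    payload = json.dumps(
        {"event_type": event_type, "tenant_id": tenant_id, "subject_id": subject_id},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CooldownGate:
    """Tracks the last non-suppressed delivery per (rule, fingerprint)."""

    def __init__(self) -> None:
        self._last_sent: dict = {}

    def decide(self, rule_id: str, fingerprint: str,
               now: datetime, cooldown: timedelta) -> str:
        """Return the delivery status to record: 'queued' or 'suppressed'."""
        key = (rule_id, fingerprint)
        last = self._last_sent.get(key)
        if last is not None and now - last < cooldown:
            # Still recorded in delivery history, but no send is attempted.
            return "suppressed"
        self._last_sent[key] = now
        return "queued"
```

Because the key includes the rule, the same fingerprint is suppressed independently per rule, and a suppressed occurrence does not extend the cooldown window in this sketch.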