TenantAtlas/specs/099-alerts-v1-teams-email/spec.md

# Feature Specification: Alerts v1 (Teams + Email)

**Feature Branch**: `099-alerts-v1-teams-email`
**Created**: 2026-02-16
**Status**: Draft
**Input**: User description: "Alerts v1 (Microsoft Teams + Email)"

## Spec Scope Fields *(mandatory)*

- **Scope**: workspace
- **Primary Routes**:
  - Admin UI → Workspace → Monitoring → Alerts → Alert targets
  - Admin UI → Workspace → Monitoring → Alerts → Alert rules
  - Admin UI → Workspace → Monitoring → Alerts → Alert deliveries (read-only)
- **Data Ownership**:
  - Workspace-owned alert configuration (rules + destinations)
  - Tenant-owned alert delivery history (deliveries are always tenant-scoped)
  - Deliveries are surfaced via workspace-context canonical UI routes, but MUST only reveal deliveries for tenants the actor is entitled to
- **Authorization Planes**:
  - Admin panel `/admin` in **workspace-context** (workspace selected via session-based workspace context)
  - This feature does **not** introduce tenant-context UI routes (no `/admin/t/{tenant}/...` pages for Alerts v1)
- **RBAC**:
  - Workspace membership is required for any access (non-members are denied as not found / 404)
  - Viewing alert configuration/history requires `ALERTS_VIEW`
  - Creating/updating/enabling/disabling/deleting rules or destinations requires `ALERTS_MANAGE`
  - Members without `ALERTS_VIEW` receive 403 for view-only access attempts
  - Members without `ALERTS_MANAGE` receive 403 for mutation attempts; UI surfaces are disabled for them
  - Viewing deliveries additionally requires tenant entitlement for each delivery’s tenant; deliveries for non-entitled tenants are filtered out of lists, and direct access to them is denied as not found (404)
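
The 404/403 semantics above can be summarized as a small decision function. This is a minimal sketch, not the project's enforcement code: `Actor`, `authorize`, and the field names are hypothetical, and the check order (capability before tenant entitlement) is an assumption the spec does not state.

```python
from dataclasses import dataclass
from typing import Optional

ALERTS_VIEW = "ALERTS_VIEW"
ALERTS_MANAGE = "ALERTS_MANAGE"


@dataclass(frozen=True)
class Actor:
    """Hypothetical view of the signed-in user within the selected workspace."""
    is_workspace_member: bool
    capabilities: frozenset = frozenset()
    entitled_tenant_ids: frozenset = frozenset()


def authorize(actor: Actor, capability: str,
              delivery_tenant_id: Optional[str] = None) -> int:
    """Return the HTTP status a request should resolve to: 200, 403, or 404."""
    # Non-members must not learn that the resource exists.
    if not actor.is_workspace_member:
        return 404
    # Members lacking the capability are told they are forbidden.
    if capability not in actor.capabilities:
        return 403
    # Assumption: tenant entitlement is checked last, and only for deliveries.
    if delivery_tenant_id is not None and delivery_tenant_id not in actor.entitled_tenant_ids:
        return 404
    return 200
```

In a list view, the tenant check would typically be applied as a query filter instead of a per-record status, so non-entitled deliveries simply never appear.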

## Clarifications

### Session 2026-02-16

- Q: Should viewing the Alerts pages (Targets / Rules / Deliveries) require `ALERTS_VIEW`, or is workspace membership alone enough to view? → A: Viewing requires `ALERTS_VIEW` (members without it get 403).
- Q: When an event is suppressed due to cooldown/dedupe, should the system still create an entry in the delivery history (status = `suppressed`)? → A: Yes, create delivery history entries with `status=suppressed` (no send attempted).
- Q: Should `operation.compare_failed` fire only for the single canonical run type (baseline compare), or should v1 allow a per-rule run-type allowlist? → A: Fixed: only baseline compare failures (single canonical run type).
- Q: For quiet hours evaluation, what timezone should be used when a rule does not specify a timezone? → A: Fall back to the workspace timezone.
- Q: When a Teams/email delivery attempt fails, which retry policy should v1 use? → A: Retry with exponential backoff up to a max attempts limit; then mark `failed`.

### Assumptions & Dependencies

- Drift findings and operational run outcomes already exist in the system and can be evaluated for alert triggers.
- Events are attributable to a workspace and (where applicable) a tenant so rules can apply tenant scoping.
- SLA-due alerts only apply if the underlying finding data includes a due date; otherwise this trigger is a no-op.

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Configure alert destinations (Priority: P1)

As a workspace operator, I can define alert destinations (Microsoft Teams and/or email) that can later be reused by multiple alert rules.

**Why this priority**: Without destinations, alerts cannot be delivered; this is the smallest useful slice.

**Independent Test**: Create a destination, then confirm it is listed, viewable, and can be enabled/disabled.

**Acceptance Scenarios**:

1. **Given** I have `ALERTS_MANAGE`, **When** I create a Microsoft Teams destination with a name and webhook URL, **Then** the destination is saved and appears in the destinations list as enabled.
2. **Given** I have `ALERTS_MANAGE`, **When** I create an Email destination with one or more recipient addresses, **Then** the destination is saved and appears in the destinations list.
3. **Given** a destination exists, **When** I disable it, **Then** it is not used for future alert deliveries.

---

### User Story 2 - Configure alert routing rules (Priority: P2)

As a workspace manager, I can configure routing rules so that only relevant events (by type, severity, and tenant scope) generate alerts, and each rule can notify multiple destinations.

**Why this priority**: Rules provide control over noise, scope, and who gets notified.

**Independent Test**: Create a rule with at least one destination, then trigger one matching event and confirm exactly one delivery is queued per destination.

**Acceptance Scenarios**:

1. **Given** I have `ALERTS_MANAGE`, **When** I create a rule that matches a specific event type and minimum severity, **Then** the rule is saved and appears as enabled.
2. **Given** I configure a rule with tenant scope = allowlist, **When** an event from a non-allowlisted tenant occurs, **Then** no delivery is created for that rule.
3. **Given** a rule has multiple destinations assigned, **When** a matching event occurs, **Then** deliveries are created for each enabled destination.
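
The matching behavior in these scenarios can be sketched as a pure function. All names here (`Rule`, `Event`, `Destination`, `destinations_to_notify`) and the four-level severity scale are illustrative assumptions; the spec only names High and Critical explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed severity scale; only "high" and "critical" are named in the spec.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Destination:
    id: str
    enabled: bool = True


@dataclass(frozen=True)
class Rule:
    event_type: str
    min_severity: str
    destinations: Tuple[Destination, ...]
    tenant_allowlist: Optional[frozenset] = None  # None means "all tenants"
    enabled: bool = True


@dataclass(frozen=True)
class Event:
    event_type: str
    severity: str
    tenant_id: str


def destinations_to_notify(rule: Rule, event: Event) -> List[Destination]:
    """Return the destinations that should get one delivery each for this event."""
    if not rule.enabled or rule.event_type != event.event_type:
        return []
    if SEVERITY_ORDER[event.severity] < SEVERITY_ORDER[rule.min_severity]:
        return []
    if rule.tenant_allowlist is not None and event.tenant_id not in rule.tenant_allowlist:
        return []
    # Disabled destinations are skipped; every enabled one gets a delivery.
    return [d for d in rule.destinations if d.enabled]
```

Keeping the match logic free of side effects makes the "exactly one delivery per destination" test straightforward: the caller creates one delivery record per returned destination.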

---

### User Story 3 - Deliver alerts safely (dedupe, cooldown, quiet hours) and review history (Priority: P3)

As an operator, I receive timely notifications for important events without spam, and I can review what was sent (or failed) in a delivery history view.

**Why this priority**: Alert quality and traceability are essential for governance and incident response.

**Independent Test**: Trigger the same event twice within cooldown and confirm only one notification is sent; enable quiet hours and confirm delivery is deferred.

**Acceptance Scenarios**:

1. **Given** a rule has a cooldown configured, **When** the same event repeats within the cooldown window, **Then** the repeated occurrences are recorded with `status=suppressed` and no notification is sent for that rule.
2. **Given** quiet hours are enabled for a rule and the current time is within quiet hours (evaluated in the rule’s configured timezone, falling back to the workspace timezone), **When** a matching event occurs, **Then** a delivery is scheduled for the next allowed window rather than sent immediately.
3. **Given** I have `ALERTS_VIEW`, **When** I open the deliveries viewer, **Then** I can see delivery status and timestamps without exposing destination secrets.
4. **Given** an event is suppressed by cooldown, **When** I open the deliveries viewer, **Then** I can see a `suppressed` delivery entry that references the rule and destination (without exposing destination secrets).
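
The cooldown scenarios can be illustrated with a small in-memory tracker. This is a sketch under stated assumptions: `CooldownTracker` is a hypothetical name, the window is anchored on the last non-suppressed occurrence, and an in-memory dict is not safe across multiple workers (a real implementation would need an atomic check, for example a unique database constraint).

```python
from datetime import datetime, timedelta
from typing import Dict, Tuple


class CooldownTracker:
    """Decides whether an occurrence is queued or suppressed for a given rule."""

    def __init__(self, cooldown: timedelta) -> None:
        self.cooldown = cooldown
        # (rule_id, fingerprint) -> time of the last occurrence that was not suppressed
        self._last_sent: Dict[Tuple[str, str], datetime] = {}

    def decide(self, rule_id: str, fingerprint: str, now: datetime) -> str:
        """Return the delivery status to record: "queued" or "suppressed"."""
        key = (rule_id, fingerprint)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            # Still recorded in delivery history, but no send is attempted.
            return "suppressed"
        self._last_sent[key] = now
        return "queued"
```

Because the key includes the rule, the same fingerprint matched by two different rules is tracked independently, which matches the "per rule" wording of the cooldown requirement.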

---

### Edge Cases

- Quiet hours windows that cross midnight must still defer correctly to the next allowed time.
- Multiple background workers triggering the same event concurrently must not cause duplicate sends.
- A destination that is misconfigured (invalid webhook URL or invalid email address list) must fail safely and record a sanitized failure reason (no secrets).
- The UI must not make outbound network requests while rendering pages (no external calls during page load).
- SLA-due alerts are a no-op if the underlying data does not provide a due date yet (no errors; no false alerts).
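
The midnight-crossing case is easy to get wrong, so a sketch of the deferral calculation may help. `next_allowed_send` and its parameters are hypothetical; the window is treated as start-inclusive and end-exclusive, which is an assumption. Timezones are passed as `tzinfo` objects so that the rule timezone can fall back to the workspace timezone.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo
from typing import Optional


def next_allowed_send(now_utc: datetime, quiet_start: time, quiet_end: time,
                      rule_tz: Optional[tzinfo], workspace_tz: tzinfo) -> datetime:
    """Return now_utc if sending is allowed, else the UTC time quiet hours end.

    now_utc must be timezone-aware.
    """
    # Fall back to the workspace timezone when the rule has none configured.
    tz = rule_tz if rule_tz is not None else workspace_tz
    local = now_utc.astimezone(tz)
    now_t = local.time()

    if quiet_start <= quiet_end:
        # Same-day window, e.g. 12:00-14:00.
        in_quiet = quiet_start <= now_t < quiet_end
        end_date = local.date()
    else:
        # Window crosses midnight, e.g. 22:00-06:00.
        in_quiet = now_t >= quiet_start or now_t < quiet_end
        # Before midnight the window ends tomorrow; after midnight it ends today.
        end_date = local.date() + timedelta(days=1) if now_t >= quiet_start else local.date()

    if not in_quiet:
        return now_utc
    end_local = datetime.combine(end_date, quiet_end, tzinfo=tz)
    return end_local.astimezone(timezone.utc)
```

For named zones with daylight-saving rules, a `zoneinfo.ZoneInfo` instance can be passed in place of the fixed offsets used for testing.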

## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any Microsoft Graph calls, any write/change behavior,
or any long-running/queued/scheduled work, the spec MUST describe contract registry updates, safety gates
(preview/confirmation/audit), tenant isolation, run observability (`OperationRun` type/identity/visibility), and tests.
If security-relevant DB-only actions intentionally skip `OperationRun`, the spec MUST describe `AuditLog` entries.

**Constitution alignment (RBAC-UX):** If this feature introduces or changes authorization behavior, the spec MUST:
- state which authorization plane(s) are involved (tenant/admin `/admin` + tenant-context `/admin/t/{tenant}/...` vs platform `/system`),
- ensure any cross-plane access is deny-as-not-found (404),
- explicitly define 404 vs 403 semantics:
  - non-member / not entitled to workspace scope OR tenant scope → 404 (deny-as-not-found)
  - member but missing capability → 403
- describe how authorization is enforced server-side (Gates/Policies) for every mutation/operation-start/credential change,
- reference the canonical capability registry (no raw capability strings; no role-string checks in feature code),
- ensure global search is tenant-scoped and non-member-safe (no hints; inaccessible results treated as 404 semantics),
- ensure destructive-like actions require confirmation (`->requiresConfirmation()`),
- include at least one positive and one negative authorization test, and note any RBAC regression tests added/updated.

**Constitution alignment (OPS-EX-AUTH-001):** OIDC/SAML login handshakes may perform synchronous outbound HTTP (e.g., token exchange)
on `/auth/*` endpoints without an `OperationRun`. This MUST NOT be used for Monitoring/Operations pages.

**Constitution alignment (BADGE-001):** If this feature changes status-like badges (status/outcome/severity/risk/availability/boolean),
the spec MUST describe how badge semantics stay centralized (no ad-hoc mappings) and which tests cover any new/changed values.

**Constitution alignment (Filament Action Surfaces):** If this feature adds or modifies any Filament Resource / RelationManager / Page,
the spec MUST include a “UI Action Matrix” (see below) and explicitly state whether the Action Surface Contract is satisfied.
If the contract is not satisfied, the spec MUST include an explicit exemption with rationale.

### Functional Requirements

- **FR-001 (Channels)**: The system MUST support alert delivery via Microsoft Teams (workspace-configured webhook destination) and via email (one or more recipient addresses per destination).
- **FR-002 (Workspace scoping)**: The system MUST scope alert rules and destinations to a workspace; rules/destinations MUST NOT be shared across workspaces.
- **FR-003 (Routing rules)**: The system MUST allow rules to filter alert generation by:
  - event type
  - minimum severity
  - tenant scope (all tenants or allowlist)
- **FR-004 (Multiple destinations)**: The system MUST allow a rule to notify multiple destinations.
- **FR-005 (Event triggers v1)**: The system MUST support these trigger types:
  - High Drift: when a drift finding is first created with severity High or Critical and is still in the new (unacknowledged) state
  - Compare Failed: when a baseline-compare operation run fails (fixed in v1; not configurable per rule)
  - SLA Due: when a finding passes its due date and remains unresolved (if due date data is available)
- **FR-006 (Idempotency / dedupe)**: The system MUST prevent duplicate notifications for repeated occurrences of the same event for a given rule, using a deterministic event fingerprint that contains no secrets.
- **FR-007 (Cooldown)**: The system MUST support a per-rule cooldown window during which repeated fingerprints are suppressed.
- **FR-007a (Suppression visibility)**: When a notification is suppressed by cooldown/dedupe, the system MUST persist an entry in delivery history with `status=suppressed`.
- **FR-008 (Quiet hours)**: The system MUST support optional quiet hours per rule; events during quiet hours MUST be deferred to the next allowed time window.
- **FR-009 (Quiet hours timezone)**: Quiet hours MUST be evaluated in the rule’s configured timezone; if not set, the system MUST fall back to the workspace timezone.
- **FR-010 (Delivery history)**: The system MUST retain a delivery history view showing, at minimum: status (queued/deferred/sent/failed/suppressed/canceled), timestamps, event type, severity, tenant association, and the rule + destination used.
- **FR-011 (Safe logging)**: The system MUST NOT persist destination secrets (webhook URLs, email recipient lists) in logs, error messages, or audit payloads.
- **FR-012 (Auditability)**: The system MUST write auditable events for creation, updates, enable/disable, and deletion of rules and destinations.
- **FR-013 (Operations observability)**: Background work that scans for due alerts and performs alert delivery MUST be observable as operation runs (`OperationRun`) with outcome and timestamps, so operators can diagnose failures.
- **FR-014 (RBAC semantics)**: Authorization MUST follow these semantics:
  - non-member / not entitled to workspace scope → 404
  - member not entitled to a delivery’s tenant → 404 for that delivery (filtered from lists)
  - member missing `ALERTS_VIEW` → 403 for viewing alert pages
  - member missing `ALERTS_MANAGE` → 403 for create/update/delete/enable/disable
- **FR-015 (DB-only rendering)**: Alert management and delivery history pages MUST render without any outbound network requests.
- **FR-016 (Retention)**: Delivery history MUST be retained for 90 days by default.
- **FR-017 (Delivery retries)**: On delivery failure (Teams/email), the system MUST retry with exponential backoff up to a bounded maximum attempt limit; once the limit is reached, the delivery MUST be marked `failed`.
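
The retry policy in FR-017 can be sketched as follows. The base delay, factor, cap, and maximum attempt count are placeholder values, not figures from this spec, and `"retry"` is a decision label used here for illustration; the spec does not say which delivery status a pending retry carries.

```python
from typing import Optional, Tuple


def retry_delay_seconds(attempt: int, base_seconds: float = 30.0,
                        factor: float = 2.0, cap_seconds: float = 3600.0) -> float:
    """Delay before the next try, after failed attempt number `attempt` (1-based)."""
    return min(base_seconds * factor ** (attempt - 1), cap_seconds)


def after_failure(attempt: int, max_attempts: int = 5) -> Tuple[str, Optional[float]]:
    """Decide what happens after attempt number `attempt` has failed."""
    if attempt >= max_attempts:
        # Attempt limit reached: the delivery is marked failed, with no further retries.
        return ("failed", None)
    return ("retry", retry_delay_seconds(attempt))
```

With these placeholder defaults, the delays after attempts 1 through 4 are 30, 60, 120, and 240 seconds, and a fifth failure marks the delivery `failed`.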

## UI Action Matrix *(mandatory when Filament is changed)*

If this feature adds/modifies any Filament Resource / RelationManager / Page, fill out the matrix below.

For each surface, list the exact action labels, whether they are destructive (confirmation? typed confirmation?),
RBAC gating (capability + enforcement helper), and whether the mutation writes an audit log.

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Alert Targets | Workspace → Monitoring → Alerts → Alert targets | Create target | Clickable row to Edit | Edit, More (Enable/Disable, Delete) | More (group rendered; no bulk mutations in v1) | Create target | None | Save, Cancel | Yes | Delete requires confirmation; secrets never shown/logged |
| Alert Rules | Workspace → Monitoring → Alerts → Alert rules | Create rule | Clickable row to Edit | Edit, More (Enable/Disable, Delete) | More (group rendered; no bulk mutations in v1) | Create rule | None | Save, Cancel | Yes | Enable/Disable and Delete are audited; both require confirmation |
| Alert Deliveries (read-only) | Workspace → Monitoring → Alerts → Alert deliveries | None | Clickable row to View | View | None | None | None | N/A | No | Read-only viewer; tenant entitlement filtering enforced |

### Key Entities *(include if feature involves data)*

- **Alert Destination**: A workspace-defined place to send notifications (Teams or email), which can be enabled/disabled. Labelled “Alert target” in the UI.
- **Alert Rule**: A workspace-defined routing rule that decides which events should generate alerts and which destinations they should notify.
- **Alert Event**: A notable system occurrence (e.g., high drift, compare failure, SLA due) that may generate alerts.
- **Event Fingerprint**: A stable, deterministic identifier used to deduplicate repeated events per rule.
- **Alert Delivery**: A record of a planned or attempted notification send, including scheduling (quiet hours), status, and timestamps.
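
One way to build a deterministic, secret-free event fingerprint is to hash a canonical encoding of the event's identifying fields. The field choice below (`event_type`, `tenant_id`, `subject_id`) and the use of SHA-256 are assumptions for illustration; the spec only requires that the fingerprint be stable and contain no secrets.

```python
import hashlib
import json


def event_fingerprint(event_type: str, tenant_id: str, subject_id: str) -> str:
    """Return a stable hex digest identifying "the same event".

    subject_id is a hypothetical identifier for the underlying finding or run.
    No destination data (webhook URLs, recipients) is included.
    """
    payload = json.dumps(
        {"event_type": event_type, "tenant_id": tenant_id, "subject_id": subject_id},
        sort_keys=True,          # canonical key order keeps the digest stable
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The rule is deliberately left out of the hash: deduplication is per rule, so the rule identifier can be combined with the fingerprint at lookup time.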

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001 (Setup time)**: A workspace manager can create a destination and a rule and enable it in under 5 minutes.
- **SC-002 (Delivery timeliness)**: Outside quiet hours, at least 95% of eligible alerts are delivered within 2 minutes of the triggering event.
- **SC-003 (Noise control)**: Within a configured cooldown window, the same fingerprint does not generate more than one notification per rule.
- **SC-004 (Security hygiene)**: No destination secrets appear in application logs or audit payloads during normal operation or error cases.
- **SC-005 (Audit traceability)**: 100% of rule/destination create/update/enable/disable/delete actions are traceable via an audit record.