TenantAtlas/specs/111-findings-workflow-sla/spec.md

# Feature Specification: Findings Workflow V2 + SLA

**Feature Branch**: `111-findings-workflow-sla`
**Created**: 2026-02-24
**Status**: Draft
**Depends On**: `specs/104-provider-permission-posture/spec.md`, `specs/105-entra-admin-roles-evidence-findings/spec.md`, `specs/109-review-pack-export/spec.md`
**Input**: Standardize the Findings lifecycle (workflow, ownership, recurrence, SLA due dates, and alerting) so findings management is enterprise-usable and not “noise”.

## Clarifications

### Session 2026-02-24

- Q: What should happen when the same finding is detected again, but its current status is terminal? → A: Auto-reopen only from `resolved`; `closed` and `risk_accepted` remain terminal (still update seen tracking fields).
- Q: When backfilling legacy open findings, how should the initial due date be set? → A: Compute from the backfill operation time (backfill time + SLA days).
- Q: When SLA due alerts fire, what should a single alert event represent? → A: At most one event per tenant per alert-evaluation window, emitted only when newly-overdue open findings exist; the event summarizes current overdue counts.
- Q: Which statuses should count as “Open” for the default Findings list and for SLA due evaluation? → A: Open = `new`, `triaged`, `in_progress`, `reopened`.
- Q: From which statuses should a user be able to manually “Reopen” a finding (into `reopened` status)? → A: Allow manual reopen from `resolved`, `closed`, and `risk_accepted`.
- Q: Where is the SLA policy configured, and what scope does it apply to? → A: Workspace-scoped setting (`findings.sla_days`) in Workspace Settings; applies to all tenants in the workspace.
- Q: How is the “alert-evaluation window” defined for SLA due gating? → A: Use the Alerts evaluation window start time (previous completed `alerts.evaluate` OperationRun `completed_at`; fallback to initial lookback). “Newly overdue” means `due_at` in `(window_start, now]` for open findings.
- Q: What must an `sla_due` event contain? → A: One event per tenant per evaluation window; `metadata` includes `overdue_total` and `overdue_by_severity` (critical/high/medium/low) for currently overdue open findings; fingerprint is stable per tenant+window.
- Q: If severity changes while a finding remains open, should `due_at` be recalculated? → A: No — `due_at` is set on create and reset only on reopen/backfill.
- Q: If a user resolves a finding while a detection run is processing, how is consistency maintained? → A: Detection updates may still advance seen counters, but automatic reopen MUST occur only when the observation time is after `resolved_at`.

## Spec Scope Fields *(mandatory)*

- **Scope**: tenant (Findings management) + workspace (SLA policy + Alert rules configuration)
- **Primary Routes**:
  - Tenant-context: Findings list + view (`/admin/t/{tenant}/...`)
  - Workspace-context Monitoring: Alert rules list + edit (`/admin/...`)
  - Workspace-context Settings: Workspace Settings (Findings SLA policy) (`/admin/...`)
- **Data Ownership**:
  - Tenant-owned: Findings and their lifecycle metadata
  - Workspace-owned: SLA policy settings (`findings.sla_days`)
  - Workspace-owned: Alert rules configuration (event types)
- **RBAC**:
  - Findings view + workflow actions are tenant-context capability-gated
  - Workspace Settings + Alert rules remain workspace capability/policy-gated (existing behavior)

*Canonical-view fields not applicable — this spec updates tenant-context Findings and workspace-scoped Alert Rules.*

## User Scenarios & Testing *(mandatory)*

### User Story 1 - See Open Findings (Priority: P1)

As a tenant operator, I can open the Findings page and immediately see the current open findings across all finding types, so I don’t miss non-drift issues and can focus on what needs attention now.

**Why this priority**: If open findings are hidden by default filters or type assumptions, findings become unreliable as an operational surface.

**Independent Test**: Seed a tenant with findings across multiple types and statuses, then verify the default list shows open workflow statuses across all types without adjusting filters.

**Acceptance Scenarios**:

1. **Given** a tenant has findings of types drift, permission posture, and Entra admin roles, **When** I open the Findings list, **Then** I can see open findings from all types without changing any filters.
2. **Given** a tenant has a mix of open and terminal findings, **When** I open the Findings list, **Then** the default list shows only open workflow statuses.
3. **Given** a tenant has overdue findings, **When** I use the “Overdue” quick filter, **Then** only findings past their due date are shown.
4. **Given** a tenant has open findings, **When** I view the list, **Then** I can see each finding’s status, severity, due date, and assignee (when set).
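
The default list and the “Overdue” quick filter can be sketched as follows. This is a minimal in-memory illustration, not the Filament query: the `Finding` record shape, the `finding_type` values, and the function names are hypothetical, and the sketch assumes that “Overdue” is applied on top of the open-status set and treats `due_at <= now` as past due.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Open workflow statuses as defined in the Clarifications section.
OPEN_STATUSES = frozenset({"new", "triaged", "in_progress", "reopened"})


@dataclass
class Finding:
    # Hypothetical record shape; only the fields needed for filtering.
    id: int
    finding_type: str
    status: str
    severity: str
    due_at: Optional[datetime] = None
    assignee_id: Optional[int] = None


def default_list(findings):
    """Default view: open statuses only, with no filter on finding type."""
    return [f for f in findings if f.status in OPEN_STATUSES]


def overdue(findings, now):
    """'Overdue' quick filter (assumed to apply to open findings only)."""
    return [
        f for f in default_list(findings)
        if f.due_at is not None and f.due_at <= now
    ]
```

Keeping the type out of the default predicate is what makes drift, permission posture, and Entra admin roles findings appear side by side without filter changes.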

---

### User Story 2 - Triage, Assign, And Resolve (Priority: P1)

As a tenant manager, I can triage findings, assign ownership, and move findings through a consistent workflow (including reasons and auditability), so the team can reliably manage remediation.

**Why this priority**: Without a consistent workflow and ownership, findings degrade into noisy, un-actioned rows with unclear accountability.

**Independent Test**: Create an open finding, execute each allowed status transition, and verify transitions are enforced server-side, recorded with timestamps/actors, and audited.

**Acceptance Scenarios**:

1. **Given** a finding in `new` (or `reopened`) status, **When** I triage it, **Then** the status becomes `triaged` and the triage timestamp is recorded.
2. **Given** a finding in `triaged` status, **When** I start progress, **Then** the status becomes `in_progress` and the progress timestamp is recorded.
3. **Given** a finding in an open status, **When** I assign an assignee (and optional owner), **Then** those fields are saved and displayed on the finding.
4. **Given** a finding in an open status, **When** I resolve it with a resolution reason, **Then** it becomes `resolved` and the resolution reason is persisted.
5. **Given** a finding in any status, **When** I close it with a close reason, **Then** it becomes `closed` and the close reason is persisted.
6. **Given** a finding in any status, **When** I mark it as risk accepted with a reason, **Then** it becomes `risk_accepted` and the reason is persisted.
7. **Given** a user without the relevant capability, **When** they attempt any workflow mutation, **Then** the server denies it (403 for members lacking capability; 404 for non-members / not entitled).
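
The transitions exercised by these scenarios can be expressed as a small lookup table. The sketch below is one possible server-side guard, assuming a literal reading of “any status” for close and risk accept (so a no-op such as `closed` to `closed` is not rejected here); `transition` and `TransitionError` are illustrative names, not part of the spec.

```python
ALL_STATUSES = (
    "new", "triaged", "in_progress", "reopened",
    "resolved", "closed", "risk_accepted",
)

# Target status -> set of statuses it may be entered from.
ALLOWED_SOURCES = {
    "triaged": {"new", "reopened"},
    "in_progress": {"triaged"},
    "resolved": {"new", "reopened", "triaged", "in_progress"},
    "reopened": {"resolved", "closed", "risk_accepted"},
    "closed": set(ALL_STATUSES),         # literal reading of "any status"
    "risk_accepted": set(ALL_STATUSES),  # literal reading of "any status"
}

# Targets that must carry a non-empty reason.
REASON_REQUIRED = {"resolved", "closed", "risk_accepted"}


class TransitionError(ValueError):
    """Raised when a requested status change is not allowed."""


def transition(current, target, reason=None):
    """Return the new status, or raise TransitionError if the move is invalid."""
    if target not in ALLOWED_SOURCES:
        raise TransitionError(f"{target!r} cannot be entered by a workflow action")
    if current not in ALLOWED_SOURCES[target]:
        raise TransitionError(f"{current!r} -> {target!r} is not allowed")
    if target in REASON_REQUIRED and not (reason and reason.strip()):
        raise TransitionError(f"a reason is required to move to {target!r}")
    return target
```

Keying the table by target keeps the reason requirement and the source check in one place, so single-record and bulk actions can share the same guard.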

---

### User Story 3 - SLA Due Visibility And Alerts (Priority: P1)

As a workspace operator, I can configure alerting for findings that are past their due date (SLA due), so overdue findings reliably escalate beyond the Findings page.

**Why this priority**: An SLA without alerting becomes “best effort” and is easy to ignore in busy operations.

**Independent Test**: Create newly-overdue open findings for a tenant, run alert evaluation, and verify a single tenant-level SLA due event is produced and can match an enabled alert rule.

**Acceptance Scenarios**:

1. **Given** a tenant has one or more newly-overdue open findings since the previous evaluation window, **When** alert evaluation runs, **Then** exactly one SLA due event is produced for that tenant and can trigger an enabled alert rule.
2. **Given** a tenant has no overdue open findings (including when only terminal findings have past due dates), **When** alert evaluation runs, **Then** no SLA due event is produced for that tenant.
3. **Given** I edit an alert rule, **When** I choose the event type, **Then** “SLA due” is available as a selectable event type.
4. **Given** a tenant has overdue open findings but no newly-overdue open findings since the previous evaluation window, **When** alert evaluation runs, **Then** no additional SLA due event is produced for that tenant.
5. **Given** an SLA due event is produced, **When** I inspect the event payload, **Then** it includes overdue counts total and by severity.
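
A minimal sketch of the producer behaviour described in these scenarios is shown below. Findings are plain dicts for illustration, `build_sla_due_event` is a hypothetical name, and the fingerprint string format is an assumption; the spec only requires that the fingerprint be stable per tenant and evaluation window.

```python
from collections import Counter
from datetime import datetime, timezone

OPEN_STATUSES = frozenset({"new", "triaged", "in_progress", "reopened"})
SEVERITIES = ("critical", "high", "medium", "low")


def build_sla_due_event(tenant_id, findings, window_start, now):
    """Return one summary event for the tenant, or None if nothing is newly overdue."""
    open_with_due = [
        f for f in findings
        if f["status"] in OPEN_STATUSES and f.get("due_at") is not None
    ]
    # "Newly overdue": due_at falls in (window_start, now].
    newly_overdue = [f for f in open_with_due if window_start < f["due_at"] <= now]
    if not newly_overdue:
        return None

    # The event summarizes everything currently overdue, not only the new items.
    currently_overdue = [f for f in open_with_due if f["due_at"] <= now]
    counts = Counter(f["severity"] for f in currently_overdue)
    return {
        "event_type": "sla_due",
        "tenant_id": tenant_id,
        # Assumed format; stable for a given tenant + window start.
        "fingerprint_key": f"sla_due:{tenant_id}:{window_start.isoformat()}",
        "metadata": {
            "overdue_total": len(currently_overdue),
            "overdue_by_severity": {s: counts.get(s, 0) for s in SEVERITIES},
        },
    }
```

Gating on the newly-overdue set while counting the currently-overdue set is what prevents a persistently overdue tenant from alerting on every run, yet still reports the full backlog when an alert does fire.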

---

### User Story 4 - Recurrence Reopens (Priority: P2)

As a tenant operator, when a previously resolved finding reappears in later detection runs, it reopens the original finding (instead of creating a new duplicate), so recurrence is visible and manageable.

**Why this priority**: Recurrence is operationally important, and duplicate rows create confusion and reporting noise.

**Independent Test**: Simulate a finding being resolved and then being detected again, verifying it transitions to `reopened`, counters update, and due date resets.

**Acceptance Scenarios**:

1. **Given** a finding was `resolved`, **When** it is detected again, **Then** the same finding transitions to `reopened` and records a reopened timestamp.
2. **Given** a finding is detected in successive runs, **When** it appears again, **Then** the last-seen timestamp updates and the seen counter increases.
3. **Given** a drift finding is no longer detected in the latest run, **When** stale detection is evaluated, **Then** the drift finding is auto-resolved with reason “no longer detected”.
4. **Given** a finding is `closed` or `risk_accepted`, **When** it is detected again, **Then** it remains terminal and only its seen tracking fields update.
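
Scenarios 1, 2, and 4 can be sketched as a single observation handler. The field names `times_seen` and `reopened_at` are hypothetical, the SLA policy is passed in as a severity-to-days mapping, and the sketch assumes the new due date is counted from the observation time; stale-drift auto-resolution (scenario 3) is not covered here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Finding:
    # Hypothetical record shape for illustration.
    status: str
    severity: str
    last_seen_at: datetime
    times_seen: int = 1
    resolved_at: Optional[datetime] = None
    reopened_at: Optional[datetime] = None
    due_at: Optional[datetime] = None


def record_observation(finding, observed_at, sla_days):
    """Apply one detection-run observation to an existing finding."""
    # Seen tracking always advances, including for terminal findings.
    finding.times_seen += 1
    # Assumption: never move last_seen_at backwards on out-of-order observations.
    finding.last_seen_at = max(finding.last_seen_at, observed_at)

    # Auto-reopen only from `resolved`, and only if observed after resolution.
    if (
        finding.status == "resolved"
        and finding.resolved_at is not None
        and observed_at > finding.resolved_at
    ):
        finding.status = "reopened"
        finding.reopened_at = observed_at
        # Assumption: the reset due date is anchored at the observation time.
        finding.due_at = observed_at + timedelta(days=sla_days[finding.severity])
    return finding
```

The `observed_at > resolved_at` comparison is the guard for the concurrent case: an observation recorded before the user resolved the finding advances the counters but leaves the status alone.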

---

### User Story 5 - Bulk Manage Findings (Priority: P3)

As a tenant manager, I can triage/assign/resolve/close/risk accept findings in bulk, so I can manage high volumes efficiently while preserving auditability and safety.

**Why this priority**: Bulk workflow reduces operational load, but can ship after the single-record workflow is correct.

**Independent Test**: Select multiple findings and run each bulk action, verifying that all selected findings update consistently and each change is audited.

**Acceptance Scenarios**:

1. **Given** I select multiple open findings, **When** I bulk triage them, **Then** all selected findings become `triaged`.
2. **Given** I select multiple open findings, **When** I bulk assign an assignee, **Then** all selected findings are assigned.
3. **Given** I select multiple open findings, **When** I bulk resolve them with a reason, **Then** all selected findings become `resolved` and record the reason.
4. **Given** I select multiple open findings, **When** I bulk close them with a reason, **Then** all selected findings become `closed` and record the close reason.
5. **Given** I select multiple open findings, **When** I bulk risk accept them with a reason, **Then** all selected findings become `risk_accepted` and record the reason.
6. **Given** more than 100 open findings match my current filters, **When** I run “Triage all matching”, **Then** the action requires typed confirmation, updates all matching findings safely, and audits each change.
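
The “Triage all matching” safeguard in scenario 6 might look like the sketch below. Several details are assumptions that the spec does not state: the confirmation phrase `TRIAGE`, the decision to skip matching findings that are not in `new` or `reopened` rather than fail the batch, and the shape of the audit rows.

```python
from datetime import datetime, timezone

TRIAGEABLE_STATUSES = frozenset({"new", "reopened"})
TYPED_CONFIRMATION_THRESHOLD = 100
CONFIRMATION_PHRASE = "TRIAGE"  # hypothetical phrase the user must type


class ConfirmationRequired(Exception):
    """Raised when a large batch is attempted without typed confirmation."""


def triage_all_matching(matching, actor_id, now, typed_confirmation=None):
    """Triage every eligible finding in `matching`; return one audit row per change."""
    matching = list(matching)
    if (
        len(matching) > TYPED_CONFIRMATION_THRESHOLD
        and typed_confirmation != CONFIRMATION_PHRASE
    ):
        # Nothing is mutated before the confirmation check passes.
        raise ConfirmationRequired(
            f"{len(matching)} findings match; typed confirmation is required"
        )

    audit_rows = []
    for finding in matching:
        if finding["status"] not in TRIAGEABLE_STATUSES:
            continue  # assumption: ineligible findings are skipped, not errors
        before = finding["status"]
        finding["status"] = "triaged"
        finding["triaged_at"] = now
        audit_rows.append({
            "actor_id": actor_id,
            "finding_id": finding["id"],
            "before": {"status": before},
            "after": {"status": "triaged"},
            "timestamp": now.isoformat(),
        })
    return audit_rows
```

Returning one audit row per changed finding, rather than a single row for the batch, matches the requirement that each change is audited.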

---

### User Story 6 - Backfill Existing Findings (Priority: P2)

As a tenant operator, I can run a one-time backfill/consolidation operation to upgrade existing findings into the v2 workflow model, so older data is usable (due dates, counters, recurrence) without manual cleanup.

**Why this priority**: Without backfill, existing tenants keep legacy/incomplete findings and the new workflow appears inconsistent or broken.

**Independent Test**: Seed legacy findings (missing lifecycle fields, `acknowledged` status, drift duplicates), run the backfill operation, and verify fields are populated, statuses are mapped, and duplicates are consolidated.

**Acceptance Scenarios**:

1. **Given** legacy open findings exist without due dates or lifecycle timestamps, **When** I run the backfill operation, **Then** open findings receive due dates set to the backfill operation time plus the SLA days for their severity, and lifecycle metadata is populated.
2. **Given** legacy findings in `acknowledged` status exist, **When** I run the backfill operation, **Then** they appear as `triaged` in the v2 workflow surface.
3. **Given** duplicate drift findings exist for the same recurring issue, **When** I run the backfill operation, **Then** duplicates are consolidated so only one canonical open finding remains.
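
The three backfill outcomes above can be sketched in one pass over the legacy rows. This is illustrative only: `recurrence_key` is a hypothetical field name, the lowest `id` is assumed to be the canonical record, and duplicates are assumed to become `closed` with `close_reason = consolidated_duplicate` (the spec requires only “terminal with a consistent reason”). Populating lifecycle timestamps and seen counters is omitted.

```python
from datetime import datetime, timedelta, timezone

OPEN_STATUSES = frozenset({"new", "triaged", "in_progress", "reopened"})


def backfill(findings, backfill_time, sla_days):
    """Upgrade legacy finding dicts in place and return them."""
    canonical_by_key = {}
    # Assumption: the lowest id is treated as the canonical record.
    for finding in sorted(findings, key=lambda f: f["id"]):
        # 1. Map the legacy status onto the v2 workflow.
        if finding["status"] == "acknowledged":
            finding["status"] = "triaged"
        if finding["status"] not in OPEN_STATUSES:
            continue

        # 2. Consolidate open duplicates that share a recurrence identity.
        key = finding.get("recurrence_key")
        if key is not None:
            if key in canonical_by_key:
                finding["status"] = "closed"  # assumed terminal status
                finding["close_reason"] = "consolidated_duplicate"
                continue
            canonical_by_key[key] = finding["id"]

        # 3. Due date counts from the backfill time, not the original detection.
        if finding.get("due_at") is None:
            finding["due_at"] = backfill_time + timedelta(
                days=sla_days[finding["severity"]]
            )
    return findings
```

Anchoring the due date at the backfill time means a long-standing legacy finding starts its SLA clock at rollout instead of appearing overdue immediately.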

---

### Edge Cases

- Legacy findings exist without lifecycle timestamps or due dates (backfill required).
- A previously assigned/owned user is no longer a tenant member (retain historical assignment, but prevent selecting non-members for new assignments).
- A finding’s severity changes while it remains open (`due_at` is not recalculated; see FR-016).
- An SLA due alert rule exists from earlier versions (should begin working once the producer exists; no data loss).
- Concurrent actions: a user resolves a finding while a detection run marks it seen again (system remains consistent and auditable).

## Requirements *(mandatory)*

### Governance And Safety Requirements

- This feature introduces no new external API calls.
- All user-initiated workflow mutations (triage/assign/resolve/close/risk accept/reopen) MUST be audited with actor, tenant, action, target, before/after, and timestamp.
  - Audit before/after MUST be limited to workflow/assignment metadata (e.g., `status`, `severity`, `due_at`, `assignee_id`, `owner_id`, `triaged_at`, `in_progress_at`, `resolved_at`, `closed_at`, `resolution_reason`, `close_reason`, `risk_accepted_reason`) and MUST NOT include raw evidence payloads or secrets/tokens.
- The lifecycle backfill/consolidation operation MUST be observable as an operation with:
  - clear start feedback (accepted/queued),
  - progress visibility while running, and
  - a single terminal outcome notification for the initiator.
- Authorization MUST be enforced server-side for every mutation with deny-as-not-found semantics:
  - non-members or users not entitled to the tenant scope → 404
  - members missing capability → 403
- Destructive-like actions (resolve/close/risk accept) MUST require explicit confirmation.
- Findings status badge semantics MUST remain centralized and cover every allowed status.
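
The authorization and audit rules above can be sketched as two small helpers. The function names, the `200` success value, and the audit entry layout are hypothetical; the allowlist of fields is taken from the audit requirement in this section.

```python
from datetime import datetime, timezone

# Workflow/assignment metadata permitted in audit before/after snapshots.
AUDITABLE_FIELDS = (
    "status", "severity", "due_at", "assignee_id", "owner_id",
    "triaged_at", "in_progress_at", "resolved_at", "closed_at",
    "resolution_reason", "close_reason", "risk_accepted_reason",
)


def authorization_status(is_tenant_member, capabilities, required_capability):
    """Return the HTTP status a mutation attempt should produce."""
    if not is_tenant_member:
        return 404  # deny-as-not-found: do not reveal that the tenant exists
    if required_capability not in capabilities:
        return 403  # member, but missing the capability
    return 200


def audit_entry(actor_id, tenant_id, action, finding_id, before, after, at):
    """Build an audit record whose snapshots contain allowlisted fields only."""
    def allowlisted(snapshot):
        return {k: snapshot[k] for k in AUDITABLE_FIELDS if k in snapshot}

    return {
        "actor_id": actor_id,
        "tenant_id": tenant_id,
        "action": action,
        "target": {"type": "finding", "id": finding_id},
        "before": allowlisted(before),
        "after": allowlisted(after),
        "timestamp": at.isoformat(),
    }
```

Checking membership before capability is what yields 404 for non-members even when they guess a valid tenant, and filtering snapshots through an allowlist keeps evidence payloads out of the audit log by construction.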

### Functional Requirements

- **FR-001**: System MUST support a Findings lifecycle with statuses: `new`, `triaged`, `in_progress`, `reopened`, `resolved`, `closed`, `risk_accepted`.
- **FR-002**: System MUST enforce allowed status transitions server-side:
  - `new|reopened` → `triaged`
  - `triaged` → `in_progress`
  - `new|reopened|triaged|in_progress` → `resolved` (resolution reason required)
  - `resolved|closed|risk_accepted` → `reopened` (manual reopen is allowed from all three and requires confirmation; automatic reopen happens only from `resolved`, when the finding is detected again)
  - `*` → `closed` (close reason required)
  - `*` → `risk_accepted` (reason required)
- **FR-003**: Each finding MUST track lifecycle metadata: owner, assignee, first-seen time, last-seen time, seen count, and (when open) an SLA due date.
- **FR-004**: The system MUST assign an SLA due date to open findings using a configurable severity-based policy with defaults:
  - critical: 3 days
  - high: 7 days
  - medium: 14 days
  - low: 30 days
- **FR-005**: When a finding reopens (automatic or manual), the system MUST reset the SLA due date based on the current severity-based SLA policy.
- **FR-006**: SLA due alerting MUST exist:
  - “SLA due” MUST be available as an alert rule event type (`sla_due`).
  - The SLA due producer MUST use the same alert-evaluation window start time (`window_start`) used by Alerts evaluation (previous completed `alerts.evaluate` OperationRun `completed_at`; fallback to initial lookback).
  - “Newly overdue” means: an open finding with `due_at` in `(window_start, now]`.
  - The system MUST emit exactly one SLA due event per tenant per alert-evaluation window when that tenant has one or more newly-overdue open findings since `window_start`.
  - Each SLA due event MUST summarize current overdue open findings for the tenant and include:
    - `overdue_total` (count)
    - `overdue_by_severity` (`critical`, `high`, `medium`, `low`)
  - A tenant with persistently overdue open findings MUST NOT emit repeated SLA due events on every evaluation run unless additional findings become newly overdue.
  - Terminal statuses (`resolved`, `closed`, `risk_accepted`) MUST NOT contribute to the overdue counts.
  - Open workflow statuses are `new`, `triaged`, `in_progress`, `reopened`.
  - The event’s `fingerprint_key` MUST be stable per tenant + alert-evaluation window for idempotency.
- **FR-007**: The system MUST track recurrence:
  - When a previously `resolved` finding is detected again, it MUST transition to `reopened` (not create a duplicate open finding for the same recurring issue).
  - When a `closed` or `risk_accepted` finding is detected again, it MUST NOT change status automatically; it only updates seen tracking fields.
  - Each detection run where the finding is observed MUST update last-seen time and increment seen count.
  - Concurrency safety: automatic reopen MUST occur only when the observation time is after the finding’s `resolved_at`.
- **FR-008**: Drift findings MUST avoid “new row per re-drift” noise by using a stable recurrence identity so recurring drift reopens the canonical finding.
- **FR-009**: Drift findings MUST auto-resolve when they are no longer detected in the latest run, with a consistent resolved reason (e.g., “no longer detected”).
- **FR-010**: Findings list defaults MUST be safe and visible:
  - Default list shows open statuses (`new`, `triaged`, `in_progress`, `reopened`) across all finding types (no drift-only default).
  - Quick filters exist for: Open, Overdue, High severity, My assigned.
- **FR-011**: Findings UI MUST provide safe workflow actions:
  - Single-record actions: triage, start progress, assign (assignee and optional owner), resolve (reason required), close (reason required), risk accept (reason required), reopen (where allowed).
  - Bulk actions: bulk triage, bulk assign, bulk resolve, bulk close, bulk risk accept.
- **FR-012**: The system MUST introduce tenant-context capabilities for Findings management:
  - `TENANT_FINDINGS_VIEW`
  - `TENANT_FINDINGS_TRIAGE`
  - `TENANT_FINDINGS_ASSIGN`
  - `TENANT_FINDINGS_RESOLVE`
  - `TENANT_FINDINGS_CLOSE`
  - `TENANT_FINDINGS_RISK_ACCEPT`
- **FR-013**: Assignment/ownership selection MUST be limited to users who are currently tenant members, while preserving historical assignment/ownership values for already-assigned findings.
- **FR-014**: Legacy compatibility MUST be maintained:
  - Existing `acknowledged` status MUST be treated as `triaged` in the v2 workflow surface.
  - Existing `TENANT_FINDINGS_ACKNOWLEDGE` capability MUST act as a deprecated alias for v2 triage permission.
- **FR-015**: A backfill/consolidation operation MUST exist to migrate existing findings to the v2 lifecycle model, including:
  - mapping `acknowledged` → `triaged`
  - populating lifecycle timestamps and seen counters for existing data
  - setting due dates for legacy open findings based on the backfill operation time (backfill time + SLA days)
  - consolidating duplicates where recurrence identity indicates the same recurring finding (canonical record retained; duplicates marked terminal with a consistent reason, e.g. `consolidated_duplicate`)
- **FR-016**: Severity changes while a finding remains open MUST NOT retroactively change `due_at`. `due_at` is assigned on create and reset only on reopen/backfill.
- **FR-017**: Review pack generation MUST define “open findings” by the v2 open-status set (not drift-only defaults) so that existing exports/review packs stay consistent.
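
As an illustration of FR-004 and FR-016, the sketch below resolves the severity-based policy and computes a due date. It assumes the workspace setting `findings.sla_days` is stored as a severity-to-days mapping that may override some or all defaults; that shape, the validation rules, and the function names are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Defaults from FR-004.
DEFAULT_SLA_DAYS = {"critical": 3, "high": 7, "medium": 14, "low": 30}


def resolve_sla_days(workspace_setting=None):
    """Merge a workspace `findings.sla_days` override over the defaults."""
    policy = dict(DEFAULT_SLA_DAYS)
    for severity, days in (workspace_setting or {}).items():
        if severity not in DEFAULT_SLA_DAYS:
            raise ValueError(f"unknown severity: {severity!r}")
        if not isinstance(days, int) or days < 1:
            raise ValueError(f"SLA days for {severity!r} must be a positive integer")
        policy[severity] = days
    return policy


def due_at_for(severity, anchor, policy):
    """Due date = anchor time (create, reopen, or backfill) + SLA days."""
    return anchor + timedelta(days=policy[severity])
```

Because `due_at_for` is called only at create, reopen, and backfill, a severity change on an open finding leaves the stored `due_at` untouched, which is the behaviour FR-016 requires.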

## UI Action Matrix *(mandatory when Filament is changed)*

Action Surface Contract: Satisfied for Findings and Alert Rules (explicit exemptions noted).

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Findings Resource | Admin UI: Findings | Optional: “Triage all matching” (capability-gated) | View action | View, More | Bulk triage, bulk assign, bulk resolve, bulk close, bulk risk accept (under More) | None | Triage, Start progress, Assign, Resolve, Close, Risk accept, Reopen (where allowed) | N/A | Yes | Empty-state exemption: findings are system-generated; no create CTA |
| Alert Rules Resource | Monitoring UI: Alert rules | Create (capability/policy-gated) | Clickable row | Edit, More | None (exempt) | Create alert rule | N/A (edit surface) | Save/Cancel | Yes | “SLA due” event type is available once the producer exists |

### Key Entities *(include if feature involves data)*

- **Finding**: Represents a detected issue for a tenant, including type, severity, lifecycle status, recurrence behavior, and lifecycle metadata (ownership, due date, seen tracking).
- **SLA policy**: Severity-based due-date expectations applied to open findings, with configurable defaults.
- **Alert rule**: Workspace-defined routing rules that can trigger delivery when an SLA due event occurs.
- **Audit entry**: Immutable record of user-initiated workflow changes for accountability and compliance.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: 100% of open findings have a computed due date (SLA) at creation and after any reopen event.
- **SC-002**: Recurring findings reopen instead of creating duplicate open rows for the same recurring issue.
- **SC-003**: The default Findings list shows open findings across all finding types without requiring users to remove type-specific filters.
- **SC-004**: SLA due alerting is functional: tenants with newly-overdue open findings since the previous evaluation window can trigger alert rules and produce at most one SLA due event per tenant per evaluation window; terminal findings never contribute to SLA due alerts.
- **SC-005**: Authorization behavior is correct and non-enumerable: non-members receive 404; members missing capability receive 403.
- **SC-006**: Admins can triage/assign/resolve/close findings in bulk for at least 100 findings in a single action without needing per-record manual updates.

## Assumptions

- `risk_accepted` is a workflow status only in v2 (no expiry model in this feature).
- SLA due dates are set on create and reset only on reopen or backfill. Severity changes while a finding remains open do not change the existing due date.
- Backfill sets due dates for legacy open findings from the backfill operation time (backfill time + SLA days) to avoid an immediate “overdue” surge on rollout.
- Assignment/ownership pickers show only current tenant members, but historical assignments remain visible for audit/history even if membership is later removed.
- Existing alert rules with `event_type = sla_due` are preserved and should become effective once the SLA due producer is implemented (no destructive data migration of workspace-owned alert rules).