# Feature Specification: Findings Workflow V2 + SLA

**Feature Branch**: `111-findings-workflow-sla`
**Created**: 2026-02-24
**Status**: Draft
**Depends On**: `specs/104-provider-permission-posture/spec.md`, `specs/105-entra-admin-roles-evidence-findings/spec.md`, `specs/109-review-pack-export/spec.md`
**Input**: Standardize the Findings lifecycle (workflow, ownership, recurrence, SLA due dates, and alerting) so findings management is enterprise-usable and not “noise”.

## Clarifications

### Session 2026-02-24

- Q: What should happen when the same finding is detected again, but its current status is terminal? → A: Auto-reopen only from `resolved`; `closed` and `risk_accepted` remain terminal (still update seen tracking fields).
- Q: When backfilling legacy open findings, how should the initial due date be set? → A: Compute from the backfill operation time (backfill time + SLA days).
- Q: When SLA due alerts fire, what should a single alert event represent? → A: At most one event per tenant per alert-evaluation window, emitted only when newly-overdue open findings exist; the event summarizes current overdue counts.
- Q: Which statuses should count as “Open” for the default Findings list and for SLA due evaluation? → A: Open = `new`, `triaged`, `in_progress`, `reopened`.
- Q: From which statuses should a user be able to manually “Reopen” a finding (into `reopened` status)? → A: Allow manual reopen from `resolved`, `closed`, and `risk_accepted`.
- Q: Where is the SLA policy configured, and what scope does it apply to? → A: Workspace-scoped setting (`findings.sla_days`) in Workspace Settings; applies to all tenants in the workspace.
- Q: How is the “alert-evaluation window” defined for SLA due gating? → A: Use the Alerts evaluation window start time (previous completed `alerts.evaluate` OperationRun `completed_at`; fallback to initial lookback). “Newly overdue” means `due_at` in `(window_start, now]` for open findings.
- Q: What must an `sla_due` event contain? → A: One event per tenant per evaluation window; `metadata` includes `overdue_total` and `overdue_by_severity` (critical/high/medium/low) for currently overdue open findings; fingerprint is stable per tenant+window.
- Q: If severity changes while a finding remains open, should `due_at` be recalculated? → A: No; `due_at` is set on create and reset only on reopen/backfill.
- Q: If a user resolves a finding while a detection run is processing, how is consistency maintained? → A: Detection updates may still advance seen counters, but automatic reopen MUST occur only when the observation time is after `resolved_at`.

## Spec Scope Fields *(mandatory)*

- **Scope**: tenant (Findings management) + workspace (SLA policy + Alert rules configuration)
- **Primary Routes**:
  - Tenant-context: Findings list + view (`/admin/t/{tenant}/...`)
  - Workspace-context Monitoring: Alert rules list + edit (`/admin/...`)
  - Workspace-context Settings: Workspace Settings (Findings SLA policy) (`/admin/...`)
- **Data Ownership**:
  - Tenant-owned: Findings and their lifecycle metadata
  - Workspace-owned: SLA policy settings (`findings.sla_days`)
  - Workspace-owned: Alert rules configuration (event types)
- **RBAC**:
  - Findings view + workflow actions are tenant-context capability-gated
  - Workspace Settings + Alert rules remain workspace capability/policy-gated (existing behavior)

*Canonical-view fields not applicable: this spec updates tenant-context Findings and workspace-scoped Alert Rules.*

## User Scenarios & Testing *(mandatory)*

### User Story 1 - See Open Findings (Priority: P1)

As a tenant operator, I can open the Findings page and immediately see the current open findings across all finding types, so I don’t miss non-drift issues and can focus on what needs attention now.

**Why this priority**: If open findings are hidden by default filters or type assumptions, findings become unreliable as an operational surface.
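
As an illustration of the default-visibility rule this story depends on, here is a minimal sketch of the list filters. It assumes an in-memory `Finding` record; the type keys (`"drift"`, `"permission_posture"`, `"entra_admin_roles"`) and the function names are hypothetical, and treating “Overdue” as a refinement of the open filter is an assumption rather than something the spec states.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional

# Open workflow statuses as defined in the Clarifications section.
OPEN_STATUSES = frozenset({"new", "triaged", "in_progress", "reopened"})


@dataclass(frozen=True)
class Finding:
    id: int
    type: str  # hypothetical type keys, e.g. "drift", "permission_posture"
    status: str
    due_at: Optional[datetime] = None


def default_list(findings: Iterable[Finding]) -> List[Finding]:
    """Default view: every open finding, with no restriction on finding type."""
    return [f for f in findings if f.status in OPEN_STATUSES]


def overdue(findings: Iterable[Finding], now: datetime) -> List[Finding]:
    """'Overdue' quick filter (assumed to apply on top of the open filter)."""
    return [
        f for f in default_list(findings)
        if f.due_at is not None and f.due_at <= now
    ]


now = datetime(2026, 3, 1, tzinfo=timezone.utc)
sample = [
    Finding(1, "drift", "new", now - timedelta(days=1)),
    Finding(2, "permission_posture", "triaged", now + timedelta(days=5)),
    Finding(3, "entra_admin_roles", "resolved", now - timedelta(days=2)),
]
print([f.id for f in default_list(sample)])  # [1, 2]
print([f.id for f in overdue(sample, now)])  # [1]
```

The resolved finding is excluded from both views even though its due date has passed, which matches the rule that terminal findings never count as overdue.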

**Independent Test**: Seed a tenant with findings across multiple types and statuses, then verify the default list shows open workflow statuses across all types without adjusting filters.

**Acceptance Scenarios**:

1. **Given** a tenant has findings of types drift, permission posture, and Entra admin roles, **When** I open the Findings list, **Then** I can see open findings from all types without changing any filters.
2. **Given** a tenant has a mix of open and terminal findings, **When** I open the Findings list, **Then** the default list shows only open workflow statuses.
3. **Given** a tenant has overdue findings, **When** I use the “Overdue” quick filter, **Then** only findings past their due date are shown.
4. **Given** a tenant has open findings, **When** I view the list, **Then** I can see each finding’s status, severity, due date, and assignee (when set).

---

### User Story 2 - Triage, Assign, And Resolve (Priority: P1)

As a tenant manager, I can triage findings, assign ownership, and move findings through a consistent workflow (including reasons and auditability), so the team can reliably manage remediation.

**Why this priority**: Without a consistent workflow and ownership, findings degrade into noisy, un-actioned rows with unclear accountability.

**Independent Test**: Create an open finding, execute each allowed status transition, and verify transitions are enforced server-side, recorded with timestamps/actors, and audited.

**Acceptance Scenarios**:

1. **Given** a finding in `new` (or `reopened`) status, **When** I triage it, **Then** the status becomes `triaged` and the triage timestamp is recorded.
2. **Given** a finding in `triaged` status, **When** I start progress, **Then** the status becomes `in_progress` and the progress timestamp is recorded.
3. **Given** a finding in an open status, **When** I assign an assignee (and optional owner), **Then** those fields are saved and displayed on the finding.
4. **Given** a finding in an open status, **When** I resolve it with a resolution reason, **Then** it becomes `resolved` and the resolution reason is persisted.
5. **Given** a finding in any status, **When** I close it with a close reason, **Then** it becomes `closed` and the close reason is persisted.
6. **Given** a finding in any status, **When** I mark it as risk accepted with a reason, **Then** it becomes `risk_accepted` and the reason is persisted.
7. **Given** a user without the relevant capability, **When** they attempt any workflow mutation, **Then** the server denies it (403 for members lacking capability; 404 for non-members / not entitled).

---

### User Story 3 - SLA Due Visibility And Alerts (Priority: P1)

As a workspace operator, I can configure alerting for findings that are past their due date (SLA due), so overdue findings reliably escalate beyond the Findings page.

**Why this priority**: An SLA without alerting becomes “best effort” and is easy to ignore in busy operations.

**Independent Test**: Create newly-overdue open findings for a tenant, run alert evaluation, and verify a single tenant-level SLA due event is produced and can match an enabled alert rule.

**Acceptance Scenarios**:

1. **Given** a tenant has one or more newly-overdue open findings since the previous evaluation window, **When** alert evaluation runs, **Then** exactly one SLA due event is produced for that tenant and can trigger an enabled alert rule.
2. **Given** a tenant has no overdue open findings (including when only terminal findings have past due dates), **When** alert evaluation runs, **Then** no SLA due event is produced for that tenant.
3. **Given** I edit an alert rule, **When** I choose the event type, **Then** “SLA due” is available as a selectable event type.
4. **Given** a tenant has overdue open findings but no newly-overdue open findings since the previous evaluation window, **When** alert evaluation runs, **Then** no additional SLA due event is produced for that tenant.
5. **Given** an SLA due event is produced, **When** I inspect the event payload, **Then** it includes overdue counts total and by severity.

---

### User Story 4 - Recurrence Reopens (Priority: P2)

As a tenant operator, when a previously resolved finding reappears in later detection runs, it reopens the original finding (instead of creating a new duplicate), so recurrence is visible and manageable.

**Why this priority**: Recurrence is operationally important, and duplicate rows create confusion and reporting noise.

**Independent Test**: Simulate a finding being resolved and then being detected again, verifying it transitions to `reopened`, counters update, and due date resets.

**Acceptance Scenarios**:

1. **Given** a finding was `resolved`, **When** it is detected again, **Then** the same finding transitions to `reopened` and records a reopened timestamp.
2. **Given** a finding is detected in successive runs, **When** it appears again, **Then** the last-seen timestamp updates and the seen counter increases.
3. **Given** a drift finding is no longer detected in the latest run, **When** stale detection is evaluated, **Then** the drift finding is auto-resolved with reason “no longer detected”.
4. **Given** a finding is `closed` or `risk_accepted`, **When** it is detected again, **Then** it remains terminal and only its seen tracking fields update.

---

### User Story 5 - Bulk Manage Findings (Priority: P3)

As a tenant manager, I can triage/assign/resolve/close findings in bulk, so I can manage high volumes efficiently while preserving auditability and safety.

**Why this priority**: Bulk workflow reduces operational load, but can ship after the single-record workflow is correct.

**Independent Test**: Select multiple findings and run each bulk action, verifying that all selected findings update consistently and each change is audited.

**Acceptance Scenarios**:

1. **Given** I select multiple open findings, **When** I bulk triage them, **Then** all selected findings become `triaged`.
2. **Given** I select multiple open findings, **When** I bulk assign an assignee, **Then** all selected findings are assigned.
3. **Given** I select multiple open findings, **When** I bulk resolve them with a reason, **Then** all selected findings become `resolved` and record the reason.
4. **Given** I select multiple open findings, **When** I bulk close them with a reason, **Then** all selected findings become `closed` and record the close reason.
5. **Given** I select multiple open findings, **When** I bulk risk accept them with a reason, **Then** all selected findings become `risk_accepted` and record the reason.
6. **Given** more than 100 open findings match my current filters, **When** I run “Triage all matching”, **Then** the action requires typed confirmation, updates all matching findings safely, and audits each change.

---

### User Story 6 - Backfill Existing Findings (Priority: P2)

As a tenant operator, I can run a one-time backfill/consolidation operation to upgrade existing findings into the v2 workflow model, so older data is usable (due dates, counters, recurrence) without manual cleanup.

**Why this priority**: Without backfill, existing tenants keep legacy/incomplete findings and the new workflow appears inconsistent or broken.

**Independent Test**: Seed legacy findings (missing lifecycle fields, `acknowledged` status, drift duplicates), run the backfill operation, and verify fields are populated, statuses are mapped, and duplicates are consolidated.

**Acceptance Scenarios**:

1. **Given** legacy open findings exist without due dates or lifecycle timestamps, **When** I run the backfill operation, **Then** open findings receive due dates set to the backfill operation time plus the SLA days for their severity, and lifecycle metadata is populated.
2. **Given** legacy findings in `acknowledged` status exist, **When** I run the backfill operation, **Then** they appear as `triaged` in the v2 workflow surface.
3. **Given** duplicate drift findings exist for the same recurring issue, **When** I run the backfill operation, **Then** duplicates are consolidated so only one canonical open finding remains.

---

### Edge Cases

- Legacy findings exist without lifecycle timestamps or due dates (backfill required).
- A previously assigned/owned user is no longer a tenant member (retain historical assignment, but prevent selecting non-members for new assignments).
- A finding’s severity changes while it remains open (assumption on due date recalculation documented below).
- An SLA due alert rule exists from earlier versions (should begin working once the producer exists; no data loss).
- Concurrent actions: a user resolves a finding while a detection run marks it seen again (system remains consistent and auditable).

## Requirements *(mandatory)*

### Governance And Safety Requirements

- This feature introduces no new external API calls.
- All user-initiated workflow mutations (triage/assign/resolve/close/risk accept/reopen) MUST be audited with actor, tenant, action, target, before/after, and timestamp.
- Audit before/after MUST be limited to workflow/assignment metadata (e.g., `status`, `severity`, `due_at`, `assignee_id`, `owner_id`, `triaged_at`, `in_progress_at`, `resolved_at`, `closed_at`, `resolution_reason`, `close_reason`, `risk_accepted_reason`) and MUST NOT include raw evidence payloads or secrets/tokens.
- The lifecycle backfill/consolidation operation MUST be observable as an operation with:
  - clear start feedback (accepted/queued),
  - progress visibility while running, and
  - a single terminal outcome notification for the initiator.
- Authorization MUST be enforced server-side for every mutation with deny-as-not-found semantics:
  - non-members or users not entitled to the tenant scope → 404
  - members missing capability → 403
- Destructive-like actions (resolve/close/risk accept) MUST require explicit confirmation.
- Findings status badge semantics MUST remain centralized and cover every allowed status.

### Functional Requirements

- **FR-001**: System MUST support a Findings lifecycle with statuses: `new`, `triaged`, `in_progress`, `reopened`, `resolved`, `closed`, `risk_accepted`.
- **FR-002**: System MUST enforce allowed status transitions server-side:
  - `new|reopened` → `triaged`
  - `triaged` → `in_progress`
  - `new|reopened|triaged|in_progress` → `resolved` (resolution reason required)
  - `resolved|closed|risk_accepted` → `reopened` (manual allowed; requires confirmation; automatic only when detected again from `resolved`)
  - `*` → `closed` (close reason required)
  - `*` → `risk_accepted` (reason required)
- **FR-003**: Each finding MUST track lifecycle metadata: owner, assignee, first-seen time, last-seen time, seen count, and (when open) an SLA due date.
- **FR-004**: The system MUST assign an SLA due date to open findings using a configurable severity-based policy with defaults:
  - critical: 3 days
  - high: 7 days
  - medium: 14 days
  - low: 30 days
- **FR-005**: When a finding reopens (automatic or manual), the system MUST reset the SLA due date based on the current severity-based SLA policy.
- **FR-006**: SLA due alerting MUST exist:
  - “SLA due” MUST be available as an alert rule event type (`sla_due`).
  - The SLA due producer MUST use the same alert-evaluation window start time (`window_start`) used by Alerts evaluation (previous completed `alerts.evaluate` OperationRun `completed_at`; fallback to initial lookback).
  - “Newly overdue” means: an open finding with `due_at` in `(window_start, now]`.
  - The system MUST emit exactly one SLA due event per tenant per alert-evaluation window when that tenant has one or more newly-overdue open findings since `window_start`.
  - Each SLA due event MUST summarize current overdue open findings for the tenant and include:
    - `overdue_total` (count)
    - `overdue_by_severity` (`critical`, `high`, `medium`, `low`)
  - A tenant with persistently overdue open findings MUST NOT emit repeated SLA due events on every evaluation run unless additional findings become newly overdue.
  - Terminal statuses (`resolved`, `closed`, `risk_accepted`) MUST NOT contribute to the overdue counts.
  - Open workflow statuses are `new`, `triaged`, `in_progress`, `reopened`.
  - The event’s `fingerprint_key` MUST be stable per tenant + alert-evaluation window for idempotency.
- **FR-007**: The system MUST track recurrence:
  - When a previously `resolved` finding is detected again, it MUST transition to `reopened` (not create a duplicate open finding for the same recurring issue).
  - When a `closed` or `risk_accepted` finding is detected again, it MUST NOT change status automatically; it only updates seen tracking fields.
  - Each detection run where the finding is observed MUST update last-seen time and increment seen count.
  - Concurrency safety: automatic reopen MUST occur only when the observation time is after the finding’s `resolved_at`.
- **FR-008**: Drift findings MUST avoid “new row per re-drift” noise by using a stable recurrence identity so recurring drift reopens the canonical finding.
- **FR-009**: Drift findings MUST auto-resolve when they are no longer detected in the latest run, with a consistent resolved reason (e.g., “no longer detected”).
- **FR-010**: Findings list defaults MUST be safe and visible:
  - Default list shows open statuses (`new`, `triaged`, `in_progress`, `reopened`) across all finding types (no drift-only default).
  - Quick filters exist for: Open, Overdue, High severity, My assigned.
- **FR-011**: Findings UI MUST provide safe workflow actions:
  - Single-record actions: triage, start progress, assign (assignee and optional owner), resolve (reason required), close (reason required), risk accept (reason required), reopen (where allowed).
  - Bulk actions: bulk triage, bulk assign, bulk resolve, bulk close, bulk risk accept.
- **FR-012**: The system MUST introduce tenant-context capabilities for Findings management:
  - `TENANT_FINDINGS_VIEW`
  - `TENANT_FINDINGS_TRIAGE`
  - `TENANT_FINDINGS_ASSIGN`
  - `TENANT_FINDINGS_RESOLVE`
  - `TENANT_FINDINGS_CLOSE`
  - `TENANT_FINDINGS_RISK_ACCEPT`
- **FR-013**: Assignment/ownership selection MUST be limited to users who are currently tenant members, while preserving historical assignment/ownership values for already-assigned findings.
- **FR-014**: Legacy compatibility MUST be maintained:
  - Existing `acknowledged` status MUST be treated as `triaged` in the v2 workflow surface.
  - Existing `TENANT_FINDINGS_ACKNOWLEDGE` capability MUST act as a deprecated alias for v2 triage permission.
- **FR-015**: A backfill/consolidation operation MUST exist to migrate existing findings to the v2 lifecycle model, including:
  - mapping `acknowledged` → `triaged`
  - populating lifecycle timestamps and seen counters for existing data
  - setting due dates for legacy open findings based on the backfill operation time (backfill time + SLA days)
  - consolidating duplicates where recurrence identity indicates the same recurring finding (canonical record retained; duplicates marked terminal with a consistent reason, e.g. `consolidated_duplicate`)
- **FR-016**: Severity changes while a finding remains open MUST NOT retroactively change `due_at`. `due_at` is assigned on create and reset only on reopen/backfill.
- **FR-017**: Review pack generation MUST treat “open findings” using the v2 open-status set (not drift-only defaults) to keep existing exports/review packs consistent.

## UI Action Matrix *(mandatory when Filament is changed)*

Action Surface Contract: Satisfied for Findings and Alert Rules (explicit exemptions noted).

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Findings Resource | Admin UI: Findings | Optional: “Triage all matching” (capability-gated) | View action | View, More | Bulk triage, bulk assign, bulk resolve, bulk close, bulk risk accept (under More) | None | Triage, Start progress, Assign, Resolve, Close, Risk accept, Reopen (where allowed) | N/A | Yes | Empty-state exemption: findings are system-generated; no create CTA |
| Alert Rules Resource | Monitoring UI: Alert rules | Create (capability/policy-gated) | Clickable row | Edit, More | None (exempt) | Create alert rule | N/A (edit surface) | Save/Cancel | Yes | “SLA due” event type is available once the producer exists |

### Key Entities *(include if feature involves data)*

- **Finding**: Represents a detected issue for a tenant, including type, severity, lifecycle status, recurrence behavior, and lifecycle metadata (ownership, due date, seen tracking).
- **SLA policy**: Severity-based due-date expectations applied to open findings, with configurable defaults.
- **Alert rule**: Workspace-defined routing rules that can trigger delivery when an SLA due event occurs.
- **Audit entry**: Immutable record of user-initiated workflow changes for accountability and compliance.
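
To make the Finding and SLA policy entities concrete, the following is a minimal sketch of the FR-002 transition rules, the FR-004/FR-005 due-date policy, and the FR-007 reopen guard. It is an illustration under stated assumptions, not the implementation: the `Finding` shape, the `transition`, `observe`, and `due_date` helpers, and the literal reading of `*` as “any status” are all hypothetical. Only the status names, reason field names, and default SLA days come from this spec.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

OPEN = frozenset({"new", "triaged", "in_progress", "reopened"})
TERMINAL = frozenset({"resolved", "closed", "risk_accepted"})

# FR-002, keyed by target status -> statuses it may be entered from.
# Assumption: "*" is read literally as "any status".
ALLOWED_FROM = {
    "triaged": frozenset({"new", "reopened"}),
    "in_progress": frozenset({"triaged"}),
    "resolved": OPEN,
    "reopened": TERMINAL,
    "closed": OPEN | TERMINAL,
    "risk_accepted": OPEN | TERMINAL,
}
# Targets that require a reason, and the field that stores it.
REASON_FIELD = {
    "resolved": "resolution_reason",
    "closed": "close_reason",
    "risk_accepted": "risk_accepted_reason",
}
DEFAULT_SLA_DAYS = {"critical": 3, "high": 7, "medium": 14, "low": 30}  # FR-004


@dataclass(frozen=True)
class Finding:  # hypothetical in-memory shape, not the persisted model
    status: str
    severity: str
    due_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    last_seen_at: Optional[datetime] = None
    seen_count: int = 1
    resolution_reason: Optional[str] = None
    close_reason: Optional[str] = None
    risk_accepted_reason: Optional[str] = None


def due_date(severity: str, start: datetime,
             sla_days: Dict[str, int] = DEFAULT_SLA_DAYS) -> datetime:
    """Due date = start time + SLA days for the severity."""
    return start + timedelta(days=sla_days[severity])


def transition(finding: Finding, target: str, now: datetime,
               reason: Optional[str] = None,
               sla_days: Dict[str, int] = DEFAULT_SLA_DAYS) -> Finding:
    """Apply one status change, rejecting anything FR-002 does not allow."""
    if finding.status not in ALLOWED_FROM.get(target, frozenset()):
        raise ValueError(f"transition {finding.status} -> {target} is not allowed")
    changes = {"status": target}
    if target in REASON_FIELD:
        if not reason:
            raise ValueError(f"{target} requires a reason")
        changes[REASON_FIELD[target]] = reason
    if target == "resolved":
        changes["resolved_at"] = now
    if target == "reopened":
        # FR-005: reopening resets the due date from the current policy.
        changes["due_at"] = due_date(finding.severity, now, sla_days)
    return replace(finding, **changes)


def observe(finding: Finding, observed_at: datetime,
            sla_days: Dict[str, int] = DEFAULT_SLA_DAYS) -> Finding:
    """Detection saw the finding again: always bump seen tracking (FR-007)."""
    seen = replace(finding, last_seen_at=observed_at,
                   seen_count=finding.seen_count + 1)
    # Auto-reopen only from `resolved`, and only for observations made
    # after the finding was resolved (concurrency guard).
    if (seen.status == "resolved" and seen.resolved_at is not None
            and observed_at > seen.resolved_at):
        return transition(seen, "reopened", observed_at, sla_days=sla_days)
    return seen
```

Because `observe` never touches `closed` or `risk_accepted` findings beyond their seen tracking, those statuses stay terminal until a user reopens them manually through `transition`.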

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: 100% of open findings have a computed due date (SLA) at creation and after any reopen event.
- **SC-002**: Recurring findings reopen instead of creating duplicate open rows for the same recurring issue.
- **SC-003**: The default Findings list shows open findings across all finding types without requiring users to remove type-specific filters.
- **SC-004**: SLA due alerting is functional: tenants with newly-overdue open findings since the previous evaluation window can trigger alert rules and produce at most one SLA due event per tenant per evaluation window; terminal findings never contribute to SLA due alerts.
- **SC-005**: Authorization behavior is correct and non-enumerable: non-members receive 404; members missing capability receive 403.
- **SC-006**: Admins can triage/assign/resolve/close findings in bulk for at least 100 findings in a single action without needing per-record manual updates.

## Assumptions

- `risk_accepted` is a workflow status only in v2 (no expiry model in this feature).
- SLA due dates are set on create and on reopen. Severity changes while a finding remains open do not retroactively change the existing due date unless the finding is reopened.
- Backfill sets due dates for legacy open findings from the backfill operation time (backfill time + SLA days) to avoid an immediate “overdue” surge on rollout.
- Assignment/ownership pickers show only current tenant members, but historical assignments remain visible for audit/history even if membership is later removed.
- Existing alert rules with `event_type = sla_due` are preserved and should become effective once the SLA due producer is implemented (no destructive data migration of workspace-owned alert rules).
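
The SLA due producer mentioned in the last assumption (and specified in FR-006) could be sketched roughly as follows. This assumes findings are available as simple in-memory records. The `sla_due_event` function, the event dictionary layout, and the fingerprint string format are hypothetical; only `sla_due`, `metadata`, `overdue_total`, `overdue_by_severity`, `fingerprint_key`, and the `(window_start, now]` rule are taken from this spec.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

OPEN_STATUSES = frozenset({"new", "triaged", "in_progress", "reopened"})
SEVERITIES = ("critical", "high", "medium", "low")


@dataclass(frozen=True)
class Finding:  # hypothetical in-memory shape
    tenant_id: str
    status: str
    severity: str
    due_at: Optional[datetime]


def sla_due_event(tenant_id: str, findings: Iterable[Finding],
                  window_start: datetime, now: datetime) -> Optional[dict]:
    """Return one summary event for the tenant, or None if nothing is newly overdue."""
    # Currently overdue: open findings of this tenant with due_at <= now.
    overdue = [
        f for f in findings
        if f.tenant_id == tenant_id
        and f.status in OPEN_STATUSES
        and f.due_at is not None
        and f.due_at <= now
    ]
    # Newly overdue: due_at falls inside (window_start, now].
    newly_overdue = [f for f in overdue if f.due_at > window_start]
    if not newly_overdue:
        return None  # persistently overdue findings alone do not re-alert

    by_severity = Counter(f.severity for f in overdue)
    return {
        "event_type": "sla_due",
        "tenant_id": tenant_id,
        # Assumed format: stable for a given tenant + window start.
        "fingerprint_key": f"sla_due:{tenant_id}:{window_start.isoformat()}",
        "metadata": {
            "overdue_total": len(overdue),
            "overdue_by_severity": {s: by_severity.get(s, 0) for s in SEVERITIES},
        },
    }
```

The gate uses only the newly-overdue subset, while the payload summarizes every currently overdue open finding. That split is what lets a single event per window carry the full picture without repeating on every evaluation run.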