TenantAtlas/specs/111-findings-workflow-sla/spec.md
ahmido 7ac53f4cc4 feat(111): findings workflow + SLA settings (#135)
Implements spec 111 (Findings workflow + SLA) and fixes Workspace findings SLA settings UX/validation.

Key changes:
- Findings workflow service + SLA policy and alerting.
- Workspace settings: allow partial SLA overrides without auto-filling unset severities in the UI; effective values still resolve via defaults.
- New migrations, jobs, command, UI/resource updates, and comprehensive test coverage.

Tests:
- `vendor/bin/sail artisan test --compact` (1779 passed, 8 skipped).

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #135
2026-02-25 01:48:01 +00:00

Feature Specification: Findings Workflow V2 + SLA

Feature Branch: 111-findings-workflow-sla
Created: 2026-02-24
Status: Draft
Depends On: specs/104-provider-permission-posture/spec.md, specs/105-entra-admin-roles-evidence-findings/spec.md, specs/109-review-pack-export/spec.md
Input: Standardize the Findings lifecycle (workflow, ownership, recurrence, SLA due dates, and alerting) so findings management is enterprise-usable and not “noise”.

Clarifications

Session 2026-02-24

  • Q: What should happen when the same finding is detected again, but its current status is terminal? → A: Auto-reopen only from resolved; closed and risk_accepted remain terminal (still update seen tracking fields).
  • Q: When backfilling legacy open findings, how should the initial due date be set? → A: Compute from the backfill operation time (backfill time + SLA days).
  • Q: When SLA due alerts fire, what should a single alert event represent? → A: At most one event per tenant per alert-evaluation window, emitted only when newly-overdue open findings exist; the event summarizes current overdue counts.
  • Q: Which statuses should count as “Open” for the default Findings list and for SLA due evaluation? → A: Open = new, triaged, in_progress, reopened.
  • Q: From which statuses should a user be able to manually “Reopen” a finding (into reopened status)? → A: Allow manual reopen from resolved, closed, and risk_accepted.
  • Q: Where is the SLA policy configured, and what scope does it apply to? → A: Workspace-scoped setting (findings.sla_days) in Workspace Settings; applies to all tenants in the workspace.
  • Q: How is the “alert-evaluation window” defined for SLA due gating? → A: Use the Alerts evaluation window start time (previous completed alerts.evaluate OperationRun completed_at; fallback to initial lookback). “Newly overdue” means due_at in (window_start, now] for open findings.
  • Q: What must an sla_due event contain? → A: One event per tenant per evaluation window; metadata includes overdue_total and overdue_by_severity (critical/high/medium/low) for currently overdue open findings; fingerprint is stable per tenant+window.
  • Q: If severity changes while a finding remains open, should due_at be recalculated? → A: No — due_at is set on create and reset only on reopen/backfill.
  • Q: If a user resolves a finding while a detection run is processing, how is consistency maintained? → A: Detection updates may still advance seen counters, but automatic reopen MUST occur only when the observation time is after resolved_at.

Spec Scope Fields (mandatory)

  • Scope: tenant (Findings management) + workspace (SLA policy + Alert rules configuration)
  • Primary Routes:
    • Tenant-context: Findings list + view (/admin/t/{tenant}/...)
    • Workspace-context Monitoring: Alert rules list + edit (/admin/...)
    • Workspace-context Settings: Workspace Settings (Findings SLA policy) (/admin/...)
  • Data Ownership:
    • Tenant-owned: Findings and their lifecycle metadata
    • Workspace-owned: SLA policy settings (findings.sla_days)
    • Workspace-owned: Alert rules configuration (event types)
  • RBAC:
    • Findings view + workflow actions are tenant-context capability-gated
    • Workspace Settings + Alert rules remain workspace capability/policy-gated (existing behavior)

Canonical-view fields not applicable — this spec updates tenant-context Findings and workspace-scoped Alert Rules.

User Scenarios & Testing (mandatory)

User Story 1 - See Open Findings (Priority: P1)

As a tenant operator, I can open the Findings page and immediately see the current open findings across all finding types, so I don't miss non-drift issues and can focus on what needs attention now.

Why this priority: If open findings are hidden by default filters or type assumptions, findings become unreliable as an operational surface.

Independent Test: Seed a tenant with findings across multiple types and statuses, then verify the default list shows open workflow statuses across all types without adjusting filters.

Acceptance Scenarios:

  1. Given a tenant has findings of types drift, permission posture, and Entra admin roles, When I open the Findings list, Then I can see open findings from all types without changing any filters.
  2. Given a tenant has a mix of open and terminal findings, When I open the Findings list, Then the default list shows only open workflow statuses.
  3. Given a tenant has overdue findings, When I use the “Overdue” quick filter, Then only findings past their due date are shown.
  4. Given a tenant has open findings, When I view the list, Then I can see each finding's status, severity, due date, and assignee (when set).

User Story 2 - Triage, Assign, And Resolve (Priority: P1)

As a tenant manager, I can triage findings, assign ownership, and move findings through a consistent workflow (including reasons and auditability), so the team can reliably manage remediation.

Why this priority: Without a consistent workflow and ownership, findings degrade into noisy, un-actioned rows with unclear accountability.

Independent Test: Create an open finding, execute each allowed status transition, and verify transitions are enforced server-side, recorded with timestamps/actors, and audited.

Acceptance Scenarios:

  1. Given a finding in new (or reopened) status, When I triage it, Then the status becomes triaged and the triage timestamp is recorded.
  2. Given a finding in triaged status, When I start progress, Then the status becomes in_progress and the progress timestamp is recorded.
  3. Given a finding in an open status, When I assign an assignee (and optional owner), Then those fields are saved and displayed on the finding.
  4. Given a finding in an open status, When I resolve it with a resolution reason, Then it becomes resolved and the resolution reason is persisted.
  5. Given a finding in any status, When I close it with a close reason, Then it becomes closed and the close reason is persisted.
  6. Given a finding in any status, When I mark it as risk accepted with a reason, Then it becomes risk_accepted and the reason is persisted.
  7. Given a user without the relevant capability, When they attempt any workflow mutation, Then the server denies it (403 for members lacking capability; 404 for non-members / not entitled).
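
The transitions exercised by these scenarios (and listed under FR-002) can be sketched as a lookup table keyed by target status. This is a minimal illustration in Python, not the project's implementation: the names `Status`, `ALLOWED_FROM`, `REASON_REQUIRED`, and `can_transition` are hypothetical. The sketch reads "any status" literally, so a self-transition such as closed to closed is accepted; the spec does not say whether that should be rejected.

```python
from enum import Enum


class Status(str, Enum):
    NEW = "new"
    TRIAGED = "triaged"
    IN_PROGRESS = "in_progress"
    REOPENED = "reopened"
    RESOLVED = "resolved"
    CLOSED = "closed"
    RISK_ACCEPTED = "risk_accepted"


OPEN = {Status.NEW, Status.TRIAGED, Status.IN_PROGRESS, Status.REOPENED}
TERMINAL = {Status.RESOLVED, Status.CLOSED, Status.RISK_ACCEPTED}

# Target status -> statuses it may be entered from (manual transitions).
ALLOWED_FROM = {
    Status.TRIAGED: {Status.NEW, Status.REOPENED},
    Status.IN_PROGRESS: {Status.TRIAGED},
    Status.RESOLVED: OPEN,
    Status.REOPENED: TERMINAL,
    Status.CLOSED: set(Status),         # any status
    Status.RISK_ACCEPTED: set(Status),  # any status
}

# Targets that must carry a non-empty reason.
REASON_REQUIRED = {Status.RESOLVED, Status.CLOSED, Status.RISK_ACCEPTED}


def can_transition(current, target, reason=None):
    """Return True when a manual transition is allowed by the table above."""
    if current not in ALLOWED_FROM.get(target, set()):
        return False
    if target in REASON_REQUIRED and not (reason and reason.strip()):
        return False
    return True
```

A server-side guard would call `can_transition` before persisting the change, so a UI that offers a disallowed action still cannot apply it.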

User Story 3 - SLA Due Visibility And Alerts (Priority: P1)

As a workspace operator, I can configure alerting for findings that are past their due date (SLA due), so overdue findings reliably escalate beyond the Findings page.

Why this priority: An SLA without alerting becomes “best effort” and is easy to ignore in busy operations.

Independent Test: Create newly-overdue open findings for a tenant, run alert evaluation, and verify a single tenant-level SLA due event is produced and can match an enabled alert rule.

Acceptance Scenarios:

  1. Given a tenant has one or more newly-overdue open findings since the previous evaluation window, When alert evaluation runs, Then exactly one SLA due event is produced for that tenant and can trigger an enabled alert rule.
  2. Given a tenant has no overdue open findings (including when only terminal findings have past due dates), When alert evaluation runs, Then no SLA due event is produced for that tenant.
  3. Given I edit an alert rule, When I choose the event type, Then “SLA due” is available as a selectable event type.
  4. Given a tenant has overdue open findings but no newly-overdue open findings since the previous evaluation window, When alert evaluation runs, Then no additional SLA due event is produced for that tenant.
  5. Given an SLA due event is produced, When I inspect the event payload, Then it includes overdue counts total and by severity.
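
The gating rule behind these scenarios can be sketched as follows: an event is produced only when at least one open finding has `due_at` in (window_start, now], while the payload summarizes every currently overdue open finding. This is an illustrative Python sketch under stated assumptions. The `Finding` dataclass, the function `build_sla_due_event`, the event dictionary layout, and the fingerprint string format are hypothetical; only `overdue_total`, `overdue_by_severity`, and `fingerprint_key` are names taken from this spec.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}
SEVERITIES = ("critical", "high", "medium", "low")


@dataclass
class Finding:
    status: str
    severity: str
    due_at: datetime


def build_sla_due_event(tenant_id, findings, window_start, now):
    """Return one summary event for the tenant, or None if nothing is newly overdue."""
    # Currently overdue: open findings whose due date has passed.
    overdue = [f for f in findings if f.status in OPEN_STATUSES and f.due_at <= now]
    # Newly overdue: became overdue inside (window_start, now].
    newly_overdue = [f for f in overdue if f.due_at > window_start]
    if not newly_overdue:
        return None

    by_severity = {severity: 0 for severity in SEVERITIES}
    for f in overdue:
        by_severity[f.severity] += 1

    return {
        "event_type": "sla_due",
        "tenant_id": tenant_id,
        # Hypothetical format; the spec only requires stability per tenant + window.
        "fingerprint_key": f"sla_due:{tenant_id}:{window_start.isoformat()}",
        "metadata": {
            "overdue_total": len(overdue),
            "overdue_by_severity": by_severity,
        },
    }
```

Because the fingerprint depends only on the tenant and the window start, re-running the evaluation for the same window yields the same key, which is what allows duplicate events to be dropped.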

User Story 4 - Recurrence Reopens (Priority: P2)

As a tenant operator, when a previously resolved finding reappears in a later detection run, I see the original finding reopen (instead of a new duplicate being created), so recurrence is visible and manageable.

Why this priority: Recurrence is operationally important, and duplicate rows create confusion and reporting noise.

Independent Test: Simulate a finding being resolved and then being detected again, verifying it transitions to reopened, counters update, and due date resets.

Acceptance Scenarios:

  1. Given a finding was resolved, When it is detected again, Then the same finding transitions to reopened and records a reopened timestamp.
  2. Given a finding is detected in successive runs, When it appears again, Then the last-seen timestamp updates and the seen counter increases.
  3. Given a drift finding is no longer detected in the latest run, When stale detection is evaluated, Then the drift finding is auto-resolved with reason “no longer detected”.
  4. Given a finding is closed or risk_accepted, When it is detected again, Then it remains terminal and only its seen tracking fields update.
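
The recurrence rules above, together with the concurrency rule in FR-007, can be sketched as a single observation handler. This is a hedged Python illustration, not the project's code: the `Finding` fields `times_seen`, `last_seen_at`, and `reopened_at` and the function `record_observation` are hypothetical names. It also assumes the reset due date is measured from the observation time, which the spec does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Default SLA days per severity, as listed in FR-004.
DEFAULT_SLA_DAYS = {"critical": 3, "high": 7, "medium": 14, "low": 30}


@dataclass
class Finding:
    status: str
    severity: str
    last_seen_at: datetime
    times_seen: int = 1
    resolved_at: Optional[datetime] = None
    reopened_at: Optional[datetime] = None
    due_at: Optional[datetime] = None


def record_observation(finding, observed_at, sla_days=DEFAULT_SLA_DAYS):
    """Apply one detection-run observation to an existing finding."""
    # Seen tracking advances for every status, including terminal ones.
    finding.times_seen += 1
    # Keep the latest timestamp so an out-of-order run cannot move it backwards.
    finding.last_seen_at = max(finding.last_seen_at, observed_at)

    # Auto-reopen only from resolved, and only if observed after resolution.
    if (
        finding.status == "resolved"
        and finding.resolved_at is not None
        and observed_at > finding.resolved_at
    ):
        finding.status = "reopened"
        finding.reopened_at = observed_at
        # Assumption: the new due date counts from the observation time.
        finding.due_at = observed_at + timedelta(days=sla_days[finding.severity])
    return finding
```

Comparing `observed_at` with `resolved_at` is what keeps a detection run that started before a manual resolve from immediately undoing it.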

User Story 5 - Bulk Manage Findings (Priority: P3)

As a tenant manager, I can triage/assign/resolve/close findings in bulk, so I can manage high volumes efficiently while preserving auditability and safety.

Why this priority: Bulk workflow reduces operational load, but can ship after the single-record workflow is correct.

Independent Test: Select multiple findings and run each bulk action, verifying that all selected findings update consistently and each change is audited.

Acceptance Scenarios:

  1. Given I select multiple open findings, When I bulk triage them, Then all selected findings become triaged.
  2. Given I select multiple open findings, When I bulk assign an assignee, Then all selected findings are assigned.
  3. Given I select multiple open findings, When I bulk resolve them with a reason, Then all selected findings become resolved and record the reason.
  4. Given I select multiple open findings, When I bulk close them with a reason, Then all selected findings become closed and record the close reason.
  5. Given I select multiple open findings, When I bulk risk accept them with a reason, Then all selected findings become risk_accepted and record the reason.
  6. Given more than 100 open findings match my current filters, When I run “Triage all matching”, Then the action requires typed confirmation, updates all matching findings safely, and audits each change.

User Story 6 - Backfill Existing Findings (Priority: P2)

As a tenant operator, I can run a one-time backfill/consolidation operation to upgrade existing findings into the v2 workflow model, so older data is usable (due dates, counters, recurrence) without manual cleanup.

Why this priority: Without backfill, existing tenants keep legacy/incomplete findings and the new workflow appears inconsistent or broken.

Independent Test: Seed legacy findings (missing lifecycle fields, acknowledged status, drift duplicates), run the backfill operation, and verify fields are populated, statuses are mapped, and duplicates are consolidated.

Acceptance Scenarios:

  1. Given legacy open findings exist without due dates or lifecycle timestamps, When I run the backfill operation, Then open findings receive due dates set to the backfill operation time plus the SLA days for their severity, and lifecycle metadata is populated.
  2. Given legacy findings in acknowledged status exist, When I run the backfill operation, Then they appear as triaged in the v2 workflow surface.
  3. Given duplicate drift findings exist for the same recurring issue, When I run the backfill operation, Then duplicates are consolidated so only one canonical open finding remains.
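
Scenarios 1 and 2 can be sketched as a per-record upgrade step. This is a minimal Python illustration under assumptions: findings are modelled as plain dictionaries, `backfill_finding` is a hypothetical name, and the sketch only fills a due date when none exists. Duplicate consolidation (scenario 3) is left out because the spec does not fix how the canonical record is chosen.

```python
from datetime import datetime, timedelta, timezone

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}
# Default SLA days per severity, as listed in FR-004.
DEFAULT_SLA_DAYS = {"critical": 3, "high": 7, "medium": 14, "low": 30}


def backfill_finding(finding, backfill_time, sla_days=DEFAULT_SLA_DAYS):
    """Return an upgraded copy of a legacy finding record."""
    upgraded = dict(finding)

    # Legacy "acknowledged" maps to "triaged" in the v2 workflow.
    if upgraded["status"] == "acknowledged":
        upgraded["status"] = "triaged"

    # Open findings without a due date get backfill time + SLA days.
    if upgraded["status"] in OPEN_STATUSES and upgraded.get("due_at") is None:
        days = sla_days[upgraded["severity"]]
        upgraded["due_at"] = backfill_time + timedelta(days=days)
    return upgraded
```

Anchoring the due date at the backfill time rather than at the original detection time means a legacy finding does not become overdue the moment the rollout completes.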

Edge Cases

  • Legacy findings exist without lifecycle timestamps or due dates (backfill required).
  • A previously assigned/owned user is no longer a tenant member (retain historical assignment, but prevent selecting non-members for new assignments).
  • A finding's severity changes while it remains open (assumption on due date recalculation documented below).
  • An SLA due alert rule exists from earlier versions (should begin working once the producer exists; no data loss).
  • Concurrent actions: a user resolves a finding while a detection run marks it seen again (system remains consistent and auditable).

Requirements (mandatory)

Governance And Safety Requirements

  • This feature introduces no new external API calls.
  • All user-initiated workflow mutations (triage/assign/resolve/close/risk accept/reopen) MUST be audited with actor, tenant, action, target, before/after, and timestamp.
    • Audit before/after MUST be limited to workflow/assignment metadata (e.g., status, severity, due_at, assignee_id, owner_id, triaged_at, in_progress_at, resolved_at, closed_at, resolution_reason, close_reason, risk_accepted_reason) and MUST NOT include raw evidence payloads or secrets/tokens.
  • The lifecycle backfill/consolidation operation MUST be observable as an operation with:
    • clear start feedback (accepted/queued),
    • progress visibility while running, and
    • a single terminal outcome notification for the initiator.
  • Authorization MUST be enforced server-side for every mutation with deny-as-not-found semantics:
    • non-members or users not entitled to the tenant scope → 404
    • members missing capability → 403
  • Destructive-like actions (resolve/close/risk accept) MUST require explicit confirmation.
  • Findings status badge semantics MUST remain centralized and cover every allowed status.

Functional Requirements

  • FR-001: System MUST support a Findings lifecycle with statuses: new, triaged, in_progress, reopened, resolved, closed, risk_accepted.
  • FR-002: System MUST enforce allowed status transitions server-side:
    • new|reopened → triaged
    • triaged → in_progress
    • new|reopened|triaged|in_progress → resolved (resolution reason required)
    • resolved|closed|risk_accepted → reopened (manual allowed; requires confirmation; automatic only when detected again from resolved)
    • * → closed (close reason required)
    • * → risk_accepted (reason required)
  • FR-003: Each finding MUST track lifecycle metadata: owner, assignee, first-seen time, last-seen time, seen count, and (when open) an SLA due date.
  • FR-004: The system MUST assign an SLA due date to open findings using a configurable severity-based policy with defaults:
    • critical: 3 days
    • high: 7 days
    • medium: 14 days
    • low: 30 days
  • FR-005: When a finding reopens (automatic or manual), the system MUST reset the SLA due date based on the current severity-based SLA policy.
  • FR-006: SLA due alerting MUST exist:
    • “SLA due” MUST be available as an alert rule event type (sla_due).
    • The SLA due producer MUST use the same alert-evaluation window start time (window_start) used by Alerts evaluation (previous completed alerts.evaluate OperationRun completed_at; fallback to initial lookback).
    • “Newly overdue” means: an open finding with due_at in (window_start, now].
    • The system MUST emit exactly one SLA due event per tenant per alert-evaluation window when that tenant has one or more newly-overdue open findings since window_start.
    • Each SLA due event MUST summarize current overdue open findings for the tenant and include:
      • overdue_total (count)
      • overdue_by_severity (critical, high, medium, low)
    • A tenant with persistently overdue open findings MUST NOT emit repeated SLA due events on every evaluation run unless additional findings become newly overdue.
    • Terminal statuses (resolved, closed, risk_accepted) MUST NOT contribute to the overdue counts.
    • Open workflow statuses are new, triaged, in_progress, reopened.
    • The event's fingerprint_key MUST be stable per tenant + alert-evaluation window for idempotency.
  • FR-007: The system MUST track recurrence:
    • When a previously resolved finding is detected again, it MUST transition to reopened (not create a duplicate open finding for the same recurring issue).
    • When a closed or risk_accepted finding is detected again, it MUST NOT change status automatically; it only updates seen tracking fields.
    • Each detection run where the finding is observed MUST update last-seen time and increment seen count.
    • Concurrency safety: automatic reopen MUST occur only when the observation time is after the finding's resolved_at.
  • FR-008: Drift findings MUST avoid “new row per re-drift” noise by using a stable recurrence identity so recurring drift reopens the canonical finding.
  • FR-009: Drift findings MUST auto-resolve when they are no longer detected in the latest run, with a consistent resolved reason (e.g., “no longer detected”).
  • FR-010: Findings list defaults MUST be safe and visible:
    • Default list shows open statuses (new, triaged, in_progress, reopened) across all finding types (no drift-only default).
    • Quick filters exist for: Open, Overdue, High severity, My assigned.
  • FR-011: Findings UI MUST provide safe workflow actions:
    • Single-record actions: triage, start progress, assign (assignee and optional owner), resolve (reason required), close (reason required), risk accept (reason required), reopen (where allowed).
    • Bulk actions: bulk triage, bulk assign, bulk resolve, bulk close, bulk risk accept.
  • FR-012: The system MUST introduce tenant-context capabilities for Findings management:
    • TENANT_FINDINGS_VIEW
    • TENANT_FINDINGS_TRIAGE
    • TENANT_FINDINGS_ASSIGN
    • TENANT_FINDINGS_RESOLVE
    • TENANT_FINDINGS_CLOSE
    • TENANT_FINDINGS_RISK_ACCEPT
  • FR-013: Assignment/ownership selection MUST be limited to users who are currently tenant members, while preserving historical assignment/ownership values for already-assigned findings.
  • FR-014: Legacy compatibility MUST be maintained:
    • Existing acknowledged status MUST be treated as triaged in the v2 workflow surface.
    • Existing TENANT_FINDINGS_ACKNOWLEDGE capability MUST act as a deprecated alias for v2 triage permission.
  • FR-015: A backfill/consolidation operation MUST exist to migrate existing findings to the v2 lifecycle model, including:
    • mapping acknowledged → triaged
    • populating lifecycle timestamps and seen counters for existing data
    • setting due dates for legacy open findings based on the backfill operation time (backfill time + SLA days)
    • consolidating duplicates where recurrence identity indicates the same recurring finding (canonical record retained; duplicates marked terminal with a consistent reason, e.g. consolidated_duplicate)
  • FR-016: Severity changes while a finding remains open MUST NOT retroactively change due_at. due_at is assigned on create and reset only on reopen/backfill.
  • FR-017: Review pack generation MUST treat “open findings” using the v2 open-status set (not drift-only defaults) to keep existing exports/review packs consistent.
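
The due-date rule in FR-004 and FR-005 amounts to simple date arithmetic over a severity-keyed policy. The Python sketch below is an illustration under assumptions: `effective_sla_days` and `compute_due_at` are hypothetical names, and the shape of the workspace setting (a partial mapping of severity to days, with unset severities falling back to the defaults) is inferred from the commit note about partial SLA overrides.

```python
from datetime import datetime, timedelta, timezone

# Defaults from FR-004.
DEFAULT_SLA_DAYS = {"critical": 3, "high": 7, "medium": 14, "low": 30}


def effective_sla_days(overrides=None):
    """Merge a partial workspace override onto the defaults."""
    merged = dict(DEFAULT_SLA_DAYS)
    for severity, days in (overrides or {}).items():
        # Unknown severities and unset (None) values fall back to defaults.
        if severity in merged and days is not None:
            merged[severity] = days
    return merged


def compute_due_at(start, severity, overrides=None):
    """Due date = start (create or reopen time) + SLA days for the severity."""
    return start + timedelta(days=effective_sla_days(overrides)[severity])
```

For example, a high-severity finding created on 2026-02-24 is due on 2026-03-03 under the defaults, and on 2026-03-01 if the workspace overrides high to 5 days. Per FR-016, this function would be called on create and on reopen or backfill, not when severity changes on an open finding.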

UI Action Matrix (mandatory when Filament is changed)

Action Surface Contract: Satisfied for Findings and Alert Rules (explicit exemptions noted).

| Surface | Location | Header Actions | Inspect Affordance (List/Table) | Row Actions (max 2 visible) | Bulk Actions (grouped) | Empty-State CTA(s) | View Header Actions | Create/Edit Save+Cancel | Audit log? | Notes / Exemptions |
|---|---|---|---|---|---|---|---|---|---|---|
| Findings Resource | Admin UI: Findings | Optional: “Triage all matching” (capability-gated) | View action | View, More | Bulk triage, bulk assign, bulk resolve, bulk close, bulk risk accept (under More) | None | Triage, Start progress, Assign, Resolve, Close, Risk accept, Reopen (where allowed) | N/A | Yes | Empty-state exemption: findings are system-generated; no create CTA |
| Alert Rules Resource | Monitoring UI: Alert rules | Create (capability/policy-gated) | Clickable row | Edit, More | None (exempt) | Create alert rule | N/A (edit surface) | Save/Cancel | Yes | “SLA due” event type is available once the producer exists |

Key Entities (include if feature involves data)

  • Finding: Represents a detected issue for a tenant, including type, severity, lifecycle status, recurrence behavior, and lifecycle metadata (ownership, due date, seen tracking).
  • SLA policy: Severity-based due-date expectations applied to open findings, with configurable defaults.
  • Alert rule: Workspace-defined routing rules that can trigger delivery when an SLA due event occurs.
  • Audit entry: Immutable record of user-initiated workflow changes for accountability and compliance.

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: 100% of open findings have a computed due date (SLA) at creation and after any reopen event.
  • SC-002: Recurring findings reopen instead of creating duplicate open rows for the same recurring issue.
  • SC-003: The default Findings list shows open findings across all finding types without requiring users to remove type-specific filters.
  • SC-004: SLA due alerting is functional: tenants with newly-overdue open findings since the previous evaluation window can trigger alert rules and produce at most one SLA due event per tenant per evaluation window; terminal findings never contribute to SLA due alerts.
  • SC-005: Authorization behavior is correct and non-enumerable: non-members receive 404; members missing capability receive 403.
  • SC-006: Admins can triage/assign/resolve/close findings in bulk for at least 100 findings in a single action without needing per-record manual updates.
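
The authorization outcome in SC-005, combined with the legacy alias in FR-014, can be sketched as a small decision function. This Python sketch is illustrative only: `authorize`, `CAPABILITY_ALIASES`, and the use of 200 to stand for "allowed" are hypothetical, and the real check would sit inside the application's policy layer.

```python
# Deprecated capability -> the v2 capability it stands for (FR-014).
CAPABILITY_ALIASES = {"TENANT_FINDINGS_ACKNOWLEDGE": "TENANT_FINDINGS_TRIAGE"}


def authorize(is_tenant_member, capabilities, required):
    """Return the HTTP status a workflow mutation should produce."""
    # Non-members must not learn whether the tenant or finding exists.
    if not is_tenant_member:
        return 404
    # Expand deprecated aliases before checking the required capability.
    granted = {CAPABILITY_ALIASES.get(c, c) for c in capabilities}
    if required not in granted:
        return 403
    return 200
```

The membership check comes first on purpose: if the capability check ran first, a non-member could distinguish existing tenants (403) from missing ones (404).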

Assumptions

  • risk_accepted is a workflow status only in v2 (no expiry model in this feature).
  • SLA due dates are set on create and reset only on reopen or backfill. Severity changes while a finding remains open do not retroactively change the existing due date.
  • Backfill sets due dates for legacy open findings from the backfill operation time (backfill time + SLA days) to avoid an immediate “overdue” surge on rollout.
  • Assignment/ownership pickers show only current tenant members, but historical assignments remain visible for audit/history even if membership is later removed.
  • Existing alert rules with event_type = sla_due are preserved and should become effective once the SLA due producer is implemented (no destructive data migration of workspace-owned alert rules).