# Research: 111 — Findings Workflow V2 + SLA

**Date**: 2026-02-24
**Branch**: `111-findings-workflow-sla`

---

## 1. Status Model + Legacy Mapping

### Decision
Keep `findings.status` as a string column and expand the set of allowed values for v2. Preserve legacy `acknowledged` rows for compatibility, but treat `acknowledged` as `triaged` in the v2 workflow surface and migrate those rows during backfill.
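
A minimal sketch of the read-side mapping, in Python for illustration only. The names `LEGACY_STATUS_MAP` and `effective_status` are hypothetical; only the status strings come from this document.

```python
# Hypothetical helper: only the status strings are taken from this document.
LEGACY_STATUS_MAP = {"acknowledged": "triaged"}


def effective_status(stored_status: str) -> str:
    """Return the v2 status to present for a stored status value.

    Legacy rows keep their stored value until the backfill rewrites them;
    until then the workflow surface sees the mapped v2 value.
    """
    return LEGACY_STATUS_MAP.get(stored_status, stored_status)
```

With this shape, un-migrated rows behave like `triaged` rows without a hard cutover.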

### Rationale
- Existing findings already use `new|acknowledged|resolved` with `acknowledged_at/by` fields.
- Mapping `acknowledged → triaged` preserves intent while enabling the new workflow.
- Avoids high-risk data migrations that try to rewrite history beyond what the spec requires.

### Alternatives Considered
- Dropping the legacy `acknowledged` status immediately and forcing a hard migration in one deploy: rejected due to rollout risk.

---

## 2. Workflow Enforcement + Timestamps

### Decision
Enforce transitions server-side via a dedicated workflow service (single entrypoint used by Filament actions and any future API surfaces). Update timestamps on state changes:
- `triaged_at` set on `new|reopened → triaged`
- `in_progress_at` set on `triaged → in_progress`
- `resolved_at` + `resolved_reason` set on resolve
- `closed_at` + `closed_reason` + `closed_by_user_id` set on close and risk accept
- `reopened_at` set on reopen

When reopening, clear the terminal-state fields that belong to the previous terminal status (e.g., clear `resolved_at` and `resolved_reason` when moving from `resolved` to `reopened`).
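
The timestamp rules above could be sketched as a single entrypoint module, shown here in Python for illustration. The `Finding` dataclass, the function names, and `InvalidTransition` are hypothetical. Which statuses may be resolved or reopened is not fully specified in this document, so those two guards are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


class InvalidTransition(Exception):
    """Raised when the current status does not allow the requested transition."""


@dataclass
class Finding:
    status: str = "new"
    triaged_at: Optional[datetime] = None
    in_progress_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    resolved_reason: Optional[str] = None
    reopened_at: Optional[datetime] = None


def _require(finding: Finding, allowed: set) -> None:
    if finding.status not in allowed:
        raise InvalidTransition(f"cannot transition from {finding.status!r}")


def triage(finding: Finding, now: datetime) -> None:
    _require(finding, {"new", "reopened"})
    finding.status, finding.triaged_at = "triaged", now


def start_progress(finding: Finding, now: datetime) -> None:
    _require(finding, {"triaged"})
    finding.status, finding.in_progress_at = "in_progress", now


def resolve(finding: Finding, reason: str, now: datetime) -> None:
    _require(finding, OPEN_STATUSES)  # assumption: any open status may resolve
    finding.status = "resolved"
    finding.resolved_at, finding.resolved_reason = now, reason


def reopen(finding: Finding, now: datetime) -> None:
    _require(finding, {"resolved"})  # assumption: only the resolved case is shown
    finding.status, finding.reopened_at = "reopened", now
    # Clear terminal fields so status=reopened never coexists with resolved_at.
    finding.resolved_at = finding.resolved_reason = None
```

Close and risk accept would follow the same pattern with the `closed_*` fields; they are omitted to keep the sketch short.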

### Rationale
- Keeps status validation consistent across UI and background jobs.
- Timestamp fields provide direct auditability for “when did we triage/close”.
- Clearing terminal fields prevents inconsistent states (e.g., `status=reopened` with `resolved_at` still set).

### Alternatives Considered
- Implementing all transitions as ad-hoc model mutations across multiple resources: rejected (harder to test and prone to drift).

---

## 3. SLA Policy Storage (SettingsRegistry)

### Decision
Add a workspace-resolvable setting:
- `domain = findings`
- `key = sla_days`
- `type = json`
- `systemDefault`:
  - critical: 3
  - high: 7
  - medium: 14
  - low: 30

Expose it via Workspace Settings UI as a JSON textarea, following the existing `drift.severity_mapping` pattern.
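
A sketch of how a workspace override might be overlaid on the system default, in Python for illustration. The function name is hypothetical. The per-key overlay and the validation rules are assumptions: this document does not describe how `SettingsResolver` merges or validates values.

```python
import json
from typing import Optional

# Values taken from the systemDefault listed above.
SYSTEM_DEFAULT_SLA_DAYS = {"critical": 3, "high": 7, "medium": 14, "low": 30}


def resolve_sla_days(workspace_json: Optional[str]) -> dict:
    """Overlay a workspace JSON override on the system default (assumed semantics)."""
    resolved = dict(SYSTEM_DEFAULT_SLA_DAYS)
    if not workspace_json:
        return resolved
    override = json.loads(workspace_json)
    if not isinstance(override, dict):
        return resolved
    for severity, days in override.items():
        # Assumption: ignore unknown severities and anything that is not a
        # positive integer (bool is excluded because it subclasses int).
        valid = isinstance(days, int) and not isinstance(days, bool) and days > 0
        if severity in resolved and valid:
            resolved[severity] = days
    return resolved
```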

### Rationale
- Existing code already uses `SettingsResolver` + `SettingsRegistry` for drift severity mapping.
- Keeps SLA policy queryable and adjustable without code deploys.

### Alternatives Considered
- Hardcoding SLA days in code/config: rejected (non-configurable and harder to tune per workspace).

---

## 4. due_at Computation Semantics

### Decision
- On create (new finding): set `sla_days` (resolved from settings) and set `due_at = first_seen_at + sla_days`.
- On reopen (manual or automatic): reset `due_at = now + sla_days(current severity)` and update `sla_days` to the current policy value.
- Severity changes while a finding remains open do not retroactively change `due_at` unless the finding is reopened (matches spec assumptions).
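
The two computations above can be sketched as follows (Python for illustration; function names are hypothetical, and `policy` stands for the resolved `sla_days` mapping of severity to days):

```python
from datetime import datetime, timedelta


def due_at_on_create(first_seen_at: datetime, severity: str, policy: dict):
    """Return (sla_days, due_at) for a newly created finding."""
    sla_days = policy[severity]
    return sla_days, first_seen_at + timedelta(days=sla_days)


def due_at_on_reopen(now: datetime, severity: str, policy: dict):
    """Return (sla_days, due_at) when a finding is reopened.

    Uses the current severity and the current policy value, so the stored
    sla_days is refreshed together with due_at.
    """
    sla_days = policy[severity]
    return sla_days, now + timedelta(days=sla_days)
```

For example, a `high` finding first seen on 2026-02-01 with a 7-day policy is due on 2026-02-08. If it is reopened on 2026-03-01 as `critical` with a 3-day policy, the new deadline is 2026-03-04.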

### Rationale
- Allows stable deadlines during remediation while still resetting on recurrence/reopen.
- Reduces surprising “deadline moved” behavior during open triage.

### Alternatives Considered
- Recomputing `due_at` on every detection run for open findings: rejected (deadlines drift and become hard to reason about).

---

## 5. SLA Due Alert Event (Tenant-Level Summary)

### Decision
Implement an SLA due producer in `EvaluateAlertsJob` that emits **one event per tenant** when a tenant has **newly-overdue** open findings in the evaluation window:
- Eligibility for producing an event (per tenant): at least one finding satisfies all of:
  - `due_at <= now()`
  - `due_at > windowStart` (newly overdue since last evaluation)
  - `status IN (new, triaged, in_progress, reopened)`
- Event summarizes **current** overdue counts for that tenant (not just newly overdue), so the alert body reflects the real state at emission time.

Event fields:
- `event_type = sla_due`
- `fingerprint_key = sla_due:tenant:{tenant_id}`
- `severity = max severity among overdue open findings` (critical if any critical overdue exists)
- `metadata` contains counts only (no per-finding payloads)
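
A sketch of the per-tenant producer logic, in Python for illustration. The function name, the dict-based finding shape, and the `metadata` key names are hypothetical; the eligibility conditions and event fields follow the lists above.

```python
from datetime import datetime
from typing import Optional

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def build_sla_due_event(tenant_id, findings, window_start: datetime,
                        now: datetime) -> Optional[dict]:
    """Return one summary event for the tenant, or None if nothing newly crossed due."""
    overdue = [
        f for f in findings
        if f["status"] in OPEN_STATUSES
        and f["due_at"] is not None
        and f["due_at"] <= now
    ]
    # Emit only when at least one finding became overdue inside the window.
    if not any(f["due_at"] > window_start for f in overdue):
        return None

    # Summarize ALL currently overdue open findings, not just the new ones.
    counts = {}
    for f in overdue:
        counts[f["severity"]] = counts.get(f["severity"], 0) + 1

    return {
        "event_type": "sla_due",
        "fingerprint_key": f"sla_due:tenant:{tenant_id}",
        "severity": max(counts, key=SEVERITY_RANK.__getitem__),
        # Hypothetical key names; counts only, no per-finding payloads.
        "metadata": {"overdue_total": len(overdue), "overdue_by_severity": counts},
    }
```

A persistently overdue tenant with no new crossings yields `None`, which is what keeps suppressed deliveries from piling up on every run.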

### Rationale
- Avoids creating suppressed `alert_deliveries` every minute for persistently overdue tenants (no prune job exists for `alert_deliveries` today).
- Aligns “due” semantics with the due moment: a tenant produces an event when something crosses due.

### Alternatives Considered
- Emitting an SLA due event on every evaluation run when overdue exists: rejected due to `alert_deliveries` table growth and suppressed-delivery noise.
- Tracking last-emitted state per tenant in a new table: rejected for v1 (adds schema and state complexity).

---

## 6. Drift Recurrence: Stable recurrence_key + Canonical Row

### Decision
Add `recurrence_key` (64-char hex) to `findings` and treat it as the stable identity for drift recurrence. For drift findings:
- Compute `recurrence_key = sha256("drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}")`
- Upsert drift findings by `(tenant_id, recurrence_key)`
- Set drift finding `fingerprint = recurrence_key` for canonical drift rows going forward

`dimension` is stable and derived from evidence kind and change type:
- Policy snapshot drift: `policy_snapshot:{change_type}`
- Assignments drift: `policy_assignments`
- Scope tags drift: `policy_scope_tags`
- Baseline compare drift: `baseline_compare:{change_type}`
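
The key derivation can be sketched as below (Python for illustration). The function name is hypothetical, and the UTF-8 encoding of the input string is an assumption; the template string itself is the one given above.

```python
import hashlib


def drift_recurrence_key(tenant_id, scope_key, subject_type,
                         subject_external_id, dimension) -> str:
    """Return the 64-char hex SHA-256 of the drift identity string."""
    raw = (
        f"drift:{tenant_id}:{scope_key}:{subject_type}:"
        f"{subject_external_id}:{dimension}"
    )
    # Assumption: the string is hashed as UTF-8 bytes.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Because no baseline or current hash enters the input, the same subject drifting again on the same dimension produces the same key.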

### Rationale
- Prevents “new row per re-drift” even when baseline/current hashes change.
- Avoids conflicts with legacy drift fingerprints during consolidation because new canonical drift fingerprints are stable and distinct.

### Alternatives Considered
- Keeping drift fingerprint as baseline/current hash-based and updating it on the canonical row: rejected because it can collide with existing legacy rows (unique `(tenant_id, fingerprint)` constraint).

---

## 7. Drift Stale Auto-Resolve

### Decision
When generating drift findings for a scope/run, auto-resolve drift findings that were previously open for that scope but are not detected in the latest run:
- Filter: `finding_type=drift`, `scope_key=...`, `status IN (new, triaged, in_progress, reopened)`
- `recurrence_key` is not in the set of keys detected by the latest run
- Resolve reason: `no_longer_detected`
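
A sketch of the stale sweep, in Python for illustration. Findings are modeled as plain dicts and the function name is hypothetical; setting `resolved_at` follows the timestamp rules in section 2.

```python
from datetime import datetime

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


def auto_resolve_stale(findings, scope_key: str, seen_keys: set,
                       now: datetime) -> list:
    """Resolve open drift findings in the scope that the latest run did not detect."""
    resolved = []
    for f in findings:
        is_candidate = (
            f["finding_type"] == "drift"
            and f["scope_key"] == scope_key
            and f["status"] in OPEN_STATUSES
        )
        if is_candidate and f["recurrence_key"] not in seen_keys:
            f["status"] = "resolved"
            f["resolved_reason"] = "no_longer_detected"
            f["resolved_at"] = now
            resolved.append(f)
    return resolved
```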

### Rationale
- Keeps “Open” findings aligned with current observed state.
- Matches existing generator patterns (permission posture / Entra roles resolve stale records).

### Alternatives Considered
- Leaving stale findings open indefinitely: rejected (increases noise and breaks trust in “Open” list).

---

## 8. Backfill + Consolidation (OperationRun-Backed)

### Decision
Implement a tenant-scoped backfill/consolidation operation backed by `OperationRun`:
- Maps `acknowledged → triaged`
- Populates lifecycle fields (`first_seen_at`, `last_seen_at`, `times_seen`, `due_at`, `sla_days`, timestamps)
- Computes `recurrence_key` for drift and consolidates duplicates so only one canonical open finding remains per `(tenant_id, recurrence_key)`
- Sets due dates for legacy open findings to `due_at = backfill_started_at + sla_days` (prevents an immediate overdue surge)

Duplicate strategy:
- Choose one canonical row per `(tenant_id, recurrence_key)` (prefer open, else most recently seen)
- Non-canonical duplicates become terminal (`resolved` with `resolved_reason=consolidated_duplicate`) and have `recurrence_key` cleared to keep canonical uniqueness simple
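
The duplicate strategy could be sketched like this (Python for illustration; dict-based rows and the function name are hypothetical). Using `last_seen_at` as the tie-breaker among several open rows is an assumption; the document only states "prefer open, else most recently seen".

```python
from collections import defaultdict
from datetime import datetime

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


def consolidate(findings) -> list:
    """Pick one canonical row per (tenant_id, recurrence_key); retire the rest."""
    groups = defaultdict(list)
    for f in findings:
        if f["recurrence_key"] is not None:
            groups[(f["tenant_id"], f["recurrence_key"])].append(f)

    canonical = []
    for rows in groups.values():
        # Sort key: open rows beat non-open rows, then most recently seen wins.
        best = max(rows, key=lambda f: (f["status"] in OPEN_STATUSES, f["last_seen_at"]))
        canonical.append(best)
        for f in rows:
            if f is not best:
                f["status"] = "resolved"
                f["resolved_reason"] = "consolidated_duplicate"
                # Clearing the key keeps (tenant_id, recurrence_key) unique.
                f["recurrence_key"] = None
    return canonical
```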

### Rationale
- Meets OPS-UX requirements (queued toast, progress surfaces, initiator-only terminal notification).
- Makes legacy data usable without requiring manual cleanup.

### Alternatives Considered
- Deleting duplicate rows: rejected because the spec explicitly allows legacy rows to remain (and deletions are harder to justify operationally).

---

## 9. Capabilities + RBAC Enforcement

### Decision
Add tenant-context capabilities:
- `TENANT_FINDINGS_VIEW`
- `TENANT_FINDINGS_TRIAGE`
- `TENANT_FINDINGS_ASSIGN`
- `TENANT_FINDINGS_RESOLVE`
- `TENANT_FINDINGS_CLOSE`
- `TENANT_FINDINGS_RISK_ACCEPT`

Keep `TENANT_FINDINGS_ACKNOWLEDGE` as a deprecated alias for v2 triage permission:
- UI enforcement and server-side policy checks treat `ACKNOWLEDGE` as sufficient for triage during the migration window.
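
A sketch of the alias check, in Python for illustration. The helper name and the alias table are hypothetical; only the capability strings come from this document.

```python
TRIAGE = "TENANT_FINDINGS_TRIAGE"
ACKNOWLEDGE = "TENANT_FINDINGS_ACKNOWLEDGE"

# Deprecated capabilities that still satisfy a v2 capability during migration.
DEPRECATED_ALIASES = {TRIAGE: {ACKNOWLEDGE}}


def has_capability(granted: set, required: str) -> bool:
    """True if the role grants the capability directly or via a deprecated alias."""
    if required in granted:
        return True
    return any(alias in granted for alias in DEPRECATED_ALIASES.get(required, ()))
```

Routing both UI checks and server-side policies through one helper like this keeps the alias in a single place, so it can be removed once the migration window closes.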

### Rationale
- Aligns with RBAC-UX constitution requirements (registry-only strings, 404/403 semantics).
- Allows incremental rollout without breaking existing role mappings.

### Alternatives Considered
- Forcing all tenants to update role mappings at deploy time: rejected (operationally brittle).