# Research: 111 — Findings Workflow V2 + SLA

**Date**: 2026-02-24
**Branch**: `111-findings-workflow-sla`

---

## 1. Status Model + Legacy Mapping

### Decision

Keep `findings.status` as a string column and expand allowed v2 values. Preserve legacy `acknowledged` rows for compatibility, but treat `acknowledged` as `triaged` in the v2 workflow surface and migrate it during backfill.

### Rationale

- Existing findings already use `new|acknowledged|resolved` with `acknowledged_at/by` fields.
- Mapping `acknowledged → triaged` preserves intent while enabling the new workflow.
- Avoids high-risk data migrations that try to rewrite history beyond what the spec requires.

### Alternatives Considered

- Dropping the legacy `acknowledged` status immediately and forcing a hard migration in one deploy: rejected due to rollout risk.

---

## 2. Workflow Enforcement + Timestamps

### Decision

Enforce transitions server-side via a dedicated workflow service (single entrypoint used by Filament actions and any future API surfaces). Update timestamps on state changes:

- `triaged_at` set on `new|reopened → triaged`
- `in_progress_at` set on `triaged → in_progress`
- `resolved_at` + `resolved_reason` set on resolve
- `closed_at` + `closed_reason` + `closed_by_user_id` set on close and risk accept
- `reopened_at` set on reopen

When reopening, clear terminal state fields relevant to the previous terminal status (e.g., clear `resolved_at`/`resolved_reason` when moving to `reopened`).

### Rationale

- Keeps status validation consistent across UI and background jobs.
- Timestamp fields provide direct auditability for “when did we triage/close”.
- Clearing terminal fields prevents inconsistent states (e.g., `status=reopened` with `resolved_at` still set).

### Alternatives Considered

- Implementing all transitions as ad-hoc model mutations across multiple resources: rejected (harder to test and easy to drift).
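The transition and timestamp rules above can be sketched as a small state machine. This is a minimal, illustrative Python sketch, not the actual PHP workflow service: the function and field names mirror the columns listed above, but which source states permit resolve/close is an assumption (any open state), and the reason values are passed through rather than prescribed.

```python
from datetime import datetime, timezone

# States the spec treats as "open"; resolve/close from any of these is an assumption.
OPEN = {"new", "triaged", "in_progress", "reopened"}


def now():
    return datetime.now(timezone.utc)


def apply_transition(finding: dict, action: str, reason=None, user_id=None):
    """Mutate `finding` per the timestamp rules above; raise on illegal moves."""
    status = finding["status"]
    if action == "triage":
        if status not in {"new", "reopened"}:
            raise ValueError(f"cannot triage from {status}")
        finding.update(status="triaged", triaged_at=now())
    elif action == "start":
        if status != "triaged":
            raise ValueError(f"cannot start from {status}")
        finding.update(status="in_progress", in_progress_at=now())
    elif action == "resolve":
        if status not in OPEN:
            raise ValueError(f"cannot resolve from {status}")
        finding.update(status="resolved", resolved_at=now(), resolved_reason=reason)
    elif action in ("close", "risk_accept"):
        if status not in OPEN:
            raise ValueError(f"cannot close from {status}")
        finding.update(status="closed", closed_at=now(),
                       closed_reason=reason, closed_by_user_id=user_id)
    elif action == "reopen":
        if status not in {"resolved", "closed"}:
            raise ValueError(f"cannot reopen from {status}")
        # Clear terminal fields for the previous terminal status.
        if status == "resolved":
            finding["resolved_at"] = finding["resolved_reason"] = None
        else:
            finding["closed_at"] = finding["closed_reason"] = None
            finding["closed_by_user_id"] = None
        finding.update(status="reopened", reopened_at=now())
    else:
        raise ValueError(f"unknown action {action}")
    return finding
```

A single entrypoint like this is what makes the "one service for Filament and API surfaces" decision testable: every caller goes through the same guard clauses.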
---

## 3. SLA Policy Storage (SettingsRegistry)

### Decision

Add a workspace-resolvable setting:

- `domain = findings`
- `key = sla_days`
- `type = json`
- `systemDefault`:
  - critical: 3
  - high: 7
  - medium: 14
  - low: 30

Expose it via Workspace Settings UI as a JSON textarea, following the existing `drift.severity_mapping` pattern.

### Rationale

- Existing code already uses `SettingsResolver` + `SettingsRegistry` for drift severity mapping.
- Keeps SLA policy queryable and adjustable without code deploys.

### Alternatives Considered

- Hardcoding SLA days in code/config: rejected (non-configurable and harder to tune per workspace).

---

## 4. due_at Computation Semantics

### Decision

- On create (new finding): set `sla_days` (resolved from settings) and set `due_at = first_seen_at + sla_days`.
- On reopen (manual or automatic): reset `due_at = now + sla_days(current severity)` and update `sla_days` to the current policy value.
- Severity changes while a finding remains open do not retroactively change `due_at` unless the finding is reopened (matches spec assumptions).

### Rationale

- Allows stable deadlines during remediation while still resetting on recurrence/reopen.
- Reduces surprising “deadline moved” behavior during open triage.

### Alternatives Considered

- Recomputing `due_at` on every detection run for open findings: rejected (deadlines drift and become hard to reason about).

---

## 5. SLA Due Alert Event (Tenant-Level Summary)

### Decision

Implement an SLA due producer in `EvaluateAlertsJob` that emits **one event per tenant** when a tenant has **newly-overdue** open findings in the evaluation window:

- Eligibility for producing an event (per tenant):
  - `due_at <= now()`
  - `due_at > windowStart` (newly overdue since last evaluation)
  - `status IN (new, triaged, in_progress, reopened)`
- Event summarizes **current** overdue counts for that tenant (not just newly overdue), so the alert body reflects the real state at emission time.
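The newly-overdue eligibility window above can be expressed as a simple filter. The sketch below is illustrative Python with hypothetical names (`newly_overdue_tenants`, dict-shaped findings); the real producer would run an equivalent database query inside `EvaluateAlertsJob`.

```python
from datetime import datetime, timedelta, timezone

# Open statuses per the eligibility filter above.
OPEN_STATUSES = ("new", "triaged", "in_progress", "reopened")


def newly_overdue_tenants(findings, window_start, now):
    """Tenant ids with at least one finding that crossed due within the window."""
    return {
        f["tenant_id"]
        for f in findings
        if f["status"] in OPEN_STATUSES
        and f["due_at"] is not None
        and window_start < f["due_at"] <= now  # crossed due since last evaluation
    }
```

The half-open window (`> windowStart`, `<= now`) is what prevents a persistently overdue tenant from producing an event on every run: only the evaluation in which a deadline is first crossed emits.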
Event fields:

- `event_type = sla_due`
- `fingerprint_key = sla_due:tenant:{tenant_id}`
- `severity = max severity among overdue open findings` (critical if any critical overdue exists)
- `metadata` contains counts only (no per-finding payloads)

### Rationale

- Avoids creating suppressed `alert_deliveries` every minute for persistently overdue tenants (no prune job exists for `alert_deliveries` today).
- Aligns “due” semantics with the due moment: a tenant produces an event when something crosses due.

### Alternatives Considered

- Emitting an SLA due event on every evaluation run when overdue exists: rejected due to `alert_deliveries` table growth and suppressed-delivery noise.
- Tracking last-emitted state per tenant in a new table: rejected for v1 (adds schema and state complexity).

---

## 6. Drift Recurrence: Stable recurrence_key + Canonical Row

### Decision

Add `recurrence_key` (64-char hex) to `findings` and treat it as the stable identity for drift recurrence. For drift findings:

- Compute `recurrence_key = sha256("drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}")`
- Upsert drift findings by `(tenant_id, recurrence_key)`
- Set drift finding `fingerprint = recurrence_key` for canonical drift rows going forward

`dimension` is stable and derived from evidence kind and change type:

- Policy snapshot drift: `policy_snapshot:{change_type}`
- Assignments drift: `policy_assignments`
- Scope tags drift: `policy_scope_tags`
- Baseline compare drift: `baseline_compare:{change_type}`

### Rationale

- Prevents “new row per re-drift” even when baseline/current hashes change.
- Avoids conflicts with legacy drift fingerprints during consolidation because new canonical drift fingerprints are stable and distinct.

### Alternatives Considered

- Keeping drift fingerprint as baseline/current hash-based and updating it on the canonical row: rejected because it can collide with existing legacy rows (unique `(tenant_id, fingerprint)` constraint).
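The `recurrence_key` formula above translates directly into code. A minimal Python sketch (the production code would be PHP, but the hash input format is taken verbatim from the Decision):

```python
import hashlib


def drift_recurrence_key(tenant_id, scope_key, subject_type,
                         subject_external_id, dimension):
    """sha256 over the stable drift identity tuple, per the format above."""
    raw = f"drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()  # 64-char hex
```

Because the input excludes baseline/current hashes, the key stays stable across re-drifts of the same subject and dimension, which is exactly what the upsert by `(tenant_id, recurrence_key)` relies on.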
---

## 7. Drift Stale Auto-Resolve

### Decision

When generating drift findings for a scope/run, auto-resolve drift findings that were previously open for that scope but are not detected in the latest run:

- Filter: `finding_type=drift`, `scope_key=...`, `status IN (new, triaged, in_progress, reopened)`
- Not seen in the run’s `recurrence_key` set
- Resolve reason: `no_longer_detected`

### Rationale

- Keeps “Open” findings aligned with current observed state.
- Matches existing generator patterns (permission posture / Entra roles resolve stale records).

### Alternatives Considered

- Leaving stale findings open indefinitely: rejected (increases noise and breaks trust in the “Open” list).

---

## 8. Backfill + Consolidation (OperationRun-Backed)

### Decision

Implement a tenant-scoped backfill/consolidation operation backed by `OperationRun`:

- Maps `acknowledged → triaged`
- Populates lifecycle fields (`first_seen_at`, `last_seen_at`, `times_seen`, `due_at`, `sla_days`, timestamps)
- Computes `recurrence_key` for drift and consolidates duplicates so only one canonical open finding remains per `(tenant_id, recurrence_key)`
- Due dates for legacy open findings: `due_at = backfill_started_at + sla_days` (prevents an immediate overdue surge)

Duplicates strategy:

- Choose one canonical row per `(tenant_id, recurrence_key)` (prefer open, else most recently seen)
- Non-canonical duplicates become terminal (`resolved` with `resolved_reason=consolidated_duplicate`) and have `recurrence_key` cleared to keep canonical uniqueness simple

### Rationale

- Meets OPS-UX requirements (queued toast, progress surfaces, initiator-only terminal notification).
- Makes legacy data usable without requiring manual cleanup.

### Alternatives Considered

- Deleting duplicate rows: rejected because the spec explicitly allows legacy rows to remain (and deletions are harder to justify operationally).
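The duplicates strategy above can be sketched as a grouping pass. This is an illustrative Python sketch with hypothetical names (`consolidate`, dict-shaped rows), assuming in-place mutation; the real operation would run as batched updates inside the `OperationRun`-backed job.

```python
from collections import defaultdict

OPEN = {"new", "triaged", "in_progress", "reopened"}


def consolidate(findings):
    """Keep one canonical row per (tenant_id, recurrence_key); mark the rest."""
    groups = defaultdict(list)
    for f in findings:
        if f["recurrence_key"] is not None:  # only rows with a computed key
            groups[(f["tenant_id"], f["recurrence_key"])].append(f)
    for rows in groups.values():
        # Prefer an open row as canonical, else fall back to all rows;
        # break ties by most recently seen.
        open_rows = [r for r in rows if r["status"] in OPEN]
        canonical = max(open_rows or rows, key=lambda r: r["last_seen_at"])
        for r in rows:
            if r is not canonical:
                r.update(status="resolved",
                         resolved_reason="consolidated_duplicate",
                         recurrence_key=None)  # keep canonical uniqueness simple
    return findings
```

Clearing `recurrence_key` on non-canonical rows is what lets a plain unique index on `(tenant_id, recurrence_key)` hold after the backfill without special-casing historical duplicates.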
---

## 9. Capabilities + RBAC Enforcement

### Decision

Add tenant-context capabilities:

- `TENANT_FINDINGS_VIEW`
- `TENANT_FINDINGS_TRIAGE`
- `TENANT_FINDINGS_ASSIGN`
- `TENANT_FINDINGS_RESOLVE`
- `TENANT_FINDINGS_CLOSE`
- `TENANT_FINDINGS_RISK_ACCEPT`

Keep `TENANT_FINDINGS_ACKNOWLEDGE` as a deprecated alias for the v2 triage permission:

- UI enforcement and server-side policy checks treat `ACKNOWLEDGE` as sufficient for triage during the migration window.

### Rationale

- Aligns with RBAC-UX constitution requirements (registry-only strings, 404/403 semantics).
- Allows incremental rollout without breaking existing role mappings.

### Alternatives Considered

- Forcing all tenants to update role mappings at deploy time: rejected (operationally brittle).
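The deprecated-alias rule above amounts to one extra membership check in the policy layer. A minimal Python sketch with a hypothetical `can_triage` helper (the real checks live in the policy classes and Filament action visibility):

```python
# v2 capability plus its deprecated alias, accepted during the migration window.
TRIAGE_CAPABILITIES = {"TENANT_FINDINGS_TRIAGE", "TENANT_FINDINGS_ACKNOWLEDGE"}


def can_triage(user_capabilities) -> bool:
    """True if the user holds the v2 triage capability or its legacy alias."""
    return bool(TRIAGE_CAPABILITIES & set(user_capabilities))
```

Centralizing the alias in one set means dropping `TENANT_FINDINGS_ACKNOWLEDGE` at the end of the migration window is a one-line change rather than a sweep across every policy check.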