TenantAtlas/specs/111-findings-workflow-sla/research.md
2026-02-25 02:45:20 +01:00

# Research: 111 — Findings Workflow V2 + SLA
**Date**: 2026-02-24
**Branch**: `111-findings-workflow-sla`
---
## 1. Status Model + Legacy Mapping
### Decision
Keep `findings.status` as a string column and expand allowed v2 values. Preserve legacy `acknowledged` rows for compatibility, but treat `acknowledged` as `triaged` in the v2 workflow surface and migrate it during backfill.
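
The read-side half of this decision can be illustrated with a minimal sketch. It is written in Python purely for illustration; `LEGACY_STATUS_MAP` and `effective_status` are hypothetical names, and only the status strings come from this document.

```python
# Hypothetical helper: maps a stored status to the status shown in the v2 workflow.
# Only the "acknowledged" -> "triaged" mapping is taken from the decision above.
LEGACY_STATUS_MAP = {"acknowledged": "triaged"}


def effective_status(stored_status: str) -> str:
    """Return the v2 status for a stored value, leaving v2 values untouched."""
    return LEGACY_STATUS_MAP.get(stored_status, stored_status)
```

With a helper like this, rows that have not been backfilled yet still appear as `triaged` in the workflow surface, while the stored value is only rewritten by the backfill.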
### Rationale
- Existing findings already use `new|acknowledged|resolved` with `acknowledged_at/by` fields.
- Mapping `acknowledged → triaged` preserves intent while enabling the new workflow.
- Avoids high-risk data migrations that try to rewrite history beyond what the spec requires.
### Alternatives Considered
- Dropping the legacy `acknowledged` status immediately and forcing a hard migration in one deploy: rejected due to rollout risk.
---
## 2. Workflow Enforcement + Timestamps
### Decision
Enforce transitions server-side via a dedicated workflow service (single entrypoint used by Filament actions and any future API surfaces). Update timestamps on state changes:
- `triaged_at` set on `new|reopened → triaged`
- `in_progress_at` set on `triaged → in_progress`
- `resolved_at` + `resolved_reason` set on resolve
- `closed_at` + `closed_reason` + `closed_by_user_id` set on close and risk accept
- `reopened_at` set on reopen
When reopening, clear terminal state fields relevant to the previous terminal status (e.g., clear `resolved_at/reason` when moving to `reopened`).
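
A minimal sketch of such a single entrypoint, in Python for illustration only. The `Finding` dataclass, the function names, and `InvalidTransition` are hypothetical. The sketch also assumes that any open status may be resolved and that reopening starts from `resolved`; the close and risk-accept paths are omitted.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


class InvalidTransition(Exception):
    """Raised when a status change is not allowed from the current status."""


@dataclass
class Finding:
    status: str = "new"
    triaged_at: Optional[datetime] = None
    in_progress_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    resolved_reason: Optional[str] = None
    reopened_at: Optional[datetime] = None


def _require(finding: Finding, allowed_from: set, target: str) -> None:
    # Central guard so every caller gets the same validation.
    if finding.status not in allowed_from:
        raise InvalidTransition(f"{finding.status} -> {target}")


def triage(finding: Finding, now: datetime) -> None:
    _require(finding, {"new", "reopened"}, "triaged")
    finding.status = "triaged"
    finding.triaged_at = now


def start_progress(finding: Finding, now: datetime) -> None:
    _require(finding, {"triaged"}, "in_progress")
    finding.status = "in_progress"
    finding.in_progress_at = now


def resolve(finding: Finding, reason: str, now: datetime) -> None:
    # Assumption: resolving is allowed from any open status.
    _require(finding, OPEN_STATUSES, "resolved")
    finding.status = "resolved"
    finding.resolved_at = now
    finding.resolved_reason = reason


def reopen(finding: Finding, now: datetime) -> None:
    # Assumption: only the resolved -> reopened path is sketched here.
    _require(finding, {"resolved"}, "reopened")
    finding.status = "reopened"
    finding.reopened_at = now
    # Clear the terminal fields of the previous terminal status.
    finding.resolved_at = None
    finding.resolved_reason = None
```

Because the guard and the timestamp updates live in one place, UI actions and background jobs cannot disagree about which transitions are valid.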
### Rationale
- Keeps status validation consistent across UI and background jobs.
- Timestamp fields provide direct auditability for “when did we triage/close”.
- Clearing terminal fields prevents inconsistent states (e.g., `status=reopened` with `resolved_at` still set).
### Alternatives Considered
- Implementing all transitions as ad-hoc model mutations across multiple resources: rejected (harder to test and easy to drift).
---
## 3. SLA Policy Storage (SettingsRegistry)
### Decision
Add a workspace-resolvable setting:
- `domain = findings`
- `key = sla_days`
- `type = json`
- `systemDefault`:
  - `critical`: 3
  - `high`: 7
  - `medium`: 14
  - `low`: 30

Expose it via Workspace Settings UI as a JSON textarea, following the existing `drift.severity_mapping` pattern.
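
As an illustration of how a workspace value from the JSON textarea could be combined with the system default, here is a small Python sketch. It does not use the real `SettingsResolver` API; `resolve_sla_days` is a hypothetical function, and the per-key merge and the validation rules are assumptions.

```python
import json
from typing import Optional

# System default taken from the decision above.
SYSTEM_DEFAULT_SLA_DAYS = {"critical": 3, "high": 7, "medium": 14, "low": 30}


def resolve_sla_days(workspace_json: Optional[str] = None) -> dict:
    """Merge an optional workspace override (JSON text) over the system default.

    Assumptions: overrides are merged per severity, unknown severities are
    rejected, and each value must be a positive integer.
    """
    policy = dict(SYSTEM_DEFAULT_SLA_DAYS)
    if not workspace_json:
        return policy
    override = json.loads(workspace_json)
    if not isinstance(override, dict):
        raise ValueError("sla_days must be a JSON object")
    for severity, days in override.items():
        if severity not in SYSTEM_DEFAULT_SLA_DAYS:
            raise ValueError(f"unknown severity: {severity}")
        if isinstance(days, bool) or not isinstance(days, int) or days < 1:
            raise ValueError(f"invalid day count for {severity}: {days!r}")
        policy[severity] = days
    return policy
```

For example, a workspace that stores `{"high": 5}` would resolve to 5 days for `high` and keep the defaults for the other severities.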
### Rationale
- Existing code already uses `SettingsResolver` + `SettingsRegistry` for drift severity mapping.
- Keeps SLA policy queryable and adjustable without code deploys.
### Alternatives Considered
- Hardcoding SLA days in code/config: rejected (non-configurable and harder to tune per workspace).
---
## 4. due_at Computation Semantics
### Decision
- On create (new finding): set `sla_days` (resolved from settings) and set `due_at = first_seen_at + sla_days`.
- On reopen (manual or automatic): reset `due_at = now + sla_days(current severity)` and update `sla_days` to the current policy value.
- Severity changes while a finding remains open do not retroactively change `due_at` unless the finding is reopened (matches spec assumptions).
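
The two computations above reduce to simple date arithmetic. The following Python sketch uses hypothetical function names and example dates chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def due_at_on_create(first_seen_at: datetime, sla_days: int) -> datetime:
    """due_at for a new finding: first_seen_at + sla_days."""
    return first_seen_at + timedelta(days=sla_days)


def due_at_on_reopen(now: datetime, sla_days: int) -> datetime:
    """due_at after a reopen: now + sla_days under the current policy."""
    return now + timedelta(days=sla_days)


# Worked example: a finding first seen on 2026-02-01 under a 7-day SLA.
first_seen = datetime(2026, 2, 1, tzinfo=timezone.utc)
created_due = due_at_on_create(first_seen, 7)   # 2026-02-08

# Reopened on 2026-02-20, by which time the policy allows 5 days.
reopened = datetime(2026, 2, 20, tzinfo=timezone.utc)
reopened_due = due_at_on_reopen(reopened, 5)    # 2026-02-25
```

Between those two events the stored `due_at` stays at its original value, even if the severity or the policy changes.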
### Rationale
- Allows stable deadlines during remediation while still resetting on recurrence/reopen.
- Reduces surprising “deadline moved” behavior during open triage.
### Alternatives Considered
- Recomputing `due_at` on every detection run for open findings: rejected (deadlines drift and become hard to reason about).
---
## 5. SLA Due Alert Event (Tenant-Level Summary)
### Decision
Implement an SLA due producer in `EvaluateAlertsJob` that emits **one event per tenant** when a tenant has **newly-overdue** open findings in the evaluation window:
- Eligibility for producing an event (per tenant):
  - `due_at <= now()`
  - `due_at > windowStart` (newly overdue since last evaluation)
  - `status IN (new, triaged, in_progress, reopened)`
- Event summarizes **current** overdue counts for that tenant (not just newly overdue), so the alert body reflects the real state at emission time.
Event fields:
- `event_type = sla_due`
- `fingerprint_key = sla_due:tenant:{tenant_id}`
- `severity` = highest severity among overdue open findings (`critical` if any critical finding is overdue)
- `metadata` contains counts only (no per-finding payloads)
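
The eligibility check and the summary can be sketched for a single tenant as follows. This is an illustrative Python sketch, not the `EvaluateAlertsJob` code: `build_sla_due_event`, the dictionary shape of a finding, and the metadata keys `overdue_total` and `overdue_by_severity` are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def build_sla_due_event(tenant_id, findings, window_start: datetime,
                        now: datetime) -> Optional[dict]:
    """Return one summary event for the tenant, or None if nothing newly crossed due."""
    overdue = [f for f in findings
               if f["status"] in OPEN_STATUSES and f["due_at"] <= now]
    # The event is only produced when something became overdue inside the window.
    newly_overdue = [f for f in overdue if f["due_at"] > window_start]
    if not newly_overdue:
        return None
    # The body summarizes all currently overdue findings, not just the new ones.
    counts = {}
    for f in overdue:
        counts[f["severity"]] = counts.get(f["severity"], 0) + 1
    return {
        "event_type": "sla_due",
        "fingerprint_key": f"sla_due:tenant:{tenant_id}",
        "severity": max(counts, key=lambda s: SEVERITY_RANK[s]),
        "metadata": {"overdue_total": len(overdue), "overdue_by_severity": counts},
    }
```

A tenant whose findings have all been overdue for days produces no event on later runs, which is what keeps `alert_deliveries` from growing on every evaluation.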
### Rationale
- Avoids creating suppressed `alert_deliveries` every minute for persistently overdue tenants (no prune job exists for `alert_deliveries` today).
- Aligns “due” semantics with the due moment: a tenant produces an event when something crosses due.
### Alternatives Considered
- Emitting an SLA due event on every evaluation run when overdue exists: rejected due to `alert_deliveries` table growth and suppressed-delivery noise.
- Tracking last-emitted state per tenant in a new table: rejected for v1 (adds schema and state complexity).
---
## 6. Drift Recurrence: Stable recurrence_key + Canonical Row
### Decision
Add `recurrence_key` (64-char hex) to `findings` and treat it as the stable identity for drift recurrence. For drift findings:
- Compute `recurrence_key = sha256("drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}")`
- Upsert drift findings by `(tenant_id, recurrence_key)`
- Set drift finding `fingerprint = recurrence_key` for canonical drift rows going forward
`dimension` is stable and derived from evidence kind and change type:
- Policy snapshot drift: `policy_snapshot:{change_type}`
- Assignments drift: `policy_assignments`
- Scope tags drift: `policy_scope_tags`
- Baseline compare drift: `baseline_compare:{change_type}`
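
The key derivation can be sketched in a few lines of Python. The function name and the example argument values are hypothetical; the input string format follows the formula above.

```python
import hashlib


def drift_recurrence_key(tenant_id, scope_key, subject_type,
                         subject_external_id, dimension) -> str:
    """Return the 64-character hex SHA-256 digest of the drift identity string."""
    raw = f"drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Example values are placeholders, not real identifiers.
key = drift_recurrence_key(1, "scope-a", "policy", "ext-123", "policy_assignments")
```

Because baseline and current hashes are not part of the input, a subject that drifts again with different content maps to the same key and therefore to the same canonical row.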
### Rationale
- Prevents “new row per re-drift” even when baseline/current hashes change.
- Avoids conflicts with legacy drift fingerprints during consolidation because new canonical drift fingerprints are stable and distinct.
### Alternatives Considered
- Keeping drift fingerprint as baseline/current hash-based and updating it on the canonical row: rejected because it can collide with existing legacy rows (unique `(tenant_id, fingerprint)` constraint).
---
## 7. Drift Stale Auto-Resolve
### Decision
When generating drift findings for a scope/run, auto-resolve drift findings that were previously open for that scope but are not detected in the latest run:
- Filter: `finding_type=drift`, `scope_key=...`, `status IN (new, triaged, in_progress, reopened)`
- Not seen: `recurrence_key` is not in the latest run's set of detected `recurrence_key` values
- Resolve reason: `no_longer_detected`
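
A minimal in-memory sketch of this filter, in Python for illustration. `auto_resolve_stale` and the dictionary shape of a finding are hypothetical; a real implementation would presumably express the same conditions as a database query.

```python
from datetime import datetime

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


def auto_resolve_stale(findings, scope_key, seen_keys, now: datetime):
    """Resolve open drift findings in the scope that the latest run did not detect.

    `seen_keys` is the set of recurrence keys produced by the latest run.
    Returns the findings that were resolved.
    """
    resolved = []
    for f in findings:
        if (f["finding_type"] == "drift"
                and f["scope_key"] == scope_key
                and f["status"] in OPEN_STATUSES
                and f["recurrence_key"] not in seen_keys):
            f["status"] = "resolved"
            f["resolved_at"] = now
            f["resolved_reason"] = "no_longer_detected"
            resolved.append(f)
    return resolved
```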
### Rationale
- Keeps “Open” findings aligned with current observed state.
- Matches existing generator patterns (permission posture / Entra roles resolve stale records).
### Alternatives Considered
- Leaving stale findings open indefinitely: rejected (increases noise and breaks trust in “Open” list).
---
## 8. Backfill + Consolidation (OperationRun-Backed)
### Decision
Implement a tenant-scoped backfill/consolidation operation backed by `OperationRun`:
- Maps `acknowledged → triaged`
- Populates lifecycle fields (`first_seen_at`, `last_seen_at`, `times_seen`, `due_at`, `sla_days`, timestamps)
- Computes `recurrence_key` for drift and consolidates duplicates so only one canonical open finding remains per `(tenant_id, recurrence_key)`
- Due dates for legacy open findings: `due_at = backfill_started_at + sla_days` (prevents immediate overdue surge)
Duplicates strategy:
- Choose one canonical row per `(tenant_id, recurrence_key)` (prefer open, else most recently seen)
- Non-canonical duplicates become terminal (`resolved` with `resolved_reason=consolidated_duplicate`) and have `recurrence_key` cleared to keep canonical uniqueness simple
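
The duplicates strategy can be sketched as a grouping pass. This Python sketch is illustrative only: `consolidate` and the dictionary shape of a finding are hypothetical, and breaking ties between several open rows by the most recent `last_seen_at` is an assumption.

```python
from collections import defaultdict

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


def consolidate(findings):
    """Keep one canonical row per (tenant_id, recurrence_key); retire the rest.

    Returns the list of canonical rows. Non-canonical rows are mutated in place.
    """
    groups = defaultdict(list)
    for f in findings:
        if f["recurrence_key"] is not None:
            groups[(f["tenant_id"], f["recurrence_key"])].append(f)

    canonical_rows = []
    for rows in groups.values():
        # Prefer open rows, then the most recently seen row.
        canonical = max(
            rows, key=lambda f: (f["status"] in OPEN_STATUSES, f["last_seen_at"])
        )
        canonical_rows.append(canonical)
        for f in rows:
            if f is not canonical:
                f["status"] = "resolved"
                f["resolved_reason"] = "consolidated_duplicate"
                # Clearing the key keeps (tenant_id, recurrence_key) unique.
                f["recurrence_key"] = None
    return canonical_rows
```

Running the pass a second time changes nothing, because retired rows no longer carry a `recurrence_key` and each remaining group has a single member.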
### Rationale
- Meets OPS-UX requirements (queued toast, progress surfaces, initiator-only terminal notification).
- Makes legacy data usable without requiring manual cleanup.
### Alternatives Considered
- Deleting duplicate rows: rejected because the spec explicitly allows legacy rows to remain (and deletions are harder to justify operationally).
---
## 9. Capabilities + RBAC Enforcement
### Decision
Add tenant-context capabilities:
- `TENANT_FINDINGS_VIEW`
- `TENANT_FINDINGS_TRIAGE`
- `TENANT_FINDINGS_ASSIGN`
- `TENANT_FINDINGS_RESOLVE`
- `TENANT_FINDINGS_CLOSE`
- `TENANT_FINDINGS_RISK_ACCEPT`
Keep `TENANT_FINDINGS_ACKNOWLEDGE` as a deprecated alias for v2 triage permission:
- UI enforcement and server-side policy checks treat `ACKNOWLEDGE` as sufficient for triage during the migration window.
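
The alias rule can be sketched as a single lookup shared by UI and policy checks. This Python sketch is illustrative: `has_capability` and `DEPRECATED_ALIASES` are hypothetical names, and the constant names are used as stand-ins for the actual registry strings, which this document does not list.

```python
TRIAGE = "TENANT_FINDINGS_TRIAGE"
ACKNOWLEDGE = "TENANT_FINDINGS_ACKNOWLEDGE"  # deprecated alias

# Maps a required capability to deprecated capabilities that also satisfy it.
DEPRECATED_ALIASES = {TRIAGE: {ACKNOWLEDGE}}


def has_capability(granted, required: str) -> bool:
    """True if the role grants the capability directly or via a deprecated alias."""
    granted = set(granted)
    if required in granted:
        return True
    return bool(DEPRECATED_ALIASES.get(required, set()) & granted)
```

The alias only works in one direction: a role that holds `TENANT_FINDINGS_ACKNOWLEDGE` may triage, but it gains none of the other new capabilities.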
### Rationale
- Aligns with RBAC-UX constitution requirements (registry-only strings, 404/403 semantics).
- Allows incremental rollout without breaking existing role mappings.
### Alternatives Considered
- Forcing all tenants to update role mappings at deploy time: rejected (operationally brittle).