Research: 111 — Findings Workflow V2 + SLA
Date: 2026-02-24
Branch: 111-findings-workflow-sla
1. Status Model + Legacy Mapping
Decision
Keep `findings.status` as a string column and expand the allowed v2 values. Preserve legacy `acknowledged` rows for compatibility, but treat `acknowledged` as `triaged` in the v2 workflow surface and migrate it during backfill.
Rationale
- Existing findings already use `new|acknowledged|resolved` with `acknowledged_at/by` fields.
- Mapping `acknowledged → triaged` preserves intent while enabling the new workflow.
- Avoids high-risk data migrations that try to rewrite history beyond what the spec requires.
Alternatives Considered
- Dropping the legacy `acknowledged` status immediately and forcing a hard migration in one deploy: rejected due to rollout risk.
2. Workflow Enforcement + Timestamps
Decision
Enforce transitions server-side via a dedicated workflow service (single entrypoint used by Filament actions and any future API surfaces). Update timestamps on state changes:
- `triaged_at` set on `new|reopened → triaged`
- `in_progress_at` set on `triaged → in_progress`
- `resolved_at` + `resolved_reason` set on resolve
- `closed_at` + `closed_reason` + `closed_by_user_id` set on close and risk accept
- `reopened_at` set on reopen
When reopening, clear the terminal-state fields relevant to the previous terminal status (e.g., clear `resolved_at`/`resolved_reason` when moving to `reopened`).
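The transition and timestamp rules above can be sketched as a single-entrypoint function. This is a minimal Python illustration, not the actual Laravel service: the function name, the dict shape, and the assumption that resolve/close are allowed from any open state are all hypothetical.

```python
from datetime import datetime, timezone

OPEN = {"new", "triaged", "in_progress", "reopened"}

def apply_transition(finding, to_status, *, reason=None, user_id=None):
    """Validate a workflow transition and stamp the matching timestamps.

    Sketch only: the real service is the sole entrypoint used by
    Filament actions and future API surfaces.
    """
    now = datetime.now(timezone.utc)
    src = finding["status"]
    if to_status == "triaged" and src in {"new", "reopened"}:
        finding["triaged_at"] = now
    elif to_status == "in_progress" and src == "triaged":
        finding["in_progress_at"] = now
    elif to_status == "resolved" and src in OPEN:
        finding["resolved_at"], finding["resolved_reason"] = now, reason
    elif to_status in {"closed", "risk_accepted"} and src in OPEN:
        # Close and risk accept share the closed_* fields.
        finding.update(closed_at=now, closed_reason=reason, closed_by_user_id=user_id)
    elif to_status == "reopened" and src not in OPEN:
        finding["reopened_at"] = now
        # Clear terminal fields; only those from the previous
        # terminal status were populated.
        for field in ("resolved_at", "resolved_reason",
                      "closed_at", "closed_reason", "closed_by_user_id"):
            finding[field] = None
    else:
        raise ValueError(f"invalid transition {src} -> {to_status}")
    finding["status"] = to_status
    return finding
```

Routing every mutation through one function is what makes the "no drift across UI and background jobs" rationale hold.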
Rationale
- Keeps status validation consistent across UI and background jobs.
- Timestamp fields provide direct auditability for “when did we triage/close”.
- Clearing terminal fields prevents inconsistent states (e.g., `status = reopened` with `resolved_at` still set).
Alternatives Considered
- Implementing all transitions as ad-hoc model mutations across multiple resources: rejected (harder to test and easy to drift).
3. SLA Policy Storage (SettingsRegistry)
Decision
Add a workspace-resolvable setting:
- `domain = findings`
- `key = sla_days`
- `type = json`
- `systemDefault`:
  - critical: 3
  - high: 7
  - medium: 14
  - low: 30

Expose it via the Workspace Settings UI as a JSON textarea, following the existing `drift.severity_mapping` pattern.
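The resolution behavior can be sketched as follows. This is an illustrative Python stand-in for the `SettingsResolver`/`SettingsRegistry` path, assuming a workspace override (when present) fully replaces the system default; the function name is hypothetical.

```python
import json

# System default stored behind the findings.sla_days registry entry.
SYSTEM_DEFAULT = json.dumps({"critical": 3, "high": 7, "medium": 14, "low": 30})

def resolve_sla_days(severity, workspace_override=None):
    """Return SLA days for a severity; workspace JSON wins over the default."""
    policy = json.loads(workspace_override or SYSTEM_DEFAULT)
    return int(policy[severity])
```

Keeping the policy as JSON in settings is what allows per-workspace tuning without a code deploy.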
Rationale
- Existing code already uses `SettingsResolver` + `SettingsRegistry` for the drift severity mapping.
- Keeps SLA policy queryable and adjustable without code deploys.
Alternatives Considered
- Hardcoding SLA days in code/config: rejected (non-configurable and harder to tune per workspace).
4. due_at Computation Semantics
Decision
- On create (new finding): set `sla_days` (resolved from settings) and set `due_at = first_seen_at + sla_days`.
- On reopen (manual or automatic): reset `due_at = now + sla_days` (for the current severity) and update `sla_days` to the current policy value.
- Severity changes while a finding remains open do not retroactively change `due_at` unless the finding is reopened (matches spec assumptions).
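The two computation paths can be sketched directly from the formulas above (Python illustration; the policy dict and function names are assumptions):

```python
from datetime import datetime, timedelta, timezone

# Example policy; in practice this is resolved from findings.sla_days.
POLICY = {"critical": 3, "high": 7, "medium": 14, "low": 30}

def due_at_on_create(first_seen_at, severity, policy=POLICY):
    """On create: due_at = first_seen_at + sla_days."""
    days = policy[severity]
    return first_seen_at + timedelta(days=days), days

def due_at_on_reopen(severity, now, policy=POLICY):
    """On reopen: due_at = now + sla_days for the *current* severity/policy."""
    days = policy[severity]
    return now + timedelta(days=days), days
```

Note the asymmetry: create anchors on `first_seen_at`, reopen anchors on the reopen moment, and only reopen re-reads the policy.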
Rationale
- Allows stable deadlines during remediation while still resetting on recurrence/reopen.
- Reduces surprising “deadline moved” behavior during open triage.
Alternatives Considered
- Recomputing `due_at` on every detection run for open findings: rejected (deadlines drift and become hard to reason about).
5. SLA Due Alert Event (Tenant-Level Summary)
Decision
Implement an SLA due producer in EvaluateAlertsJob that emits one event per tenant when a tenant has newly-overdue open findings in the evaluation window:
- Eligibility for producing an event (per tenant):
  - `due_at <= now()`
  - `due_at > windowStart` (newly overdue since the last evaluation)
  - `status IN (new, triaged, in_progress, reopened)`
- Event summarizes current overdue counts for that tenant (not just newly overdue), so the alert body reflects the real state at emission time.
Event fields:
- `event_type = sla_due`
- `fingerprint_key = sla_due:tenant:{tenant_id}`
- `severity` = max severity among overdue open findings (critical if any critical overdue exists)
- `metadata` contains counts only (no per-finding payloads)
Rationale
- Avoids creating suppressed `alert_deliveries` rows every minute for persistently overdue tenants (no prune job exists for `alert_deliveries` today).
- Aligns "due" semantics with the due moment: a tenant produces an event when something crosses due.
Alternatives Considered
- Emitting an SLA due event on every evaluation run when overdue findings exist: rejected due to `alert_deliveries` table growth and suppressed-delivery noise.
- Tracking last-emitted state per tenant in a new table: rejected for v1 (adds schema and state complexity).
6. Drift Recurrence: Stable recurrence_key + Canonical Row
Decision
Add `recurrence_key` (64-char hex) to findings and treat it as the stable identity for drift recurrence. For drift findings:
- Compute `recurrence_key = sha256("drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}")`
- Upsert drift findings by `(tenant_id, recurrence_key)`
- Set drift finding `fingerprint = recurrence_key` for canonical drift rows going forward

`dimension` is stable and derived from evidence kind and change type:
- Policy snapshot drift: `policy_snapshot:{change_type}`
- Assignments drift: `policy_assignments`
- Scope tags drift: `policy_scope_tags`
- Baseline compare drift: `baseline_compare:{change_type}`
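The key formula and dimension mapping translate directly to code. A Python sketch (the helper names and the `evidence_kind` strings are assumptions; the hash input follows the formula above):

```python
import hashlib

def drift_recurrence_key(tenant_id, scope_key, subject_type,
                         subject_external_id, dimension):
    """64-char hex key: stable across re-drifts of the same subject/dimension."""
    raw = (f"drift:{tenant_id}:{scope_key}:{subject_type}"
           f":{subject_external_id}:{dimension}")
    return hashlib.sha256(raw.encode()).hexdigest()

def drift_dimension(evidence_kind, change_type=None):
    """Map evidence kind (+ change type where applicable) to the dimension."""
    if evidence_kind == "policy_snapshot":
        return f"policy_snapshot:{change_type}"
    if evidence_kind == "baseline_compare":
        return f"baseline_compare:{change_type}"
    return {"policy_assignments": "policy_assignments",
            "policy_scope_tags": "policy_scope_tags"}[evidence_kind]
```

Because baseline/current hashes are excluded from the input, the key stays stable even when the observed values change between drifts.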
Rationale
- Prevents “new row per re-drift” even when baseline/current hashes change.
- Avoids conflicts with legacy drift fingerprints during consolidation because new canonical drift fingerprints are stable and distinct.
Alternatives Considered
- Keeping the drift fingerprint baseline/current hash-based and updating it on the canonical row: rejected because it can collide with existing legacy rows (unique `(tenant_id, fingerprint)` constraint).
7. Drift Stale Auto-Resolve
Decision
When generating drift findings for a scope/run, auto-resolve drift findings that were previously open for that scope but are not detected in the latest run:
- Filter: `finding_type = drift`, `scope_key = ...`, `status IN (new, triaged, in_progress, reopened)`
- Not seen in the run's `recurrence_key` set
- Resolve reason: `no_longer_detected`
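The stale-resolve filter, as an in-memory Python sketch (the real generator would express this as a single scoped UPDATE; function name and dict shape are illustrative):

```python
OPEN = {"new", "triaged", "in_progress", "reopened"}

def auto_resolve_stale(open_findings, seen_keys, scope_key):
    """Resolve open drift findings for a scope whose recurrence_key was
    not observed in the latest run."""
    resolved = []
    for f in open_findings:
        if (f["finding_type"] == "drift"
                and f["scope_key"] == scope_key
                and f["status"] in OPEN
                and f["recurrence_key"] not in seen_keys):
            f["status"] = "resolved"
            f["resolved_reason"] = "no_longer_detected"
            resolved.append(f)
    return resolved
```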
Rationale
- Keeps “Open” findings aligned with current observed state.
- Matches existing generator patterns (permission posture / Entra roles resolve stale records).
Alternatives Considered
- Leaving stale findings open indefinitely: rejected (increases noise and breaks trust in “Open” list).
8. Backfill + Consolidation (OperationRun-Backed)
Decision
Implement a tenant-scoped backfill/consolidation operation backed by OperationRun:
- Maps `acknowledged → triaged`
- Populates lifecycle fields (`first_seen_at`, `last_seen_at`, `times_seen`, `due_at`, `sla_days`, timestamps)
- Computes `recurrence_key` for drift and consolidates duplicates so only one canonical open finding remains per `(tenant_id, recurrence_key)`
- Due dates for legacy open findings: `due_at = backfill_started_at + sla_days` (prevents an immediate overdue surge)
Duplicates strategy:
- Choose one canonical row per `(tenant_id, recurrence_key)` (prefer open, else most recently seen)
- Non-canonical duplicates become terminal (`resolved` with `resolved_reason = consolidated_duplicate`) and have `recurrence_key` cleared to keep canonical uniqueness simple
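The duplicates strategy can be sketched as follows (Python illustration; the tie-break among multiple open rows by `last_seen_at` is an assumption the decision does not pin down):

```python
OPEN = {"new", "triaged", "in_progress", "reopened"}

def consolidate(rows):
    """Pick one canonical row per (tenant_id, recurrence_key) group and
    make every other row terminal with recurrence_key cleared."""
    open_rows = [r for r in rows if r["status"] in OPEN]
    pool = open_rows or rows  # prefer open, else fall back to all
    canonical = max(pool, key=lambda r: r["last_seen_at"])
    for r in rows:
        if r is not canonical:
            r["status"] = "resolved"
            r["resolved_reason"] = "consolidated_duplicate"
            r["recurrence_key"] = None  # keeps canonical uniqueness simple
    return canonical
```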
Rationale
- Meets OPS-UX requirements (queued toast, progress surfaces, initiator-only terminal notification).
- Makes legacy data usable without requiring manual cleanup.
Alternatives Considered
- Deleting duplicate rows: rejected because the spec explicitly allows legacy rows to remain (and deletions are harder to justify operationally).
9. Capabilities + RBAC Enforcement
Decision
Add tenant-context capabilities:
- `TENANT_FINDINGS_VIEW`
- `TENANT_FINDINGS_TRIAGE`
- `TENANT_FINDINGS_ASSIGN`
- `TENANT_FINDINGS_RESOLVE`
- `TENANT_FINDINGS_CLOSE`
- `TENANT_FINDINGS_RISK_ACCEPT`

Keep `TENANT_FINDINGS_ACKNOWLEDGE` as a deprecated alias for the v2 triage permission:
- UI enforcement and server-side policy checks treat `ACKNOWLEDGE` as sufficient for triage during the migration window.
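The alias behavior amounts to an OR in the policy check. A minimal sketch (the function name is hypothetical; real enforcement lives in the server-side policies with 404/403 semantics):

```python
# Deprecated ACKNOWLEDGE capability is accepted wherever v2 triage
# permission is required, for the duration of the migration window.
TRIAGE_CAPS = {"TENANT_FINDINGS_TRIAGE", "TENANT_FINDINGS_ACKNOWLEDGE"}

def can_triage(user_capabilities):
    """True if the user holds the v2 triage capability or its legacy alias."""
    return bool(TRIAGE_CAPS & set(user_capabilities))
```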
Rationale
- Aligns with RBAC-UX constitution requirements (registry-only strings, 404/403 semantics).
- Allows incremental rollout without breaking existing role mappings.
Alternatives Considered
- Forcing all tenants to update role mappings at deploy time: rejected (operationally brittle).