TenantAtlas/specs/111-findings-workflow-sla/research.md
2026-02-25 02:45:20 +01:00

Research: 111 — Findings Workflow V2 + SLA

Date: 2026-02-24
Branch: 111-findings-workflow-sla


1. Status Model + Legacy Mapping

Decision

Keep findings.status as a string column and expand allowed v2 values. Preserve legacy acknowledged rows for compatibility, but treat acknowledged as triaged in the v2 workflow surface and migrate it during backfill.
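The read-side mapping could look roughly like the sketch below. It is an illustration only: the helper name workflow_status and the lookup table are hypothetical, and it only shows how a stored legacy value might be presented as its v2 equivalent until the backfill rewrites the row.

```python
# Hypothetical read-side mapping; names are illustrative, not from the spec.
LEGACY_STATUS_MAP = {"acknowledged": "triaged"}


def workflow_status(stored_status: str) -> str:
    """Return the status the v2 workflow surface should treat a stored value as."""
    return LEGACY_STATUS_MAP.get(stored_status, stored_status)
```

Statuses that have no legacy mapping pass through unchanged, so rows that were already migrated are unaffected.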

Rationale

  • Existing findings already use new|acknowledged|resolved with acknowledged_at/by fields.
  • Mapping acknowledged → triaged preserves intent while enabling the new workflow.
  • Avoids high-risk data migrations that try to rewrite history beyond what the spec requires.

Alternatives Considered

  • Dropping the legacy acknowledged status immediately and forcing a hard migration in one deploy: rejected due to rollout risk.

2. Workflow Enforcement + Timestamps

Decision

Enforce transitions server-side via a dedicated workflow service (single entrypoint used by Filament actions and any future API surfaces). Update timestamps on state changes:

  • triaged_at set on new|reopened → triaged
  • in_progress_at set on triaged → in_progress
  • resolved_at + resolved_reason set on resolve
  • closed_at + closed_reason + closed_by_user_id set on close and risk accept
  • reopened_at set on reopen

When reopening, clear the terminal-state fields that belong to the previous terminal status (e.g., clear resolved_at and resolved_reason when moving from resolved to reopened).
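A minimal sketch of such a single entrypoint follows. The class and method names are hypothetical, the sketch covers only part of the workflow (close and risk accept are omitted), and two rules are assumptions not stated above: that any open status may be resolved, and that reopening is shown only from resolved.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


class InvalidTransition(Exception):
    """Raised when a requested status change is not allowed."""


@dataclass
class Finding:
    status: str = "new"
    triaged_at: Optional[datetime] = None
    in_progress_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    resolved_reason: Optional[str] = None
    reopened_at: Optional[datetime] = None


class FindingWorkflow:
    """Hypothetical single entrypoint for all status changes."""

    def _require(self, finding: Finding, allowed: set) -> None:
        if finding.status not in allowed:
            raise InvalidTransition(f"transition not allowed from {finding.status}")

    def triage(self, finding: Finding, now: datetime) -> None:
        self._require(finding, {"new", "reopened"})
        finding.status = "triaged"
        finding.triaged_at = now

    def start_progress(self, finding: Finding, now: datetime) -> None:
        self._require(finding, {"triaged"})
        finding.status = "in_progress"
        finding.in_progress_at = now

    def resolve(self, finding: Finding, reason: str, now: datetime) -> None:
        # Assumption: any open status may be resolved.
        self._require(finding, OPEN_STATUSES)
        finding.status = "resolved"
        finding.resolved_at = now
        finding.resolved_reason = reason

    def reopen(self, finding: Finding, now: datetime) -> None:
        # Only the resolved -> reopened path is sketched here.
        self._require(finding, {"resolved"})
        finding.status = "reopened"
        finding.reopened_at = now
        # Clear the fields that belonged to the previous terminal status.
        finding.resolved_at = None
        finding.resolved_reason = None
```

Because every caller goes through the same methods, an invalid jump such as reopened to in_progress fails in one place regardless of which surface requested it.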

Rationale

  • Keeps status validation consistent across UI and background jobs.
  • Timestamp fields provide direct auditability for “when did we triage/close”.
  • Clearing terminal fields prevents inconsistent states (e.g., status=reopened with resolved_at still set).

Alternatives Considered

  • Implementing all transitions as ad-hoc model mutations across multiple resources: rejected (harder to test and easy to drift).

3. SLA Policy Storage (SettingsRegistry)

Decision

Add a workspace-resolvable setting:

  • domain = findings
  • key = sla_days
  • type = json
  • systemDefault:
    • critical: 3
    • high: 7
    • medium: 14
    • low: 30

Expose it via Workspace Settings UI as a JSON textarea, following the existing drift.severity_mapping pattern.
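As an illustration of how a workspace value might be resolved against the system default, consider the sketch below. The function name and the fallback rules (unknown severities and non-positive values are ignored, missing severities fall back to the default) are assumptions; the actual SettingsResolver behavior is not described here.

```python
import json
from typing import Optional

# System default from the setting definition above.
SYSTEM_DEFAULT_SLA_DAYS = {"critical": 3, "high": 7, "medium": 14, "low": 30}


def resolve_sla_days(workspace_json: Optional[str]) -> dict:
    """Merge a workspace JSON override over the system default (hypothetical rules)."""
    resolved = dict(SYSTEM_DEFAULT_SLA_DAYS)
    if not workspace_json:
        return resolved
    override = json.loads(workspace_json)
    if not isinstance(override, dict):
        return resolved
    for severity, days in override.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        valid = isinstance(days, int) and not isinstance(days, bool) and days > 0
        if severity in resolved and valid:
            resolved[severity] = days
    return resolved
```

For example, a textarea value of {"critical": 1} would tighten only the critical deadline and leave the other severities at their defaults.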

Rationale

  • Existing code already uses SettingsResolver + SettingsRegistry for drift severity mapping.
  • Keeps SLA policy queryable and adjustable without code deploys.

Alternatives Considered

  • Hardcoding SLA days in code/config: rejected (non-configurable and harder to tune per workspace).

4. due_at Computation Semantics

Decision

  • On create (new finding): set sla_days (resolved from settings for the finding's severity) and set due_at = first_seen_at + sla_days.
  • On reopen (manual or automatic): reset due_at = now + sla_days(current severity) and update sla_days to the current policy value.
  • Severity changes while a finding remains open do not retroactively change due_at unless the finding is reopened (matches spec assumptions).
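The two computation points can be sketched as follows. Function names are hypothetical, and the policy dictionary simply reuses the system default values from section 3.

```python
from datetime import datetime, timedelta

SLA_DAYS = {"critical": 3, "high": 7, "medium": 14, "low": 30}


def due_on_create(first_seen_at: datetime, severity: str, policy: dict = SLA_DAYS):
    """Return (sla_days, due_at) for a newly created finding."""
    days = policy[severity]
    return days, first_seen_at + timedelta(days=days)


def due_on_reopen(now: datetime, current_severity: str, policy: dict = SLA_DAYS):
    """Return (sla_days, due_at) when a finding is reopened.

    Uses the severity and policy at reopen time, so the deadline restarts from now.
    """
    days = policy[current_severity]
    return days, now + timedelta(days=days)
```

A high-severity finding first seen on 2026-02-01 would be due on 2026-02-08; if it is later reopened as critical on 2026-03-01, the deadline becomes 2026-03-04. There is deliberately no function for a severity change on an open finding, since that case leaves due_at untouched.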

Rationale

  • Allows stable deadlines during remediation while still resetting on recurrence/reopen.
  • Reduces surprising “deadline moved” behavior during open triage.

Alternatives Considered

  • Recomputing due_at on every detection run for open findings: rejected (deadlines drift and become hard to reason about).

5. SLA Due Alert Event (Tenant-Level Summary)

Decision

Implement an SLA due producer in EvaluateAlertsJob that emits one event per tenant when that tenant has newly overdue open findings in the evaluation window:

  • Eligibility for producing an event (per tenant):
    • due_at <= now()
    • due_at > windowStart (newly overdue since last evaluation)
    • status IN (new, triaged, in_progress, reopened)
  • Event summarizes current overdue counts for that tenant (not just newly overdue), so the alert body reflects the real state at emission time.

Event fields:

  • event_type = sla_due
  • fingerprint_key = sla_due:tenant:{tenant_id}
  • severity = max severity among overdue open findings (critical if any critical overdue exists)
  • metadata contains counts only (no per-finding payloads)
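The eligibility check and the summary payload might be combined along these lines. This is a sketch over in-memory dictionaries rather than a database query, and the function name and the metadata key names (overdue_total, overdue_by_severity) are hypothetical.

```python
from datetime import datetime
from typing import Optional

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def build_sla_due_event(tenant_id, findings, window_start: datetime, now: datetime) -> Optional[dict]:
    """Return one summary event for the tenant, or None if nothing newly crossed due_at."""
    overdue = [
        f for f in findings
        if f["status"] in OPEN_STATUSES and f["due_at"] is not None and f["due_at"] <= now
    ]
    # Emit only when at least one finding became overdue inside the window.
    if not any(f["due_at"] > window_start for f in overdue):
        return None
    # Counts cover all currently overdue findings, not just the new ones.
    counts = {}
    for f in overdue:
        counts[f["severity"]] = counts.get(f["severity"], 0) + 1
    return {
        "event_type": "sla_due",
        "fingerprint_key": f"sla_due:tenant:{tenant_id}",
        "severity": max(counts, key=lambda s: SEVERITY_RANK.get(s, -1)),
        "metadata": {"overdue_total": len(overdue), "overdue_by_severity": counts},
    }
```

A tenant whose findings have all been overdue since before windowStart yields None, which is what keeps persistently overdue tenants from producing an event on every run.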

Rationale

  • Avoids creating suppressed alert_deliveries every minute for persistently overdue tenants (no prune job exists for alert_deliveries today).
  • Aligns “due” semantics with the due moment: a tenant produces an event when something crosses due.

Alternatives Considered

  • Emitting an SLA due event on every evaluation run when overdue exists: rejected due to alert_deliveries table growth and suppressed-delivery noise.
  • Tracking last-emitted state per tenant in a new table: rejected for v1 (adds schema and state complexity).

6. Drift Recurrence: Stable recurrence_key + Canonical Row

Decision

Add recurrence_key (64-char hex) to findings and treat it as the stable identity for drift recurrence. For drift findings:

  • Compute recurrence_key = sha256("drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}")
  • Upsert drift findings by (tenant_id, recurrence_key)
  • Set drift finding fingerprint = recurrence_key for canonical drift rows going forward

dimension is stable and derived from evidence kind and change type:

  • Policy snapshot drift: policy_snapshot:{change_type}
  • Assignments drift: policy_assignments
  • Scope tags drift: policy_scope_tags
  • Baseline compare drift: baseline_compare:{change_type}
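The key derivation can be sketched as follows. The function names and the evidence-kind labels passed to drift_dimension are hypothetical; the input string format and the dimension values are taken from the lists above.

```python
import hashlib
from typing import Optional

# Evidence kinds whose dimension includes the change type (labels are illustrative).
KINDS_WITH_CHANGE_TYPE = {"policy_snapshot", "baseline_compare"}


def drift_dimension(evidence_kind: str, change_type: Optional[str] = None) -> str:
    """Build the stable dimension segment for a drift finding."""
    if evidence_kind in KINDS_WITH_CHANGE_TYPE:
        return f"{evidence_kind}:{change_type}"
    return evidence_kind  # e.g. policy_assignments, policy_scope_tags


def drift_recurrence_key(tenant_id, scope_key, subject_type, subject_external_id, dimension) -> str:
    """Return the 64-char hex SHA-256 of the stable drift identity string."""
    raw = f"drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Baseline and current hashes are not part of the input, so the same subject drifting again on the same dimension produces the same key and updates the existing row instead of creating a new one.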

Rationale

  • Prevents “new row per re-drift” even when baseline/current hashes change.
  • Avoids conflicts with legacy drift fingerprints during consolidation because new canonical drift fingerprints are stable and distinct.

Alternatives Considered

  • Keeping drift fingerprint as baseline/current hash-based and updating it on the canonical row: rejected because it can collide with existing legacy rows (unique (tenant_id, fingerprint) constraint).

7. Drift Stale Auto-Resolve

Decision

When generating drift findings for a scope/run, auto-resolve drift findings that were previously open for that scope but are not detected in the latest run:

  • Filter: finding_type=drift, scope_key=..., status IN (new, triaged, in_progress, reopened)
  • recurrence_key is not in the set of recurrence_key values seen in the latest run
  • Resolve reason: no_longer_detected
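A sketch of the stale-resolution pass over in-memory rows is shown below. The function name is hypothetical, and a real implementation would presumably express the filter as a query and route the status change through the workflow service.

```python
from datetime import datetime

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


def resolve_stale_drift(findings, scope_key: str, seen_keys: set, now: datetime) -> list:
    """Resolve open drift findings in the scope that the latest run did not detect."""
    resolved = []
    for f in findings:
        is_candidate = (
            f["finding_type"] == "drift"
            and f["scope_key"] == scope_key
            and f["status"] in OPEN_STATUSES
        )
        if is_candidate and f["recurrence_key"] not in seen_keys:
            f["status"] = "resolved"
            f["resolved_at"] = now
            f["resolved_reason"] = "no_longer_detected"
            resolved.append(f)
    return resolved
```

Findings from other scopes and findings that are already terminal are left alone, so a run for one scope cannot resolve drift that belongs to another.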

Rationale

  • Keeps “Open” findings aligned with current observed state.
  • Matches existing generator patterns (permission posture / Entra roles resolve stale records).

Alternatives Considered

  • Leaving stale findings open indefinitely: rejected (increases noise and breaks trust in “Open” list).

8. Backfill + Consolidation (OperationRun-Backed)

Decision

Implement a tenant-scoped backfill/consolidation operation backed by OperationRun:

  • Maps acknowledged → triaged
  • Populates lifecycle fields (first_seen_at, last_seen_at, times_seen, due_at, sla_days, timestamps)
  • Computes recurrence_key for drift and consolidates duplicates so only one canonical open finding remains per (tenant_id, recurrence_key)
  • Due dates for legacy open findings: due_at = backfill_started_at + sla_days (prevents immediate overdue surge)

Duplicates strategy:

  • Choose one canonical row per (tenant_id, recurrence_key) (prefer open, else most recently seen)
  • Non-canonical duplicates become terminal (resolved with resolved_reason=consolidated_duplicate) and have recurrence_key cleared to keep canonical uniqueness simple
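The duplicates strategy could be sketched like this. The function name is hypothetical, and two details are assumptions: ties between several open rows are broken by most recent last_seen_at, and resolved_at for consolidated rows is set to the consolidation time.

```python
from collections import defaultdict
from datetime import datetime

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


def consolidate(rows, now: datetime) -> list:
    """Keep one canonical row per (tenant_id, recurrence_key); retire the rest."""
    groups = defaultdict(list)
    for row in rows:
        if row["recurrence_key"] is not None:
            groups[(row["tenant_id"], row["recurrence_key"])].append(row)

    canonical_rows = []
    for group in groups.values():
        # Prefer an open row; otherwise (and among ties) the most recently seen.
        canonical = max(group, key=lambda r: (r["status"] in OPEN_STATUSES, r["last_seen_at"]))
        for row in group:
            if row is canonical:
                continue
            row["status"] = "resolved"
            row["resolved_reason"] = "consolidated_duplicate"
            row["resolved_at"] = now  # assumption: stamped with the consolidation time
            row["recurrence_key"] = None  # keeps (tenant_id, recurrence_key) unique
        canonical_rows.append(canonical)
    return canonical_rows
```

Clearing recurrence_key on the retired rows means a later upsert by (tenant_id, recurrence_key) can only ever match the canonical row.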

Rationale

  • Meets OPS-UX requirements (queued toast, progress surfaces, initiator-only terminal notification).
  • Makes legacy data usable without requiring manual cleanup.

Alternatives Considered

  • Deleting duplicate rows: rejected because the spec explicitly allows legacy rows to remain (and deletions are harder to justify operationally).

9. Capabilities + RBAC Enforcement

Decision

Add tenant-context capabilities:

  • TENANT_FINDINGS_VIEW
  • TENANT_FINDINGS_TRIAGE
  • TENANT_FINDINGS_ASSIGN
  • TENANT_FINDINGS_RESOLVE
  • TENANT_FINDINGS_CLOSE
  • TENANT_FINDINGS_RISK_ACCEPT

Keep TENANT_FINDINGS_ACKNOWLEDGE as a deprecated alias for v2 triage permission:

  • UI enforcement and server-side policy checks treat ACKNOWLEDGE as sufficient for triage during the migration window.
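One way to express the alias during the migration window is sketched below. The function name and the alias table are hypothetical; the capability strings are the ones listed above. The alias is one-directional here, which is an assumption: ACKNOWLEDGE satisfies a TRIAGE check, but nothing is implied about the reverse.

```python
TRIAGE = "TENANT_FINDINGS_TRIAGE"
ACKNOWLEDGE = "TENANT_FINDINGS_ACKNOWLEDGE"

# Required capability -> deprecated capabilities that are also accepted for it.
DEPRECATED_ALIASES = {TRIAGE: {ACKNOWLEDGE}}


def has_capability(granted: set, required: str) -> bool:
    """Check a required capability, honoring deprecated aliases."""
    if required in granted:
        return True
    return bool(DEPRECATED_ALIASES.get(required, set()) & granted)
```

Removing the entry from the alias table at the end of the migration window would retire the legacy permission without touching any call sites.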

Rationale

  • Aligns with RBAC-UX constitution requirements (registry-only strings, 404/403 semantics).
  • Allows incremental rollout without breaking existing role mappings.

Alternatives Considered

  • Forcing all tenants to update role mappings at deploy time: rejected (operationally brittle).