ahmido/TenantAtlas

Ahmed Darrazi 008beb88c0 feat(111): findings workflow + SLA settings

2026-02-25 02:45:20 +01:00

Research: 111 — Findings Workflow V2 + SLA

Date: 2026-02-24
Branch: 111-findings-workflow-sla

1. Status Model + Legacy Mapping

Decision

Keep findings.status as a string column and expand the allowed values for v2. Preserve legacy acknowledged rows for compatibility, but treat acknowledged as triaged in the v2 workflow surface and migrate those rows to triaged during backfill.
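
A minimal sketch of the read-side mapping, assuming the workflow surface normalizes stored values before display. The constant and function names are hypothetical; only the status strings come from this document.

```python
# Status strings (new, acknowledged, triaged, resolved) come from the text above;
# LEGACY_STATUS_MAP and workflow_status are hypothetical names for illustration.
LEGACY_STATUS_MAP = {"acknowledged": "triaged"}


def workflow_status(stored_status: str) -> str:
    """Return the status the v2 workflow surface should show for a stored value."""
    return LEGACY_STATUS_MAP.get(stored_status, stored_status)
```

With this shape, rows that the backfill has not yet rewritten still appear as triaged, so the UI does not need a separate legacy branch.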

Rationale

Existing findings already use new|acknowledged|resolved with acknowledged_at/by fields.
Mapping acknowledged → triaged preserves intent while enabling the new workflow.
Avoids high-risk data migrations that try to rewrite history beyond what the spec requires.

Alternatives Considered

Dropping the legacy acknowledged status immediately and forcing a hard migration in one deploy: rejected due to rollout risk.

2. Workflow Enforcement + Timestamps

Decision

Enforce transitions server-side via a dedicated workflow service (single entrypoint used by Filament actions and any future API surfaces). Update timestamps on state changes:

triaged_at set on new|reopened → triaged
in_progress_at set on triaged → in_progress
resolved_at + resolved_reason set on resolve
closed_at + closed_reason + closed_by_user_id set on close and risk accept
reopened_at set on reopen

When reopening, clear the terminal-state fields that belong to the previous terminal status (e.g., clear resolved_at and resolved_reason when moving from resolved to reopened).
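
The rules above can be sketched as a single transition function. This is an illustration under stated assumptions, not the project's service: the closed and risk_accepted status strings, the full transition table, and all class and function names are assumed. Only the timestamp fields and the new|reopened → triaged and triaged → in_progress edges are taken from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

OPEN = {"new", "triaged", "in_progress", "reopened"}
TERMINAL = {"resolved", "closed", "risk_accepted"}  # assumed status strings

# Target status -> statuses it may be entered from (largely assumed).
ALLOWED_FROM = {
    "triaged": {"new", "reopened"},
    "in_progress": {"triaged"},
    "resolved": OPEN,
    "closed": OPEN,
    "risk_accepted": OPEN,
    "reopened": TERMINAL,
}


@dataclass
class Finding:
    status: str = "new"
    triaged_at: Optional[datetime] = None
    in_progress_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    resolved_reason: Optional[str] = None
    closed_at: Optional[datetime] = None
    closed_reason: Optional[str] = None
    closed_by_user_id: Optional[int] = None
    reopened_at: Optional[datetime] = None


def apply_transition(finding, target, now=None, *, reason=None, user_id=None):
    """Validate the transition, then set or clear the matching lifecycle fields."""
    now = now or datetime.now(timezone.utc)
    previous = finding.status
    if previous not in ALLOWED_FROM.get(target, set()):
        raise ValueError(f"transition {previous} -> {target} is not allowed")

    if target == "triaged":
        finding.triaged_at = now
    elif target == "in_progress":
        finding.in_progress_at = now
    elif target == "resolved":
        finding.resolved_at = now
        finding.resolved_reason = reason
    elif target in ("closed", "risk_accepted"):
        finding.closed_at = now
        finding.closed_reason = reason
        finding.closed_by_user_id = user_id
    elif target == "reopened":
        finding.reopened_at = now
        # Clear only the fields that belong to the previous terminal status.
        if previous == "resolved":
            finding.resolved_at = None
            finding.resolved_reason = None
        else:
            finding.closed_at = None
            finding.closed_reason = None
            finding.closed_by_user_id = None

    finding.status = target
    return finding
```

Routing every caller through one function like this is what keeps UI actions and background jobs from drifting apart on which fields they touch.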

Rationale

Keeps status validation consistent across UI and background jobs.
Timestamp fields provide direct auditability for “when did we triage/close”.
Clearing terminal fields prevents inconsistent states (e.g., status=reopened with resolved_at still set).

Alternatives Considered

Implementing all transitions as ad-hoc model mutations across multiple resources: rejected (harder to test and easy to drift).

3. SLA Policy Storage (SettingsRegistry)

Decision

Add a workspace-resolvable setting:

domain = findings
key = sla_days
type = json
systemDefault:
- critical: 3
- high: 7
- medium: 14
- low: 30

Expose it via Workspace Settings UI as a JSON textarea, following the existing drift.severity_mapping pattern.
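
Because the value is edited as free-form JSON, it helps to validate it before use. The sketch below assumes a workspace override is merged over the system default and that unknown severities or non-positive day counts are rejected; both behaviors and the function name are assumptions, not confirmed by the existing SettingsResolver.

```python
import json
from typing import Dict, Optional

# System default taken from the setting definition above.
SYSTEM_DEFAULT: Dict[str, int] = {"critical": 3, "high": 7, "medium": 14, "low": 30}


def resolve_sla_days(raw_json: Optional[str]) -> Dict[str, int]:
    """Parse a workspace override (JSON text) and merge it over the system default."""
    if raw_json is None or not raw_json.strip():
        return dict(SYSTEM_DEFAULT)

    data = json.loads(raw_json)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("sla_days must be a JSON object keyed by severity")

    resolved = dict(SYSTEM_DEFAULT)
    for severity, days in data.items():
        if severity not in SYSTEM_DEFAULT:
            raise ValueError(f"unknown severity: {severity}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(days, bool) or not isinstance(days, int) or days < 1:
            raise ValueError(f"sla_days for {severity} must be a positive integer")
        resolved[severity] = days
    return resolved
```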

Rationale

Existing code already uses SettingsResolver + SettingsRegistry for drift severity mapping.
Keeps SLA policy queryable and adjustable without code deploys.

Alternatives Considered

Hardcoding SLA days in code/config: rejected (non-configurable and harder to tune per workspace).

4. due_at Computation Semantics

Decision

On create (new finding): set sla_days (resolved from settings) and set due_at = first_seen_at + sla_days.
On reopen (manual or automatic): reset due_at = now + sla_days(current severity) and update sla_days to the current policy value.
Severity changes while a finding remains open do not retroactively change due_at unless the finding is reopened (matches spec assumptions).
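
As a worked illustration of the two rules above, the following sketch computes sla_days and due_at on create and on reopen. The function names and the policy dictionary shape are hypothetical; the arithmetic follows the decision as written.

```python
from datetime import datetime, timedelta
from typing import Dict, Tuple


def due_on_create(first_seen_at: datetime, severity: str,
                  policy: Dict[str, int]) -> Tuple[int, datetime]:
    """New finding: due_at = first_seen_at + sla_days."""
    sla_days = policy[severity]
    return sla_days, first_seen_at + timedelta(days=sla_days)


def due_on_reopen(now: datetime, current_severity: str,
                  policy: Dict[str, int]) -> Tuple[int, datetime]:
    """Reopen: due_at = now + sla_days for the current severity and policy."""
    sla_days = policy[current_severity]
    return sla_days, now + timedelta(days=sla_days)
```

For example, a high finding first seen on 2026-02-01 under the default policy is due on 2026-02-08. If it is later reopened on 2026-03-01 as critical, the deadline resets to 2026-03-04. No function exists for severity changes on an open finding, because those leave due_at untouched.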

Rationale

Allows stable deadlines during remediation while still resetting on recurrence/reopen.
Reduces surprising “deadline moved” behavior during open triage.

Alternatives Considered

Recomputing due_at on every detection run for open findings: rejected (deadlines drift and become hard to reason about).

5. SLA Due Alert Event (Tenant-Level Summary)

Decision

Implement an SLA due producer in EvaluateAlertsJob that emits one event per tenant when a tenant has newly-overdue open findings in the evaluation window:

Eligibility for producing an event (per tenant):
- due_at <= now()
- due_at > windowStart (newly overdue since last evaluation)
- status IN (new, triaged, in_progress, reopened)
The event summarizes current overdue counts for that tenant (not only the newly overdue findings), so the alert body reflects the real state at emission time.

Event fields:

event_type = sla_due
fingerprint_key = sla_due:tenant:{tenant_id}
severity = max severity among overdue open findings (critical if any critical overdue exists)
metadata contains counts only (no per-finding payloads)
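
The eligibility check and the event payload can be sketched together. This is an in-memory illustration, not the EvaluateAlertsJob code: findings are plain dictionaries, and the metadata key names (overdue_total, overdue_by_severity) are assumptions, since the document only says metadata contains counts.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List, Optional

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def build_sla_due_event(tenant_id: int, findings: List[Dict],
                        window_start: datetime, now: datetime) -> Optional[Dict]:
    """Return one sla_due event for the tenant, or None if nothing newly crossed due."""
    overdue = [
        f for f in findings
        if f["status"] in OPEN_STATUSES
        and f["due_at"] is not None
        and f["due_at"] <= now
    ]
    # Emit only when at least one finding became overdue inside the window.
    if not any(f["due_at"] > window_start for f in overdue):
        return None

    # The summary covers all currently overdue open findings, not just new ones.
    counts = Counter(f["severity"] for f in overdue)
    top_severity = max(counts, key=lambda s: SEVERITY_RANK[s])
    return {
        "event_type": "sla_due",
        "fingerprint_key": f"sla_due:tenant:{tenant_id}",
        "severity": top_severity,
        "metadata": {
            "overdue_total": len(overdue),          # assumed key name
            "overdue_by_severity": dict(counts),    # assumed key name
        },
    }
```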

Rationale

Avoids creating suppressed alert_deliveries every minute for persistently overdue tenants (no prune job exists for alert_deliveries today).
Aligns “due” semantics with the due moment: a tenant produces an event when something crosses due.

Alternatives Considered

Emitting an SLA due event on every evaluation run when overdue exists: rejected due to alert_deliveries table growth and suppressed-delivery noise.
Tracking last-emitted state per tenant in a new table: rejected for v1 (adds schema and state complexity).

6. Drift Recurrence: Stable recurrence_key + Canonical Row

Decision

Add recurrence_key (64-char hex) to findings and treat it as the stable identity for drift recurrence. For drift findings:

Compute recurrence_key = sha256("drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}")
Upsert drift findings by (tenant_id, recurrence_key)
Set drift finding fingerprint = recurrence_key for canonical drift rows going forward

dimension is stable and derived from evidence kind and change type:

Policy snapshot drift: policy_snapshot:{change_type}
Assignments drift: policy_assignments
Scope tags drift: policy_scope_tags
Baseline compare drift: baseline_compare:{change_type}
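
The key derivation above can be sketched with a standard SHA-256 hex digest, which yields the 64-character value. The input template and the four dimension formats come from this section; the function names, the evidence-kind identifiers passed to drift_dimension, and UTF-8 as the encoding are assumptions.

```python
import hashlib
from typing import Optional


def drift_dimension(evidence_kind: str, change_type: Optional[str] = None) -> str:
    """Map an evidence kind (hypothetical identifiers) to its stable dimension."""
    if evidence_kind in ("policy_snapshot", "baseline_compare"):
        if not change_type:
            raise ValueError(f"{evidence_kind} requires a change_type")
        return f"{evidence_kind}:{change_type}"
    if evidence_kind == "assignments":
        return "policy_assignments"
    if evidence_kind == "scope_tags":
        return "policy_scope_tags"
    raise ValueError(f"unknown evidence kind: {evidence_kind}")


def recurrence_key(tenant_id: int, scope_key: str, subject_type: str,
                   subject_external_id: str, dimension: str) -> str:
    """sha256 over the stable identity fields, returned as 64 hex characters."""
    raw = f"drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Baseline and current hashes are deliberately not part of the input, so a re-drift of the same subject and dimension resolves to the same key.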

Rationale

Prevents “new row per re-drift” even when baseline/current hashes change.
Avoids conflicts with legacy drift fingerprints during consolidation because new canonical drift fingerprints are stable and distinct.

Alternatives Considered

Keeping drift fingerprint as baseline/current hash-based and updating it on the canonical row: rejected because it can collide with existing legacy rows (unique (tenant_id, fingerprint) constraint).

7. Drift Stale Auto-Resolve

Decision

When generating drift findings for a scope/run, auto-resolve drift findings that were previously open for that scope but are not detected in the latest run:

Filter: finding_type=drift, scope_key=..., status IN (new, triaged, in_progress, reopened)
Not present in the latest run's recurrence_key set
Resolve reason: no_longer_detected
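
A minimal in-memory sketch of this filter follows. Findings are modelled as dictionaries and the function name is hypothetical; a real implementation would presumably express the same predicate as a database query and route the resolve through the workflow service.

```python
from datetime import datetime
from typing import Dict, List, Set

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


def auto_resolve_stale(findings: List[Dict], scope_key: str,
                       seen_keys: Set[str], now: datetime) -> List[Dict]:
    """Resolve open drift findings in the scope that the latest run did not detect."""
    resolved = []
    for finding in findings:
        if finding["finding_type"] != "drift" or finding["scope_key"] != scope_key:
            continue
        if finding["status"] not in OPEN_STATUSES:
            continue
        if finding["recurrence_key"] in seen_keys:
            continue  # still detected, leave it open
        finding["status"] = "resolved"
        finding["resolved_at"] = now
        finding["resolved_reason"] = "no_longer_detected"
        resolved.append(finding)
    return resolved
```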

Rationale

Keeps “Open” findings aligned with current observed state.
Matches existing generator patterns (permission posture / Entra roles resolve stale records).

Alternatives Considered

Leaving stale findings open indefinitely: rejected (increases noise and breaks trust in “Open” list).

8. Backfill + Consolidation (OperationRun-Backed)

Decision

Implement a tenant-scoped backfill/consolidation operation backed by OperationRun:

Maps acknowledged → triaged
Populates lifecycle fields (first_seen_at, last_seen_at, times_seen, due_at, sla_days, timestamps)
Computes recurrence_key for drift and consolidates duplicates so only one canonical open finding remains per (tenant_id, recurrence_key)
Due dates for legacy open findings: due_at = backfill_started_at + sla_days (prevents immediate overdue surge)

Duplicates strategy:

Choose one canonical row per (tenant_id, recurrence_key) (prefer open, else most recently seen)
Non-canonical duplicates become terminal (resolved with resolved_reason=consolidated_duplicate) and have recurrence_key cleared to keep canonical uniqueness simple
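
The duplicates strategy can be sketched as a group-and-pick pass. Several details here are assumptions made for illustration: rows are dictionaries, ties between multiple open rows are broken by last_seen_at, and resolved_at is filled only when a duplicate did not already have one.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


def consolidate(rows: List[Dict], now: datetime) -> List[Dict]:
    """Keep one canonical row per (tenant_id, recurrence_key); retire the rest."""
    groups = defaultdict(list)
    for row in rows:
        if row["recurrence_key"]:
            groups[(row["tenant_id"], row["recurrence_key"])].append(row)

    canonical_rows = []
    for group in groups.values():
        # Prefer open rows; among equals, prefer the most recently seen.
        canonical = max(
            group,
            key=lambda r: (r["status"] in OPEN_STATUSES, r["last_seen_at"]),
        )
        canonical_rows.append(canonical)
        for row in group:
            if row is canonical:
                continue
            row["status"] = "resolved"
            row["resolved_reason"] = "consolidated_duplicate"
            if row.get("resolved_at") is None:
                row["resolved_at"] = now
            # Clearing the key keeps (tenant_id, recurrence_key) unique.
            row["recurrence_key"] = None
    return canonical_rows
```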

Rationale

Meets OPS-UX requirements (queued toast, progress surfaces, initiator-only terminal notification).
Makes legacy data usable without requiring manual cleanup.

Alternatives Considered

Deleting duplicate rows: rejected because the spec explicitly allows legacy rows to remain (and deletions are harder to justify operationally).

9. Capabilities + RBAC Enforcement

Decision

Add tenant-context capabilities:

TENANT_FINDINGS_VIEW
TENANT_FINDINGS_TRIAGE
TENANT_FINDINGS_ASSIGN
TENANT_FINDINGS_RESOLVE
TENANT_FINDINGS_CLOSE
TENANT_FINDINGS_RISK_ACCEPT

Keep TENANT_FINDINGS_ACKNOWLEDGE as a deprecated alias for v2 triage permission:

UI enforcement and server-side policy checks treat ACKNOWLEDGE as sufficient for triage during the migration window.
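
One way to express the alias is a small lookup consulted by both UI and policy checks. The capability names are taken from the list above; using the names as the string values, the alias table, and the helper function are assumptions for this sketch.

```python
from typing import Dict, Set

TENANT_FINDINGS_TRIAGE = "TENANT_FINDINGS_TRIAGE"
TENANT_FINDINGS_ACKNOWLEDGE = "TENANT_FINDINGS_ACKNOWLEDGE"  # deprecated

# Required capability -> deprecated capabilities that also satisfy it.
DEPRECATED_ALIASES: Dict[str, Set[str]] = {
    TENANT_FINDINGS_TRIAGE: {TENANT_FINDINGS_ACKNOWLEDGE},
}


def has_capability(granted: Set[str], required: str) -> bool:
    """True if the required capability, or a deprecated alias of it, is granted."""
    if required in granted:
        return True
    return bool(DEPRECATED_ALIASES.get(required, set()) & granted)
```

The alias only widens the triage check. Holding ACKNOWLEDGE does not imply resolve, close, or risk-accept rights, and removing the table entry ends the migration window.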

Rationale

Aligns with RBAC-UX constitution requirements (registry-only strings, 404/403 semantics).
Allows incremental rollout without breaking existing role mappings.

Alternatives Considered

Forcing all tenants to update role mappings at deploy time: rejected (operationally brittle).
