Implements spec 111 (Findings workflow + SLA) and fixes Workspace findings SLA settings UX/validation. Key changes: - Findings workflow service + SLA policy and alerting. - Workspace settings: allow partial SLA overrides without auto-filling unset severities in the UI; effective values still resolve via defaults. - New migrations, jobs, command, UI/resource updates, and comprehensive test coverage. Tests: - `vendor/bin/sail artisan test --compact` (1779 passed, 8 skipped). Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de> Reviewed-on: #135
# Research: 111 — Findings Workflow V2 + SLA

**Date**: 2026-02-24
**Branch**: `111-findings-workflow-sla`

---
## 1. Status Model + Legacy Mapping

### Decision

Keep `findings.status` as a string column and expand allowed v2 values. Preserve legacy `acknowledged` rows for compatibility, but treat `acknowledged` as `triaged` in the v2 workflow surface and migrate it during backfill.
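The read-side mapping can be pictured with a minimal sketch. It is written in Python for brevity and is not the project's implementation; `LEGACY_STATUS_MAP` and `workflow_status` are hypothetical names used only for illustration.

```python
# Hypothetical names; only the acknowledged -> triaged mapping comes from the decision above.
LEGACY_STATUS_MAP = {"acknowledged": "triaged"}


def workflow_status(stored_status: str) -> str:
    """Return the status the v2 workflow surface should treat a stored value as."""
    # Legacy rows stay untouched in storage until backfill; only the view changes.
    return LEGACY_STATUS_MAP.get(stored_status, stored_status)


print(workflow_status("acknowledged"))
print(workflow_status("new"))
```

Keeping the mapping in one lookup means the backfill can later rewrite stored rows without any caller changing.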
### Rationale

- Existing findings already use `new|acknowledged|resolved` with `acknowledged_at/by` fields.
- Mapping `acknowledged → triaged` preserves intent while enabling the new workflow.
- Avoids high-risk data migrations that try to rewrite history beyond what the spec requires.

### Alternatives Considered

- Dropping the legacy `acknowledged` status immediately and forcing a hard migration in one deploy: rejected due to rollout risk.

---
## 2. Workflow Enforcement + Timestamps

### Decision

Enforce transitions server-side via a dedicated workflow service (single entrypoint used by Filament actions and any future API surfaces). Update timestamps on state changes:

- `triaged_at` set on `new|reopened → triaged`
- `in_progress_at` set on `triaged → in_progress`
- `resolved_at` + `resolved_reason` set on resolve
- `closed_at` + `closed_reason` + `closed_by_user_id` set on close and risk accept
- `reopened_at` set on reopen

When reopening, clear terminal state fields relevant to the previous terminal status (e.g., clear `resolved_at/reason` when moving to `reopened`).
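A minimal sketch of a single-entrypoint workflow, assuming an in-memory `Finding` record, might look like the following. The class and function names are hypothetical. The sketch also assumes that any open status may be resolved and omits close and risk accept for brevity; the source specifies only the timestamp rules listed above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


class InvalidTransition(Exception):
    """Raised when a status change is not allowed from the current status."""


@dataclass
class Finding:
    status: str = "new"
    triaged_at: Optional[datetime] = None
    in_progress_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    resolved_reason: Optional[str] = None
    reopened_at: Optional[datetime] = None


def _require(finding: Finding, allowed: set, target: str) -> None:
    if finding.status not in allowed:
        raise InvalidTransition(f"{finding.status} -> {target}")


def triage(finding: Finding, now: datetime) -> None:
    _require(finding, {"new", "reopened"}, "triaged")
    finding.status, finding.triaged_at = "triaged", now


def start_progress(finding: Finding, now: datetime) -> None:
    _require(finding, {"triaged"}, "in_progress")
    finding.status, finding.in_progress_at = "in_progress", now


def resolve(finding: Finding, reason: str, now: datetime) -> None:
    # Assumption: any open status may be resolved.
    _require(finding, OPEN_STATUSES, "resolved")
    finding.status = "resolved"
    finding.resolved_at, finding.resolved_reason = now, reason


def reopen(finding: Finding, now: datetime) -> None:
    # Only the resolved -> reopened case is sketched here.
    _require(finding, {"resolved"}, "reopened")
    finding.status, finding.reopened_at = "reopened", now
    # Clear the terminal fields so status and timestamps stay consistent.
    finding.resolved_at = None
    finding.resolved_reason = None
```

Because every caller goes through these functions, an invalid jump such as `reopened → in_progress` fails in one place instead of being re-validated per UI action.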
### Rationale

- Keeps status validation consistent across UI and background jobs.
- Timestamp fields provide direct auditability for “when did we triage/close”.
- Clearing terminal fields prevents inconsistent states (e.g., `status=reopened` with `resolved_at` still set).

### Alternatives Considered

- Implementing all transitions as ad-hoc model mutations across multiple resources: rejected (harder to test and easy to drift).

---
## 3. SLA Policy Storage (SettingsRegistry)

### Decision

Add a workspace-resolvable setting:

- `domain = findings`
- `key = sla_days`
- `type = json`
- `systemDefault`:
  - critical: 3
  - high: 7
  - medium: 14
  - low: 30

Expose it via Workspace Settings UI as a JSON textarea, following the existing `drift.severity_mapping` pattern.
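To illustrate how a partial workspace override could resolve against the system defaults, here is a small sketch. The function name and the validation rule (positive integers only, known severities only) are assumptions for illustration and do not describe the actual `SettingsResolver` behaviour.

```python
from typing import Optional

# System defaults as listed in the decision above.
SYSTEM_DEFAULT_SLA_DAYS = {"critical": 3, "high": 7, "medium": 14, "low": 30}


def effective_sla_days(workspace_override: Optional[dict]) -> dict:
    """Merge a (possibly partial) workspace override over the system defaults."""
    merged = dict(SYSTEM_DEFAULT_SLA_DAYS)
    for severity, days in (workspace_override or {}).items():
        if severity not in SYSTEM_DEFAULT_SLA_DAYS:
            raise ValueError(f"unknown severity: {severity}")
        # Assumed rule: days must be a positive integer (bool is rejected explicitly).
        if isinstance(days, bool) or not isinstance(days, int) or days < 1:
            raise ValueError(f"invalid SLA days for {severity}: {days!r}")
        merged[severity] = days
    return merged


print(effective_sla_days({"high": 5}))
```

With an override of `{"high": 5}`, only `high` changes and the other three severities keep their defaults, so the stored JSON never needs to list unset severities.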
### Rationale

- Existing code already uses `SettingsResolver` + `SettingsRegistry` for drift severity mapping.
- Keeps SLA policy queryable and adjustable without code deploys.

### Alternatives Considered

- Hardcoding SLA days in code/config: rejected (non-configurable and harder to tune per workspace).

---
## 4. due_at Computation Semantics

### Decision

- On create (new finding): set `sla_days` (resolved from settings) and set `due_at = first_seen_at + sla_days`.
- On reopen (manual or automatic): reset `due_at = now + sla_days(current severity)` and update `sla_days` to the current policy value.
- Severity changes while a finding remains open do not retroactively change `due_at` unless the finding is reopened (matches spec assumptions).
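The two computations above can be worked through with concrete dates. This is a sketch with hypothetical function names; it treats `sla_days` as whole calendar days, which the source implies but does not state.

```python
from datetime import datetime, timedelta


def due_at_on_create(first_seen_at: datetime, sla_days: int) -> datetime:
    """Deadline for a new finding: anchored to when it was first seen."""
    return first_seen_at + timedelta(days=sla_days)


def due_at_on_reopen(now: datetime, sla_days: int) -> datetime:
    """Deadline after a reopen: anchored to the reopen moment, using the current policy."""
    return now + timedelta(days=sla_days)


first_seen = datetime(2026, 2, 24, 9, 0)
print(due_at_on_create(first_seen, 7))                       # high severity, default 7 days
print(due_at_on_reopen(datetime(2026, 3, 20, 9, 0), 7))      # same finding reopened later
```

A high-severity finding first seen on 2026-02-24 is therefore due on 2026-03-03, and reopening it on 2026-03-20 moves the deadline to 2026-03-27.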
### Rationale

- Allows stable deadlines during remediation while still resetting on recurrence/reopen.
- Reduces surprising “deadline moved” behavior during open triage.

### Alternatives Considered

- Recomputing `due_at` on every detection run for open findings: rejected (deadlines drift and become hard to reason about).

---
## 5. SLA Due Alert Event (Tenant-Level Summary)

### Decision

Implement an SLA due producer in `EvaluateAlertsJob` that emits **one event per tenant** when a tenant has **newly-overdue** open findings in the evaluation window:

- Eligibility for producing an event (per tenant):
  - `due_at <= now()`
  - `due_at > windowStart` (newly overdue since last evaluation)
  - `status IN (new, triaged, in_progress, reopened)`
- Event summarizes **current** overdue counts for that tenant (not just newly overdue), so the alert body reflects the real state at emission time.

Event fields:

- `event_type = sla_due`
- `fingerprint_key = sla_due:tenant:{tenant_id}`
- `severity = max severity among overdue open findings` (critical if any critical overdue exists)
- `metadata` contains counts only (no per-finding payloads)
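The eligibility and summary rules can be sketched for a single tenant as follows. The function name, the dict-shaped findings, and the metadata key names (`overdue_total`, `overdue_by_severity`) are assumptions for illustration; the source only states that metadata carries counts.

```python
from datetime import datetime
from typing import Optional

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def sla_due_event(tenant_id: int, findings: list, window_start: datetime,
                  now: datetime) -> Optional[dict]:
    """Return one summary event for the tenant, or None if nothing newly crossed due."""
    overdue = [
        f for f in findings
        if f["status"] in OPEN_STATUSES
        and f["due_at"] is not None
        and f["due_at"] <= now
    ]
    # The trigger is a finding that became overdue inside the evaluation window.
    newly_overdue = [f for f in overdue if f["due_at"] > window_start]
    if not newly_overdue:
        return None

    # The body summarizes every currently overdue open finding, not only the new ones.
    counts: dict = {}
    for f in overdue:
        counts[f["severity"]] = counts.get(f["severity"], 0) + 1

    return {
        "event_type": "sla_due",
        "fingerprint_key": f"sla_due:tenant:{tenant_id}",
        "severity": max(counts, key=SEVERITY_RANK.__getitem__),
        "metadata": {"overdue_total": len(overdue), "overdue_by_severity": counts},
    }
```

A tenant whose findings have all been overdue since before `window_start` yields `None`, which is what keeps persistently overdue tenants from producing an event on every run.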
### Rationale

- Avoids creating suppressed `alert_deliveries` every minute for persistently overdue tenants (no prune job exists for `alert_deliveries` today).
- Aligns “due” semantics with the due moment: a tenant produces an event when something crosses due.

### Alternatives Considered

- Emitting an SLA due event on every evaluation run when overdue exists: rejected due to `alert_deliveries` table growth and suppressed-delivery noise.
- Tracking last-emitted state per tenant in a new table: rejected for v1 (adds schema and state complexity).

---
## 6. Drift Recurrence: Stable recurrence_key + Canonical Row

### Decision

Add `recurrence_key` (64-char hex) to `findings` and treat it as the stable identity for drift recurrence. For drift findings:

- Compute `recurrence_key = sha256("drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}")`
- Upsert drift findings by `(tenant_id, recurrence_key)`
- Set drift finding `fingerprint = recurrence_key` for canonical drift rows going forward

`dimension` is stable and derived from evidence kind and change type:

- Policy snapshot drift: `policy_snapshot:{change_type}`
- Assignments drift: `policy_assignments`
- Scope tags drift: `policy_scope_tags`
- Baseline compare drift: `baseline_compare:{change_type}`
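The key derivation is simple enough to show directly. The sketch below uses Python's `hashlib`; the function name and the sample argument values are hypothetical, while the input string layout follows the formula above.

```python
import hashlib


def drift_recurrence_key(tenant_id: int, scope_key: str, subject_type: str,
                         subject_external_id: str, dimension: str) -> str:
    """Return the 64-character hex SHA-256 of the stable drift identity string."""
    raw = f"drift:{tenant_id}:{scope_key}:{subject_type}:{subject_external_id}:{dimension}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Sample values are made up for illustration only.
key = drift_recurrence_key(42, "scope-a", "policy", "ext-123", "policy_assignments")
print(len(key))
```

Because baseline and current hashes are not part of the input, a subject that drifts again in the same dimension maps to the same key and therefore the same row.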
### Rationale

- Prevents “new row per re-drift” even when baseline/current hashes change.
- Avoids conflicts with legacy drift fingerprints during consolidation because new canonical drift fingerprints are stable and distinct.

### Alternatives Considered

- Keeping drift fingerprint as baseline/current hash-based and updating it on the canonical row: rejected because it can collide with existing legacy rows (unique `(tenant_id, fingerprint)` constraint).

---
## 7. Drift Stale Auto-Resolve

### Decision

When generating drift findings for a scope/run, auto-resolve drift findings that were previously open for that scope but are not detected in the latest run:

- Filter: `finding_type=drift`, `scope_key=...`, `status IN (new, triaged, in_progress, reopened)`
- Not seen in run’s `recurrence_key` set
- Resolve reason: `no_longer_detected`
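As an in-memory illustration of the filter above, the following sketch resolves open drift findings whose key was not seen in the latest run. The function name and the dict-shaped findings are assumptions; a real implementation would presumably express the same filter as a database query.

```python
from datetime import datetime

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


def resolve_stale_drift(findings: list, scope_key: str, seen_keys: set,
                        now: datetime) -> list:
    """Resolve open drift findings in the scope that the latest run did not detect."""
    resolved = []
    for f in findings:
        if f["finding_type"] != "drift" or f["scope_key"] != scope_key:
            continue
        if f["status"] not in OPEN_STATUSES:
            continue
        if f["recurrence_key"] in seen_keys:
            continue  # still detected, so it stays open
        f["status"] = "resolved"
        f["resolved_reason"] = "no_longer_detected"
        f["resolved_at"] = now
        resolved.append(f)
    return resolved
```

Findings from other scopes are deliberately skipped, since a run for one scope says nothing about what is still present elsewhere.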
### Rationale

- Keeps “Open” findings aligned with current observed state.
- Matches existing generator patterns (permission posture / Entra roles resolve stale records).

### Alternatives Considered

- Leaving stale findings open indefinitely: rejected (increases noise and breaks trust in “Open” list).

---
## 8. Backfill + Consolidation (OperationRun-Backed)

### Decision

Implement a tenant-scoped backfill/consolidation operation backed by `OperationRun`:

- Maps `acknowledged → triaged`
- Populates lifecycle fields (`first_seen_at`, `last_seen_at`, `times_seen`, `due_at`, `sla_days`, timestamps)
- Computes `recurrence_key` for drift and consolidates duplicates so only one canonical open finding remains per `(tenant_id, recurrence_key)`
- Due dates for legacy open findings: `due_at = backfill_started_at + sla_days` (prevents immediate overdue surge)

Duplicates strategy:

- Choose one canonical row per `(tenant_id, recurrence_key)` (prefer open, else most recently seen)
- Non-canonical duplicates become terminal (`resolved` with `resolved_reason=consolidated_duplicate`) and have `recurrence_key` cleared to keep canonical uniqueness simple
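The duplicates strategy can be sketched as a grouping pass. The function name and row shape are hypothetical, and one detail is an assumption not stated in the source: when several open rows share a key, the most recently seen open row is chosen as canonical.

```python
from collections import defaultdict
from datetime import datetime

OPEN_STATUSES = {"new", "triaged", "in_progress", "reopened"}


def consolidate_duplicates(rows: list) -> list:
    """Mutate rows in place and return the canonical row of each group."""
    groups = defaultdict(list)
    for row in rows:
        if row["recurrence_key"] is not None:
            groups[(row["tenant_id"], row["recurrence_key"])].append(row)

    canonical_rows = []
    for group in groups.values():
        # Prefer open rows; among equals, prefer the most recently seen (assumed tie-break).
        canonical = max(
            group,
            key=lambda r: (r["status"] in OPEN_STATUSES, r["last_seen_at"]),
        )
        canonical_rows.append(canonical)
        for row in group:
            if row is not canonical:
                row["status"] = "resolved"
                row["resolved_reason"] = "consolidated_duplicate"
                row["recurrence_key"] = None  # keeps (tenant_id, recurrence_key) unique
    return canonical_rows
```

Clearing `recurrence_key` on the losers means a plain uniqueness rule on `(tenant_id, recurrence_key)` is enough afterwards, with no need to special-case terminal rows.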
### Rationale

- Meets OPS-UX requirements (queued toast, progress surfaces, initiator-only terminal notification).
- Makes legacy data usable without requiring manual cleanup.

### Alternatives Considered

- Deleting duplicate rows: rejected because the spec explicitly allows legacy rows to remain (and deletions are harder to justify operationally).

---
## 9. Capabilities + RBAC Enforcement

### Decision

Add tenant-context capabilities:

- `TENANT_FINDINGS_VIEW`
- `TENANT_FINDINGS_TRIAGE`
- `TENANT_FINDINGS_ASSIGN`
- `TENANT_FINDINGS_RESOLVE`
- `TENANT_FINDINGS_CLOSE`
- `TENANT_FINDINGS_RISK_ACCEPT`

Keep `TENANT_FINDINGS_ACKNOWLEDGE` as a deprecated alias for v2 triage permission:

- UI enforcement and server-side policy checks treat `ACKNOWLEDGE` as sufficient for triage during the migration window.
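One way to picture the alias rule is a lookup from each capability to the deprecated names that also satisfy it. The `DEPRECATED_ALIASES` table and `has_capability` helper are hypothetical; only the capability strings come from the list above.

```python
# Maps a v2 capability to deprecated capabilities that are accepted in its place.
DEPRECATED_ALIASES = {
    "TENANT_FINDINGS_TRIAGE": {"TENANT_FINDINGS_ACKNOWLEDGE"},
}


def has_capability(granted: set, required: str) -> bool:
    """True if the role grants the capability itself or an accepted deprecated alias."""
    accepted = {required} | DEPRECATED_ALIASES.get(required, set())
    return bool(granted & accepted)


legacy_role = {"TENANT_FINDINGS_VIEW", "TENANT_FINDINGS_ACKNOWLEDGE"}
print(has_capability(legacy_role, "TENANT_FINDINGS_TRIAGE"))
print(has_capability(legacy_role, "TENANT_FINDINGS_CLOSE"))
```

The alias applies in one direction only: a role holding `ACKNOWLEDGE` may triage, but it gains none of the other new capabilities. Removing the table entry ends the migration window.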
### Rationale

- Aligns with RBAC-UX constitution requirements (registry-only strings, 404/403 semantics).
- Allows incremental rollout without breaking existing role mappings.

### Alternatives Considered

- Forcing all tenants to update role mappings at deploy time: rejected (operationally brittle).