TenantAtlas/specs/115-baseline-operability-alerts/research.md
2026-03-01 03:23:39 +01:00

# Research — Baseline Operability & Alert Integration (Spec 115)
This document resolves planning unknowns and records implementation decisions.
## Decisions
### 1) Completeness counters for safe auto-close
- Decision: Treat the compare “completeness counters” as the `total`, `processed`, and `failed` keys of `OperationRun.summary_counts`.
- Rationale: Ops-UX contracts already standardize these keys via `OperationSummaryKeys::all()`; they're the metrics the UI understands for determinate progress.
- Alternatives considered:
  - Add new keys like `total_count` / `processed_count` / `failed_item_count` → rejected because it would require expanding `OperationSummaryKeys::all()` and updating Ops-UX guard tests without a strong benefit.
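
A minimal sketch of how the three counters could gate auto-close. It is written in Python purely as a language-neutral illustration (the class names cited in this document suggest a PHP codebase). The function name `is_compare_complete` and the exact rule (`processed == total` with zero failures) are assumptions for illustration, not something this document specifies.

```python
from typing import Mapping, Optional


def is_compare_complete(summary_counts: Mapping[str, int]) -> bool:
    """Return True only when every expected item was processed without failures.

    Assumed rule: processed == total and failed == 0. The document only names
    the counters; the precise gate is an illustration.
    """
    total: Optional[int] = summary_counts.get("total")
    processed: Optional[int] = summary_counts.get("processed")
    failed: Optional[int] = summary_counts.get("failed")

    # A missing counter means completeness is unknown, so refuse to auto-close.
    if total is None or processed is None or failed is None:
        return False

    return processed == total and failed == 0
```

Treating a missing counter as "incomplete" keeps the gate conservative: auto-close is skipped whenever the run cannot prove it evaluated everything.
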
### 2) Where auto-close runs
- Decision: Perform auto-close at the end of `CompareBaselineToTenantJob` (after findings upsert), using the run's computed “seen” fingerprint set.
- Rationale: The job already has the full drift result set for the tenant+profile; it's the only place that can reliably know what was evaluated.
- Alternatives considered:
  - Separate queued job for auto-close → rejected (extra run coordination and more complex observability for no benefit).
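
The auto-close step can be sketched as a set difference between open findings and the fingerprints seen in the current run. This is a Python illustration under stated assumptions: the `Finding` shape, the `OPEN_STATUSES` set, and the function name `auto_close_stale` are hypothetical, and the real job would operate on persisted records.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Assumption: these are the statuses considered "open"; the real set may differ.
OPEN_STATUSES = {"new", "reopened", "triaged", "in_progress"}


@dataclass
class Finding:
    fingerprint: str
    status: str


def auto_close_stale(findings: Iterable[Finding], seen: Set[str]) -> List[Finding]:
    """Resolve open findings whose fingerprint was not observed in this run."""
    closed: List[Finding] = []
    for finding in findings:
        if finding.status in OPEN_STATUSES and finding.fingerprint not in seen:
            finding.status = "resolved"
            closed.append(finding)
    return closed
```

Because the seen set is built during the same run that upserts findings, no cross-job hand-off is needed to know which fingerprints are stale.
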
### 3) Baseline finding lifecycle semantics (new vs reopened vs existing open)
- Decision: Mirror the existing drift lifecycle behavior (as implemented in `DriftFindingGenerator`):
  - New fingerprint → `status = new`.
  - Previously terminal fingerprint (at least `resolved`) observed again → `status = reopened` and set `reopened_at`.
  - Existing open finding → do not overwrite workflow status (avoid resetting `triaged`/`in_progress`).
- Rationale: This preserves operator workflow state and enables “alert only on new/reopened” logic.
- Alternatives considered:
  - Always set `status = new` on every compare (current behavior) → rejected because it can overwrite workflow state.
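
The three lifecycle rules above can be sketched as a single decision function. This is a Python illustration, not the `DriftFindingGenerator` implementation; the name `next_status` and the `TERMINAL_STATUSES` set (limited here to `resolved`, the only terminal status this document names) are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

# Assumption: only "resolved" is listed; other terminal statuses may exist.
TERMINAL_STATUSES = {"resolved"}


def next_status(
    existing_status: Optional[str], now: datetime
) -> Tuple[str, Optional[datetime]]:
    """Return (status, reopened_at) for a fingerprint observed in this compare.

    reopened_at is None when the field should be left untouched.
    """
    if existing_status is None:
        # No finding exists for this fingerprint yet.
        return "new", None
    if existing_status in TERMINAL_STATUSES:
        # A previously closed finding was observed again.
        return "reopened", now
    # Existing open finding: keep the operator's workflow status as-is.
    return existing_status, None


now = datetime(2026, 3, 1, tzinfo=timezone.utc)
print(next_status(None, now)[0])
print(next_status("resolved", now)[0])
print(next_status("triaged", now)[0])
```
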
### 4) Alert deduplication key for baseline drift
- Decision: Set `fingerprint_key` to a stable string derived from the finding fingerprint (e.g. `finding_fingerprint:{fingerprint}`) for baseline drift events.
- Rationale: Alert delivery dedupe uses `fingerprint_key` (or `idempotency_key`) via `AlertFingerprintService`.
- Alternatives considered:
  - Use `finding:{id}` → rejected because it ties dedupe to a DB surrogate rather than the domain fingerprint.
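
A small Python sketch of why the fingerprint-based key dedupes better than an ID-based one. The helper names and the sample records are hypothetical; only the `finding_fingerprint:{fingerprint}` and `finding:{id}` formats come from this document.

```python
def baseline_fingerprint_key(fingerprint: str) -> str:
    """Build the dedupe key from the domain fingerprint."""
    return f"finding_fingerprint:{fingerprint}"


def id_based_key(finding_id: int) -> str:
    """The rejected alternative: key derived from the DB surrogate ID."""
    return f"finding:{finding_id}"


# Hypothetical scenario: the same drift is stored under two different row IDs
# (for example, after a finding row is recreated).
first = {"id": 101, "fingerprint": "abc123"}
second = {"id": 202, "fingerprint": "abc123"}

same_by_fingerprint = (
    baseline_fingerprint_key(first["fingerprint"])
    == baseline_fingerprint_key(second["fingerprint"])
)
same_by_id = id_based_key(first["id"]) == id_based_key(second["id"])
```

Here `same_by_fingerprint` is true while `same_by_id` is false, so only the fingerprint-based key would let the dedupe layer recognize both records as the same drift.
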
### 5) Baseline-specific event types
- Decision: Add two new alert event types and produce them in `EvaluateAlertsJob`:
  - `baseline_high_drift`: for baseline compare findings (`source = baseline.compare`) that are `new`/`reopened` in the evaluation window and meet severity threshold.
  - `baseline_compare_failed`: for `OperationRun.type = baseline_compare` with `outcome in {failed, partially_succeeded}` in the evaluation window.
- Rationale: The spec requires strict separation from generic drift alerts and precise triggering rules.
- Alternatives considered:
  - Reuse `high_drift` / `compare_failed` → rejected because it would mix baseline and non-baseline meaning.
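
The two triggering rules can be sketched as filters over findings and runs. This Python illustration makes several assumptions that this document does not settle: the severity ranking in `SEVERITY_ORDER`, the timestamp fields `changed_at` and `completed_at` used to test window membership, and the shape of the emitted event dictionaries.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List

# Assumption: severities rank in this order, lowest first.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


@dataclass
class Finding:
    source: str
    status: str
    severity: str
    fingerprint: str
    changed_at: datetime  # hypothetical: when the status became new/reopened


@dataclass
class OperationRun:
    type: str
    outcome: str
    completed_at: datetime  # hypothetical: when the run finished


def baseline_events(
    findings: Iterable[Finding],
    runs: Iterable[OperationRun],
    window_start: datetime,
    min_severity: str,
) -> List[Dict[str, str]]:
    """Collect baseline alert events for everything inside the window."""
    threshold = SEVERITY_ORDER.index(min_severity)
    events: List[Dict[str, str]] = []

    for f in findings:
        if (
            f.source == "baseline.compare"
            and f.status in {"new", "reopened"}
            and f.changed_at >= window_start
            and SEVERITY_ORDER.index(f.severity) >= threshold
        ):
            events.append({
                "event_type": "baseline_high_drift",
                "fingerprint_key": f"finding_fingerprint:{f.fingerprint}",
            })

    for r in runs:
        if (
            r.type == "baseline_compare"
            and r.outcome in {"failed", "partially_succeeded"}
            and r.completed_at >= window_start
        ):
            events.append({"event_type": "baseline_compare_failed"})

    return events
```

Keeping both filters keyed on `baseline.compare` / `baseline_compare` is what prevents generic drift findings and other run types from leaking into the baseline event types.
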
### 6) Cooldown behavior for `baseline_compare_failed`
- Decision: Reuse the existing per-rule cooldown + quiet-hours suppression implemented in `AlertDispatchService` (no baseline-specific cooldown setting).
- Rationale: Matches spec clarification and existing patterns.
### 7) Workspace settings implementation approach
- Decision: Implement baseline settings using the existing `SettingsRegistry`/`SettingsResolver`/`SettingsWriter` system with new keys under a new `baseline` domain:
  - `baseline.severity_mapping` (json map with restricted keys)
  - `baseline.alert_min_severity` (string)
  - `baseline.auto_close_enabled` (bool)
- Rationale: This matches existing settings infrastructure and ensures consistent “effective value” semantics.
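
A hedged Python sketch of the type checks implied by the three settings. It does not model `SettingsRegistry` or its API. The function name is hypothetical, and the allowed mapping keys and severity values are passed in as parameters because this document does not enumerate them.

```python
from typing import Any, Collection, List, Mapping


def validate_baseline_settings(
    values: Mapping[str, Any],
    allowed_mapping_keys: Collection[str],
    severities: Collection[str],
) -> List[str]:
    """Return a list of validation errors; an empty list means valid.

    Only keys present in `values` are checked, so partial updates are allowed.
    """
    errors: List[str] = []

    if "baseline.severity_mapping" in values:
        mapping = values["baseline.severity_mapping"]
        if not isinstance(mapping, dict):
            errors.append("baseline.severity_mapping must be a map")
        else:
            for key, severity in mapping.items():
                if key not in allowed_mapping_keys:
                    errors.append(f"unknown mapping key: {key}")
                if severity not in severities:
                    errors.append(f"unknown severity for {key}: {severity}")

    if "baseline.alert_min_severity" in values:
        if values["baseline.alert_min_severity"] not in severities:
            errors.append("baseline.alert_min_severity must be a known severity")

    if "baseline.auto_close_enabled" in values:
        if not isinstance(values["baseline.auto_close_enabled"], bool):
            errors.append("baseline.auto_close_enabled must be a bool")

    return errors
```
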
### 8) Information architecture (IA) and planes
- Decision: Keep baseline profile CRUD as workspace-owned (non-tenant scoped) and baseline compare monitoring as tenant-context only.
- Rationale: Matches SCOPE-001 and spec FR-018.
## Notes / Repo Facts Used
- Ops-UX allowed summary keys are defined in `App\Support\OpsUx\OperationSummaryKeys`.
- Drift lifecycle patterns exist in `App\Services\Drift\DriftFindingGenerator` (reopen + resolve stale).
- Alert dispatch dedupe/cooldown/quiet-hours are centralized in `App\Services\Alerts\AlertDispatchService` and `AlertFingerprintService`.
- Workspace settings are handled by `App\Support\Settings\SettingsRegistry` + `SettingsResolver` + `SettingsWriter`.