Research — Baseline Operability & Alert Integration (Spec 115)

This document resolves planning unknowns and records implementation decisions.

Decisions

Decision: Treat the compare “completeness counters” as the total, processed, and failed keys of OperationRun.summary_counts.
Rationale: Ops-UX contracts already standardize these keys via OperationSummaryKeys::all(); they’re the metrics the UI understands for determinate progress.
Alternatives considered:
- Add new keys like total_count / processed_count / failed_item_count → rejected because it would require expanding OperationSummaryKeys::all() and updating Ops-UX guard tests without a strong benefit.
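
As an illustration of why these three keys are enough for determinate progress, here is a minimal sketch in Python (not the project's own code). The function name progress_percent is hypothetical, and the sketch assumes that processed counts every handled item, including failed ones; the source does not state this.

```python
from typing import Mapping, Optional


def progress_percent(summary_counts: Mapping[str, int]) -> Optional[int]:
    """Return whole-percent progress, or None when progress is indeterminate.

    Assumption: "processed" includes items that ended up in "failed".
    """
    total = summary_counts.get("total", 0)
    if total <= 0:
        # Without a positive total the UI cannot show determinate progress.
        return None
    processed = summary_counts.get("processed", 0)
    # Clamp so a miscounted run never reports more than 100%.
    done = max(0, min(processed, total))
    return done * 100 // total


counts = {"total": 40, "processed": 10, "failed": 2}
print(progress_percent(counts))
```

A run that reports no total falls back to indeterminate progress instead of dividing by zero.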

Decision: Perform auto-close at the end of CompareBaselineToTenantJob (after findings upsert), using the run’s computed “seen” fingerprint set.
Rationale: The job already has the full drift result set for the tenant+profile; it’s the only place that can reliably know what was evaluated.
Alternatives considered:
- Separate queued job for auto-close → rejected because it adds run coordination and complicates observability for no benefit.
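
The auto-close step can be sketched as a set difference between the open findings and the fingerprints seen in this run. This is a Python illustration under assumptions, not the job's actual code: the Finding shape, the OPEN_STATUSES set, and the function name auto_close_stale are hypothetical, and the enabled flag stands in for the baseline.auto_close_enabled setting.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Assumed set of non-terminal statuses, taken from the statuses named in this document.
OPEN_STATUSES = {"new", "reopened", "triaged", "in_progress"}


@dataclass
class Finding:
    fingerprint: str
    status: str


def auto_close_stale(
    findings: Iterable[Finding], seen: Set[str], enabled: bool = True
) -> List[Finding]:
    """Resolve open findings whose fingerprint was not observed in this run."""
    if not enabled:
        return []
    closed: List[Finding] = []
    for finding in findings:
        if finding.status in OPEN_STATUSES and finding.fingerprint not in seen:
            finding.status = "resolved"
            closed.append(finding)
    return closed


existing = [Finding("a1", "triaged"), Finding("b2", "new"), Finding("c3", "resolved")]
closed = auto_close_stale(existing, seen={"a1"})
print([f.fingerprint for f in closed])
```

Only the job that produced the seen set can run this safely, which is the reason the decision keeps auto-close inside the compare job.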

Decision: Mirror the existing drift lifecycle behavior (as implemented in DriftFindingGenerator):
- New fingerprint → status = new.
- Fingerprint whose finding had reached a terminal status (at minimum, resolved) and is observed again → status = reopened, and reopened_at is set.
- Existing open finding → do not overwrite workflow status (avoid resetting triaged/in_progress).
Rationale: This preserves operator workflow state and enables “alert only on new/reopened” logic.
Alternatives considered:
- Always set status = new on every compare (current behavior) → rejected because it can overwrite workflow state.
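
The three lifecycle rules can be sketched as a small transition function. This is a Python illustration, not the DriftFindingGenerator code: the name next_status is hypothetical, and TERMINAL_STATUSES contains only resolved because that is the only terminal status this document names.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

# Assumption: only "resolved" is treated as terminal in this sketch.
TERMINAL_STATUSES = {"resolved"}


def next_status(
    existing_status: Optional[str], now: datetime
) -> Tuple[str, Optional[datetime]]:
    """Return (status, reopened_at) for a fingerprint observed in this compare.

    existing_status is None when no finding exists yet for the fingerprint.
    A reopened_at of None means "leave the stored value untouched".
    """
    if existing_status is None:
        return "new", None
    if existing_status in TERMINAL_STATUSES:
        return "reopened", now
    # Open finding: keep the operator's workflow status (triaged, in_progress, ...).
    return existing_status, None


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(next_status(None, now))
print(next_status("resolved", now))
print(next_status("triaged", now))
```

Because the third branch returns the existing status unchanged, a repeated compare never resets triaged or in_progress back to new.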

Decision: Set fingerprint_key to a stable string derived from the finding fingerprint (e.g. finding_fingerprint:{fingerprint}) for baseline drift events.
Rationale: Alert delivery dedupe uses fingerprint_key (or idempotency_key) via AlertFingerprintService.
Alternatives considered:
- Use finding:{id} → rejected because it ties dedupe to a DB surrogate rather than the domain fingerprint.
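
A minimal Python sketch of the key derivation follows. The helper name baseline_fingerprint_key is hypothetical; only the finding_fingerprint:{fingerprint} format comes from the decision above.

```python
def baseline_fingerprint_key(fingerprint: str) -> str:
    """Build the dedupe key from the domain fingerprint, not from a row id."""
    return f"finding_fingerprint:{fingerprint}"


# Two rows for the same drift (for example after a delete and re-create)
# carry different surrogate ids but the same domain fingerprint.
row_a = {"id": 101, "fingerprint": "9f2c"}
row_b = {"id": 245, "fingerprint": "9f2c"}

same_key = baseline_fingerprint_key(row_a["fingerprint"]) == baseline_fingerprint_key(
    row_b["fingerprint"]
)
print(same_key)
```

With finding:{id} the two rows would produce different keys, so the same drift could alert twice.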

Decision: Add two new alert event types and produce them in EvaluateAlertsJob:
- baseline_high_drift: for baseline compare findings (source = baseline.compare) that are new/reopened in the evaluation window and meet severity threshold.
- baseline_compare_failed: for OperationRun.type = baseline_compare with outcome in {failed, partially_succeeded} in the evaluation window.
Rationale: The spec requires strict separation from generic drift alerts and precise triggering rules.
Alternatives considered:
- Reuse high_drift / compare_failed → rejected because it would mix baseline and non-baseline meaning.
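
The two triggering rules can be sketched as filters over findings and runs. This Python illustration is not the EvaluateAlertsJob code, and several details are assumptions: the severity names and their ordering, the field names status_changed_at and completed_at, the run:{id} key for failed runs, and the function name baseline_events.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Tuple

# Assumed severity scale; the source only says a severity threshold exists.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass
class Finding:
    fingerprint: str
    source: str
    status: str
    severity: str
    status_changed_at: datetime  # hypothetical field name


@dataclass
class Run:
    id: int
    type: str
    outcome: str
    completed_at: datetime  # hypothetical field name


def baseline_events(
    findings: Iterable[Finding],
    runs: Iterable[Run],
    window_start: datetime,
    min_severity: str,
) -> List[Tuple[str, str]]:
    """Return (event_type, fingerprint_key) pairs for the evaluation window."""
    events: List[Tuple[str, str]] = []
    threshold = SEVERITY_RANK[min_severity]
    for f in findings:
        if (
            f.source == "baseline.compare"
            and f.status in {"new", "reopened"}
            and f.status_changed_at >= window_start
            and SEVERITY_RANK[f.severity] >= threshold
        ):
            events.append(("baseline_high_drift", f"finding_fingerprint:{f.fingerprint}"))
    for r in runs:
        if (
            r.type == "baseline_compare"
            and r.outcome in {"failed", "partially_succeeded"}
            and r.completed_at >= window_start
        ):
            # "run:{id}" is a placeholder key chosen for this sketch.
            events.append(("baseline_compare_failed", f"run:{r.id}"))
    return events
```

Filtering on source and type first is what keeps baseline events from overlapping with the generic high_drift and compare_failed events.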

Decision: Reuse the existing per-rule cooldown + quiet-hours suppression implemented in AlertDispatchService (no baseline-specific cooldown setting).
Rationale: Matches spec clarification and existing patterns.
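
For readers unfamiliar with this kind of suppression, here is a generic Python sketch of a per-rule cooldown combined with a quiet-hours window. It does not describe how AlertDispatchService is actually implemented; the function names, parameters, and the rule that a window whose start is later than its end wraps past midnight are all assumptions.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True when now falls in [start, end), allowing the window to wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def is_suppressed(
    now: datetime,
    last_sent_at: Optional[datetime],
    cooldown: timedelta,
    quiet_start: Optional[time] = None,
    quiet_end: Optional[time] = None,
) -> bool:
    """Suppress when the rule is still cooling down or quiet hours are active."""
    if last_sent_at is not None and now - last_sent_at < cooldown:
        return True
    if quiet_start is not None and quiet_end is not None:
        return in_quiet_hours(now.time(), quiet_start, quiet_end)
    return False
```

Because baseline events pass through the same check as every other rule, no baseline-specific cooldown setting is needed.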

Decision: Implement baseline settings using the existing SettingsRegistry/SettingsResolver/SettingsWriter system with new keys under a new baseline domain:
- baseline.severity_mapping (json map with restricted keys)
- baseline.alert_min_severity (string)
- baseline.auto_close_enabled (bool)
Rationale: This matches existing settings infrastructure and ensures consistent “effective value” semantics.
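
The “effective value” semantics and the restricted-key map can be sketched as follows. This Python illustration is not the SettingsRegistry/SettingsResolver API: the default values, the allowed mapping keys, the severity names, and both function names are placeholders invented for the example.

```python
from typing import Any, Mapping

# Placeholder defaults; the real defaults are not stated in this document.
DEFAULTS = {
    "baseline.severity_mapping": {},
    "baseline.alert_min_severity": "high",
    "baseline.auto_close_enabled": True,
}

# Hypothetical restrictions for the severity mapping.
ALLOWED_MAPPING_KEYS = {"missing_policy", "different_version", "unexpected_policy"}
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def effective_value(key: str, overrides: Mapping[str, Any]) -> Any:
    """Return the stored override if one exists, otherwise the registered default."""
    if key not in DEFAULTS:
        raise KeyError(f"unknown setting: {key}")
    return overrides.get(key, DEFAULTS[key])


def validate_severity_mapping(mapping: Mapping[str, str]) -> None:
    """Reject keys or severities outside the allowed sets."""
    unknown_keys = set(mapping) - ALLOWED_MAPPING_KEYS
    if unknown_keys:
        raise ValueError(f"unsupported mapping keys: {sorted(unknown_keys)}")
    bad_values = set(mapping.values()) - ALLOWED_SEVERITIES
    if bad_values:
        raise ValueError(f"unsupported severities: {sorted(bad_values)}")


workspace_overrides = {"baseline.auto_close_enabled": False}
print(effective_value("baseline.auto_close_enabled", workspace_overrides))
print(effective_value("baseline.alert_min_severity", workspace_overrides))
```

Validating the map at write time keeps unknown keys out of storage, so readers of the setting do not have to defend against them.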

Decision: Keep baseline profile CRUD as workspace-owned (non-tenant scoped) and baseline compare monitoring as tenant-context only.
Rationale: Matches SCOPE-001 and spec FR-018.

Notes

- Ops-UX allowed summary keys are defined in App\Support\OpsUx\OperationSummaryKeys.
- Drift lifecycle patterns exist in App\Services\Drift\DriftFindingGenerator (reopen + resolve stale).
- Alert dispatch dedupe/cooldown/quiet-hours are centralized in App\Services\Alerts\AlertDispatchService and AlertFingerprintService.
- Workspace settings are handled by App\Support\Settings\SettingsRegistry + SettingsResolver + SettingsWriter.