TenantAtlas/specs/115-baseline-operability-alerts/research.md
2026-03-01 03:23:39 +01:00

Research — Baseline Operability & Alert Integration (Spec 115)

This document resolves planning unknowns and records implementation decisions.

Decisions

1) Completeness counters for safe auto-close

  • Decision: Treat the compare run's “completeness counters” as the total, processed, and failed keys of OperationRun.summary_counts.
  • Rationale: Ops-UX contracts already standardize these keys via OperationSummaryKeys::all(); they're the metrics the UI understands for determinate progress.
  • Alternatives considered:
    • Add new keys like total_count / processed_count / failed_item_count → rejected because it would require expanding OperationSummaryKeys::all() and updating Ops-UX guard tests without a strong benefit.
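
As an illustration of how these three counters could gate auto-close, here is a minimal sketch in Python (the codebase itself is PHP). The function name and the exact rule (processed equals total, nothing failed, missing counters treated as "unknown") are assumptions for illustration, not the spec's wording.

```python
from typing import Mapping


def compare_is_complete(summary_counts: Mapping[str, int]) -> bool:
    """Return True only when every item was processed and none failed.

    Hypothetical helper: the rule below is an assumed reading of
    "safe auto-close", not a confirmed implementation.
    """
    total = summary_counts.get("total")
    processed = summary_counts.get("processed")
    failed = summary_counts.get("failed")

    # A missing counter means completeness is unknown, so stay conservative.
    if total is None or processed is None or failed is None:
        return False

    return processed == total and failed == 0
```

Under this reading, a run with partial coverage or any failed item never triggers auto-close, which keeps findings from being resolved on incomplete evidence.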

2) Where auto-close runs

  • Decision: Perform auto-close at the end of CompareBaselineToTenantJob (after findings upsert), using the run's computed “seen” fingerprint set.
  • Rationale: The job already has the full drift result set for the tenant+profile; it's the only place that can reliably know what was evaluated.
  • Alternatives considered:
    • Separate queued job for auto-close → rejected because it adds run coordination and complicates observability for no benefit.
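
The auto-close step can be pictured as a set difference between open findings and the run's “seen” fingerprints. The sketch below is in Python rather than the project's PHP. The Finding shape, the OPEN_STATUSES set, and the resolved_at field are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Set

# Assumed set of non-terminal statuses, based on the statuses named in this document.
OPEN_STATUSES = {"new", "reopened", "triaged", "in_progress"}


@dataclass
class Finding:
    fingerprint: str
    status: str
    resolved_at: Optional[datetime] = None  # hypothetical field name


def auto_close_stale(
    findings: Iterable[Finding], seen: Set[str], now: datetime
) -> List[Finding]:
    """Resolve open findings whose fingerprint was not seen in this run."""
    closed = []
    for finding in findings:
        if finding.status in OPEN_STATUSES and finding.fingerprint not in seen:
            finding.status = "resolved"
            finding.resolved_at = now
            closed.append(finding)
    return closed
```

Because the seen set comes from the same job run that produced the drift results, no second job has to reconstruct what was evaluated.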

3) Baseline finding lifecycle semantics (new vs reopened vs existing open)

  • Decision: Mirror the existing drift lifecycle behavior (as implemented in DriftFindingGenerator):
    • New fingerprint → status = new.
    • Fingerprint whose finding is in a terminal status (at least resolved) and is observed again → status = reopened and set reopened_at.
    • Existing open finding → do not overwrite workflow status (avoid resetting triaged/in_progress).
  • Rationale: This preserves operator workflow state and enables “alert only on new/reopened” logic.
  • Alternatives considered:
    • Always set status = new on every compare (current behavior) → rejected because it can overwrite workflow state.
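
The three lifecycle rules reduce to a small decision function. This is a Python sketch, not the DriftFindingGenerator code. The function name, the TERMINAL_STATUSES set, and the tuple return shape are assumptions.

```python
from typing import Optional, Tuple

# The document only names "resolved" as terminal; other terminal statuses may exist.
TERMINAL_STATUSES = {"resolved"}


def resolve_status(existing_status: Optional[str]) -> Tuple[str, bool]:
    """Return (status_to_store, should_set_reopened_at) for an observed fingerprint.

    existing_status is None when the fingerprint has no finding yet.
    """
    if existing_status is None:
        return ("new", False)
    if existing_status in TERMINAL_STATUSES:
        return ("reopened", True)
    # Open finding: keep the operator's workflow status untouched.
    return (existing_status, False)
```

Only the first two branches yield new or reopened, so an alert rule keyed on those statuses stays quiet for findings an operator is already working on.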

4) Alert deduplication key for baseline drift

  • Decision: Set fingerprint_key to a stable string derived from the finding fingerprint (e.g. finding_fingerprint:{fingerprint}) for baseline drift events.
  • Rationale: Alert delivery dedupe uses fingerprint_key (or idempotency_key) via AlertFingerprintService.
  • Alternatives considered:
    • Use finding:{id} → rejected because it ties dedupe to a DB surrogate rather than the domain fingerprint.
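
To show why a fingerprint-derived key dedupes better than a row id, here is a short Python sketch. The helper names and the event dictionary shape are hypothetical. Only the finding_fingerprint:{fingerprint} format comes from the decision above.

```python
from typing import Dict, Iterable, List


def baseline_fingerprint_key(fingerprint: str) -> str:
    """Build the dedupe key from the domain fingerprint, not the DB id."""
    return f"finding_fingerprint:{fingerprint}"


def dedupe_events(events: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first event per fingerprint_key (illustrative only)."""
    seen_keys = set()
    unique = []
    for event in events:
        key = baseline_fingerprint_key(event["fingerprint"])
        if key not in seen_keys:
            seen_keys.add(key)
            unique.append(event)
    return unique
```

Two events that carry different finding ids but the same fingerprint collapse into one, which is the behavior a finding:{id} key would not give.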

5) Baseline-specific event types

  • Decision: Add two new alert event types and produce them in EvaluateAlertsJob:
    • baseline_high_drift: for baseline compare findings (source = baseline.compare) that are new/reopened in the evaluation window and meet severity threshold.
    • baseline_compare_failed: for OperationRun.type = baseline_compare with outcome in {failed, partially_succeeded} in the evaluation window.
  • Rationale: The spec requires strict separation from generic drift alerts and precise triggering rules.
  • Alternatives considered:
    • Reuse high_drift / compare_failed → rejected because it would mix baseline and non-baseline meaning.
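
The triggering rules for the two event types can be sketched as two predicates. This is Python for illustration, not the EvaluateAlertsJob implementation. The function names, parameter names, and the SEVERITY_RANK ordering are assumptions.

```python
from datetime import datetime
from typing import Optional

# Assumed severity ordering; the real scale is defined in the codebase.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def finding_event_type(
    source: str,
    status: str,
    severity: str,
    changed_at: datetime,
    window_start: datetime,
    min_severity: str,
) -> Optional[str]:
    """Return "baseline_high_drift" when a finding qualifies, else None."""
    if source != "baseline.compare":
        return None
    if status not in ("new", "reopened"):
        return None
    if changed_at < window_start:
        return None
    if SEVERITY_RANK[severity] < SEVERITY_RANK[min_severity]:
        return None
    return "baseline_high_drift"


def run_event_type(
    run_type: str, outcome: str, completed_at: datetime, window_start: datetime
) -> Optional[str]:
    """Return "baseline_compare_failed" when a run qualifies, else None."""
    if run_type != "baseline_compare":
        return None
    if outcome not in ("failed", "partially_succeeded"):
        return None
    if completed_at < window_start:
        return None
    return "baseline_compare_failed"
```

Keeping the source and run-type checks first makes the separation from generic drift alerts explicit: non-baseline findings and runs never reach the severity or outcome checks.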

6) Cooldown behavior for baseline_compare_failed

  • Decision: Reuse the existing per-rule cooldown + quiet-hours suppression implemented in AlertDispatchService (no baseline-specific cooldown setting).
  • Rationale: Matches spec clarification and existing patterns.

7) Workspace settings implementation approach

  • Decision: Implement baseline settings using the existing SettingsRegistry/SettingsResolver/SettingsWriter system with new keys under a new baseline domain:
    • baseline.severity_mapping (json map with restricted keys)
    • baseline.alert_min_severity (string)
    • baseline.auto_close_enabled (bool)
  • Rationale: This matches existing settings infrastructure and ensures consistent “effective value” semantics.
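
The “effective value” idea (a workspace override if present, otherwise a registered default) can be sketched as follows. This is Python for illustration and does not mirror the SettingsRegistry/SettingsResolver API. The default values shown are placeholders, not values taken from the spec.

```python
from typing import Any, Dict, Mapping

# Placeholder defaults; the real defaults are defined by the spec and registry.
DEFAULTS: Dict[str, Any] = {
    "baseline.severity_mapping": {},
    "baseline.alert_min_severity": "high",
    "baseline.auto_close_enabled": True,
}


def effective_value(key: str, workspace_overrides: Mapping[str, Any]) -> Any:
    """Resolve a setting: workspace override first, registered default second."""
    if key not in DEFAULTS:
        raise KeyError(f"unknown setting: {key}")
    default = DEFAULTS[key]
    value = workspace_overrides.get(key, default)
    # Reject overrides whose type differs from the registered default's type.
    if not isinstance(value, type(default)):
        raise TypeError(f"{key} expects {type(default).__name__}")
    return value
```

Resolving every read through one function means a job and a settings page see the same value for a given workspace.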

8) Information architecture (IA) and planes

  • Decision: Keep baseline profile CRUD as workspace-owned (non-tenant scoped) and baseline compare monitoring as tenant-context only.
  • Rationale: Matches SCOPE-001 and spec FR-018.

Notes / Repo Facts Used

  • Ops-UX allowed summary keys are defined in App\Support\OpsUx\OperationSummaryKeys.
  • Drift lifecycle patterns exist in App\Services\Drift\DriftFindingGenerator (reopen + resolve stale).
  • Alert dispatch dedupe/cooldown/quiet-hours are centralized in App\Services\Alerts\AlertDispatchService and AlertFingerprintService.
  • Workspace settings are handled by App\Support\Settings\SettingsRegistry + SettingsResolver + SettingsWriter.