TenantAtlas/specs/118-baseline-drift-engine/data-model.md
ahmido 92704a2f7e Spec 118: Resumable baseline evidence capture + snapshot UX (#143)
Implements Spec 118 baseline drift engine improvements:

- Resumable, budget-aware evidence capture for baseline capture/compare runs (resume token + UI action)
- “Why no findings?” reason-code driven explanations and richer run context panels
- Baseline Snapshot resource (list/detail) with fidelity visibility
- Retention command + schedule for pruning baseline-purpose PolicyVersions
- i18n strings for Baseline Compare landing

Verification:
- `vendor/bin/sail bin pint --dirty --format agent`
- `vendor/bin/sail artisan test --compact --filter=Baseline` (159 passed)

Note:
- `docs/audits/redaction-audit-2026-03-04.md` left untracked (not part of PR).

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #143
2026-03-04 22:34:13 +00:00

# Data Model — Spec 118 Golden Master Deep Drift v2
This document describes the data shapes required to implement full-content baseline capture/compare with quota-aware, resumable evidence capture.
Spec reference: `specs/118-baseline-drift-engine/spec.md`
## Entities (existing)
### `baseline_profiles` (workspace-owned)
- Purpose: defines baseline name, scope, and (new) capture mode.
- Current fields (from repo):
  - `id`, `workspace_id`, `name`, `description`, `version_label`, `status`
  - `scope_jsonb`
  - `active_snapshot_id`
  - `created_by_user_id`
### `baseline_snapshots` (workspace-owned)
- Purpose: immutable baseline snapshot, deduped by a snapshot identity hash.
- Current fields:
  - `id`, `workspace_id`, `baseline_profile_id`
  - `snapshot_identity_hash` (sha256 string)
  - `captured_at`
  - `summary_jsonb`
### `baseline_snapshot_items` (workspace-owned; no tenant identifiers)
- Purpose: per-subject baseline evidence for drift evaluation.
- Current fields:
  - `baseline_snapshot_id`
  - `subject_type` (currently `policy`)
  - `subject_external_id` (legacy column name; MUST NOT store tenant external IDs in Spec 118 flows)
  - `policy_type`
  - `baseline_hash` (fingerprint)
  - `meta_jsonb` (metadata + provenance)
### `policy_versions` (tenant-owned evidence)
- Purpose: immutable captured policy content with assignments/scope tags and hashes, used as content-fidelity evidence.
- Current fields (selected):
  - `tenant_id`, `policy_id`, `policy_type`, `platform`
  - `captured_at`
  - `snapshot`, `metadata`, `assignments`, `scope_tags`
  - `assignments_hash`, `scope_tags_hash`
### `operation_runs` (tenant-owned operational record)
- Purpose: observable lifecycle for capture/compare operations; `summary_counts` is numeric-only and key-whitelisted; diagnostics go in `context`.
### `findings` (tenant-owned drift outcomes)
- Purpose: drift findings produced by compare; recurrence/lifecycle fields already exist in the repo (incl. `recurrence_key`).
## Proposed changes (Spec 118)
### 1) BaselineProfile: add capture mode
**Add column**: `baseline_profiles.capture_mode` (string)
- Allowed values: `meta_only | opportunistic | full_content`
- Default: `opportunistic` (preserves current behavior unless another mode is explicitly selected)
- Validation: only allow known values
### 2) Baseline snapshot item: introduce a cross-tenant subject key
**Add column**: `baseline_snapshot_items.subject_key` (string)
- Meaning: cross-tenant match key for a subject, derived from the subject's normalized display name (`normalized_display_name`)
- Normalization rules: trim, collapse internal whitespace, lowercase
- Index: `index(baseline_snapshot_id, policy_type, subject_key)`
Notes:
- Workspace-owned snapshot items MUST NOT persist tenant identifiers. In Spec 118 flows:
  - `baseline_snapshot_items.subject_external_id` is treated as an opaque, workspace-safe **subject id** derived from `policy_type + subject_key` (e.g. `sha256(policy_type|subject_key)`), solely to satisfy existing uniqueness/lookup needs.
  - Tenant-specific external IDs remain tenant-scoped and live only in tenant-owned tables (`policies`, `inventory_items`, `policy_versions`) and in tenant-scoped `operation_runs.context`.
- `meta_jsonb` stored on snapshot items MUST be baseline-safe (no tenant external IDs, no operation run IDs, no policy version IDs). It should include only cross-tenant metadata like `display_name`, `policy_type`, and a fidelity indicator (`content` vs `meta`).
- Duplicate/ambiguous `subject_key` values within the same policy type are treated as evidence gaps and are not evaluated for drift.
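
The normalization and subject-id rules above can be sketched as follows. This is a minimal illustration, not the repo's implementation: the function names are hypothetical, and the `|` separator simply follows the `sha256(policy_type|subject_key)` example given in the notes.

```python
import hashlib
import re
from collections import Counter


def normalize_subject_key(display_name: str) -> str:
    """Trim, collapse internal whitespace runs to one space, then lowercase."""
    return re.sub(r"\s+", " ", display_name.strip()).lower()


def workspace_safe_subject_id(policy_type: str, subject_key: str) -> str:
    """Opaque id derived from policy_type + subject_key (separator is assumed)."""
    raw = f"{policy_type}|{subject_key}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def split_ambiguous(subjects):
    """Partition (policy_type, display_name) pairs by key uniqueness.

    Keys that occur more than once within a policy type are returned as
    ambiguous, so the caller can record them as evidence gaps instead of
    evaluating them for drift.
    """
    keyed = [(ptype, normalize_subject_key(name)) for ptype, name in subjects]
    counts = Counter(keyed)
    unique = [key for key in keyed if counts[key] == 1]
    ambiguous = sorted({key for key in keyed if counts[key] > 1})
    return unique, ambiguous
```

Because the id is derived only from `policy_type` and `subject_key`, it carries no tenant identifier, which is the property the notes above require.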
### 3) PolicyVersion: purpose tagging + traceability
**Add columns** (all nullable except `capture_purpose`):
- `policy_versions.capture_purpose` (string)
  - Allowed: `backup | baseline_capture | baseline_compare`
  - Default for existing rows: `backup` (or null → treated as `backup` at read time; exact backfill strategy documented in migration plan)
- `policy_versions.operation_run_id` (unsigned bigint, nullable) → FK to `operation_runs.id`
- `policy_versions.baseline_profile_id` (unsigned bigint, nullable) → FK to `baseline_profiles.id`
**Indexes** (for audit/debug + idempotency checks):
- `(tenant_id, policy_id, capture_purpose, captured_at desc)`
- `(tenant_id, capture_purpose, operation_run_id)`
- `(tenant_id, capture_purpose, baseline_profile_id)`
Retention:
- Baseline-purpose evidence is eligible for shorter retention (configurable) than long-term backup evidence.
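
One way to read the purpose-tagging and retention rules is sketched below. The helper names and the `retention_days` parameter are hypothetical; the document only says that the baseline retention window is configurable, so the value passed in is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ALLOWED_PURPOSES = ("backup", "baseline_capture", "baseline_compare")
BASELINE_PURPOSES = ("baseline_capture", "baseline_compare")


def effective_capture_purpose(stored: Optional[str]) -> str:
    """Treat a null purpose as `backup` at read time (pre-backfill rows)."""
    if stored is None:
        return "backup"
    if stored not in ALLOWED_PURPOSES:
        raise ValueError(f"unknown capture_purpose: {stored!r}")
    return stored


def eligible_for_baseline_pruning(
    stored: Optional[str],
    captured_at: datetime,
    now: datetime,
    retention_days: int,  # hypothetical config value
) -> bool:
    """Only baseline-purpose rows older than the window are prunable here."""
    if effective_capture_purpose(stored) not in BASELINE_PURPOSES:
        return False
    return captured_at < now - timedelta(days=retention_days)
```

In this sketch, rows with a null or `backup` purpose are never selected, so long-term backup evidence stays outside the shorter baseline retention window.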
### 4) OperationRun context: baseline capture/compare contract
Baseline runs should populate `operation_runs.context` with stable, operator-facing keys:
```json
{
  "target_scope": {
    "entra_tenant_id": "...",
    "entra_tenant_name": "...",
    "directory_context_id": "..."
  },
  "baseline_profile_id": 123,
  "baseline_snapshot_id": 456,
  "capture_mode": "full_content",
  "effective_scope": {
    "policy_types": ["..."],
    "foundation_types": ["..."],
    "all_types": ["..."]
  },
  "baseline_capture": {
    "subjects_total": 500,
    "evidence_capture": {
      "requested": 200,
      "succeeded": 180,
      "skipped": 10,
      "failed": 10,
      "throttled": 0
    },
    "gaps": {
      "count": 25,
      "top_reasons": ["forbidden", "throttled", "ambiguous_match"]
    },
    "resume_token": "opaque_token_string"
  },
  "baseline_compare": {
    "inventory_sync_run_id": 999,
    "since": "2026-03-03T09:00:00Z",
    "coverage": {
      "proof": true,
      "effective_types": ["..."],
      "covered_types": ["..."],
      "uncovered_types": ["..."]
    },
    "fidelity": "content|meta|mixed",
    "evidence_capture": {
      "requested": 200,
      "succeeded": 180,
      "skipped": 10,
      "failed": 10,
      "throttled": 0
    },
    "evidence_gaps": {
      "missing_current": 20,
      "ambiguous_match": 3
    },
    "reason_code": "no_subjects_in_scope|coverage_unproven|evidence_capture_incomplete|rollout_disabled|no_drift_detected|..."
  }
}
```
Notes:
- `target_scope` is required for Monitoring UI (“Target” display).
- Rich diagnostics remain in `context`; `summary_counts` stays within the numeric key whitelist.
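
As an illustration of how the `evidence_capture` block might be assembled, the sketch below tallies per-subject outcomes into the keys shown in the contract. It assumes that every requested subject ends in exactly one of the four outcomes, which matches the example numbers (180 + 10 + 10 + 0 = 200) but is not stated as an invariant in this document. The function name is hypothetical.

```python
from collections import Counter

OUTCOMES = ("succeeded", "skipped", "failed", "throttled")


def evidence_capture_stats(outcomes: list) -> dict:
    """Tally one outcome string per requested subject into the contract keys."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    stats = {"requested": len(outcomes)}
    # Always emit every key, including zero counts, so the shape stays stable.
    stats.update({name: counts[name] for name in OUTCOMES})
    return stats
```

Emitting zero-valued keys keeps the block's shape identical across runs, which is convenient for an operator-facing panel that reads these keys directly.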
## Migration strategy
1) Add `baseline_profiles.capture_mode`.
2) Add `baseline_snapshot_items.subject_key` + index.
3) Add `policy_versions.capture_purpose`, `operation_run_id`, `baseline_profile_id` + indexes.
4) Backfill strategy:
   - Existing `policy_versions` rows: set `capture_purpose = backup` (or treat null as backup in code until backfill finishes).
   - Existing baseline snapshot items: set `subject_key` from stored `meta_jsonb.display_name` when available (else empty; treated as gap in new logic).
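
The snapshot-item backfill could look roughly like the sketch below. It assumes `meta_jsonb` has been decoded into a dictionary and applies the trim, collapse-whitespace, lowercase normalization described for `subject_key`; the function name is hypothetical.

```python
import re
from typing import Any, Mapping, Optional


def backfill_subject_key(meta: Optional[Mapping[str, Any]]) -> str:
    """Derive subject_key from meta_jsonb.display_name, else return ""."""
    display_name = (meta or {}).get("display_name")
    if not isinstance(display_name, str):
        # Missing or non-string name: leave the key empty so the new
        # compare logic treats the item as an evidence gap.
        return ""
    return re.sub(r"\s+", " ", display_name.strip()).lower()
```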
## Validation rules
- `capture_mode` must be one of: `meta_only`, `opportunistic`, `full_content`.
- `subject_key` must be non-empty to be eligible for drift evaluation.
- For full-content capture mode:
  - Capture/compare runs must record evidence capture stats and gaps.
  - Compare must not emit “missing policy” findings for uncovered policy types.
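
A minimal sketch of these validation rules follows. The class and function names are hypothetical, and treating a whitespace-only `subject_key` as empty is an assumption layered on the "non-empty" rule.

```python
from enum import Enum


class CaptureMode(str, Enum):
    META_ONLY = "meta_only"
    OPPORTUNISTIC = "opportunistic"
    FULL_CONTENT = "full_content"


def parse_capture_mode(value: str) -> CaptureMode:
    """Accept only the three known capture modes."""
    try:
        return CaptureMode(value)
    except ValueError:
        allowed = ", ".join(mode.value for mode in CaptureMode)
        raise ValueError(f"capture_mode must be one of: {allowed}") from None


def is_drift_eligible(subject_key: str) -> bool:
    """A subject needs a non-empty key to be evaluated for drift."""
    return bool(subject_key.strip())


def may_emit_missing_policy_finding(policy_type: str, covered_types: set) -> bool:
    """Suppress "missing policy" findings for policy types without coverage."""
    return policy_type in covered_types
```

The last helper corresponds to the full-content rule: when a policy type is not in the covered set, an absent policy is a coverage gap and should not be reported as drift.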