# Data Model — Spec 118 Golden Master Deep Drift v2 This document describes the data shapes required to implement full-content baseline capture/compare with quota-aware, resumable evidence capture. Spec reference: `/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/118-baseline-drift-engine/spec.md` ## Entities (existing) ### `baseline_profiles` (workspace-owned) - Purpose: defines baseline name, scope, and (new) capture mode. - Current fields (from repo): - `id`, `workspace_id`, `name`, `description`, `version_label`, `status` - `scope_jsonb` - `active_snapshot_id` - `created_by_user_id` ### `baseline_snapshots` (workspace-owned) - Purpose: immutable baseline snapshot, deduped by a snapshot identity hash. - Current fields: - `id`, `workspace_id`, `baseline_profile_id` - `snapshot_identity_hash` (sha256 string) - `captured_at` - `summary_jsonb` ### `baseline_snapshot_items` (workspace-owned; no tenant identifiers) - Purpose: per-subject baseline evidence for drift evaluation. - Current fields: - `baseline_snapshot_id` - `subject_type` (currently `policy`) - `subject_external_id` (legacy column name; MUST NOT store tenant external IDs in Spec 118 flows) - `policy_type` - `baseline_hash` (fingerprint) - `meta_jsonb` (metadata + provenance) ### `policy_versions` (tenant-owned evidence) - Purpose: immutable captured policy content with assignments/scope tags and hashes, used as content-fidelity evidence. - Current fields (selected): - `tenant_id`, `policy_id`, `policy_type`, `platform` - `captured_at` - `snapshot`, `metadata`, `assignments`, `scope_tags` - `assignments_hash`, `scope_tags_hash` ### `operation_runs` (tenant-owned operational record) - Purpose: observable lifecycle for capture/compare operations; `summary_counts` is numeric-only and key-whitelisted; diagnostics go in `context`. ### `findings` (tenant-owned drift outcomes) - Purpose: drift findings produced by compare; recurrence/lifecycle fields already exist in the repo (incl. `recurrence_key`). ## Proposed changes (Spec 118) ### 1) BaselineProfile: add capture mode **Add column**: `baseline_profiles.capture_mode` (string) - Allowed values: `meta_only | opportunistic | full_content` - Default: `opportunistic` (maintains current behavior unless explicitly enabled) - Validation: only allow known values ### 2) Baseline snapshot item: introduce a cross-tenant subject key **Add column**: `baseline_snapshot_items.subject_key` (string) - Meaning: cross-tenant match key for a subject: `normalized_display_name` - Normalization rules: trim, collapse internal whitespace, lowercase - Index: `index(baseline_snapshot_id, policy_type, subject_key)` Notes: - Workspace-owned snapshot items MUST NOT persist tenant identifiers. In Spec 118 flows: - `baseline_snapshot_items.subject_external_id` is treated as an opaque, workspace-safe **subject id** derived from `policy_type + subject_key` (e.g. `sha256(policy_type|subject_key)`), solely to satisfy existing uniqueness/lookup needs. - Tenant-specific external IDs remain tenant-scoped and live only in tenant-owned tables (`policies`, `inventory_items`, `policy_versions`) and in tenant-scoped `operation_runs.context`. - `meta_jsonb` stored on snapshot items MUST be baseline-safe (no tenant external IDs, no operation run IDs, no policy version IDs). It should include only cross-tenant metadata like `display_name`, `policy_type`, and a fidelity indicator (`content` vs `meta`). - Duplicate/ambiguous `subject_key` values within the same policy type are treated as evidence gaps and are not evaluated for drift. ### 3) PolicyVersion: purpose tagging + traceability **Add columns** (all nullable except purpose): - `policy_versions.capture_purpose` (string) - Allowed: `backup | baseline_capture | baseline_compare` - Default for existing rows: `backup` (or null → treated as `backup` at read time; exact backfill strategy documented in migration plan) - `policy_versions.operation_run_id` (unsigned bigint, nullable) → FK to `operation_runs.id` - `policy_versions.baseline_profile_id` (unsigned bigint, nullable) → FK to `baseline_profiles.id` **Indexes** (for audit/debug + idempotency checks): - `(tenant_id, policy_id, capture_purpose, captured_at desc)` - `(tenant_id, capture_purpose, operation_run_id)` - `(tenant_id, capture_purpose, baseline_profile_id)` Retention: - Baseline-purpose evidence is eligible for shorter retention (configurable) than long-term backup evidence. ### 4) OperationRun context: baseline capture/compare contract Baseline runs should populate `operation_runs.context` with stable, operator-facing keys: ```json { "target_scope": { "entra_tenant_id": "...", "entra_tenant_name": "...", "directory_context_id": "..." }, "baseline_profile_id": 123, "baseline_snapshot_id": 456, "capture_mode": "full_content", "effective_scope": { "policy_types": ["..."], "foundation_types": ["..."], "all_types": ["..."] }, "baseline_capture": { "subjects_total": 500, "evidence_capture": { "requested": 200, "succeeded": 180, "skipped": 10, "failed": 10, "throttled": 0 }, "gaps": { "count": 25, "top_reasons": ["forbidden", "throttled", "ambiguous_match"] }, "resume_token": "opaque_token_string" }, "baseline_compare": { "inventory_sync_run_id": 999, "since": "2026-03-03T09:00:00Z", "coverage": { "proof": true, "effective_types": ["..."], "covered_types": ["..."], "uncovered_types": ["..."] }, "fidelity": "content|meta|mixed", "evidence_capture": { "requested": 200, "succeeded": 180, "skipped": 10, "failed": 10, "throttled": 0 }, "evidence_gaps": { "missing_current": 20, "ambiguous_match": 3 }, "reason_code": "no_subjects_in_scope|coverage_unproven|evidence_capture_incomplete|rollout_disabled|no_drift_detected|..." } } ``` Notes: - `target_scope` is required for Monitoring UI (“Target” display). - Rich diagnostics remain in `context`; `summary_counts` stays within the numeric key whitelist. ## Migration strategy 1) Add `baseline_profiles.capture_mode`. 2) Add `baseline_snapshot_items.subject_key` + index. 3) Add `policy_versions.capture_purpose`, `operation_run_id`, `baseline_profile_id` + indexes. 4) Backfill strategy: - Existing `policy_versions` rows: set `capture_purpose = backup` (or treat null as backup in code until backfill finishes). - Existing baseline snapshot items: set `subject_key` from stored `meta_jsonb.display_name` when available (else empty; treated as gap in new logic). ## Validation rules - `capture_mode` must be one of: `meta_only`, `opportunistic`, `full_content`. - `subject_key` must be non-empty to be eligible for drift evaluation. - For full-content capture mode: - Capture/compare runs must record evidence capture stats and gaps. - Compare must not emit “missing policy” findings for uncovered policy types.