TenantAtlas/specs/118-baseline-drift-engine/data-model.md

# Data Model — Spec 118 Golden Master Deep Drift v2

This document describes the data shapes required to implement full-content baseline capture/compare with quota-aware, resumable evidence capture.

Spec reference: `/Users/ahmeddarrazi/Documents/projects/TenantAtlas/specs/118-baseline-drift-engine/spec.md`

## Entities (existing)

### `baseline_profiles` (workspace-owned)

- Purpose: defines baseline name, scope, and (new) capture mode.
- Current fields (from repo):
  - `id`, `workspace_id`, `name`, `description`, `version_label`, `status`
  - `scope_jsonb`
  - `active_snapshot_id`
  - `created_by_user_id`

### `baseline_snapshots` (workspace-owned)

- Purpose: immutable baseline snapshot, deduped by a snapshot identity hash.
- Current fields:
  - `id`, `workspace_id`, `baseline_profile_id`
  - `snapshot_identity_hash` (sha256 string)
  - `captured_at`
  - `summary_jsonb`

### `baseline_snapshot_items` (workspace-owned; no tenant identifiers)

- Purpose: per-subject baseline evidence for drift evaluation.
- Current fields:
  - `baseline_snapshot_id`
  - `subject_type` (currently `policy`)
  - `subject_external_id` (legacy column name; MUST NOT store tenant external IDs in Spec 118 flows)
  - `policy_type`
  - `baseline_hash` (fingerprint)
  - `meta_jsonb` (metadata + provenance)

### `policy_versions` (tenant-owned evidence)

- Purpose: immutable captured policy content with assignments/scope tags and hashes, used as content-fidelity evidence.
- Current fields (selected):
  - `tenant_id`, `policy_id`, `policy_type`, `platform`
  - `captured_at`
  - `snapshot`, `metadata`, `assignments`, `scope_tags`
  - `assignments_hash`, `scope_tags_hash`

### `operation_runs` (tenant-owned operational record)

- Purpose: observable lifecycle for capture/compare operations; `summary_counts` is numeric-only and key-whitelisted; diagnostics go in `context`.

### `findings` (tenant-owned drift outcomes)

- Purpose: drift findings produced by compare; recurrence/lifecycle fields already exist in the repo (incl. `recurrence_key`).

## Proposed changes (Spec 118)

### 1) BaselineProfile: add capture mode

**Add column**: `baseline_profiles.capture_mode` (string)

- Allowed values: `meta_only | opportunistic | full_content`
- Default: `opportunistic` (preserves current behavior unless a different mode is explicitly selected)
- Validation: only allow known values

### 2) Baseline snapshot item: introduce a cross-tenant subject key

**Add column**: `baseline_snapshot_items.subject_key` (string)

- Meaning: cross-tenant match key for a subject, set to the normalized display name (`normalized_display_name`)
- Normalization rules: trim, collapse internal whitespace, lowercase
- Index: `index(baseline_snapshot_id, policy_type, subject_key)`
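
The normalization rules above (trim, collapse internal whitespace, lowercase) can be sketched as follows. This is a minimal illustration, not the repo's implementation; the function name `normalize_subject_key` and the handling of a missing display name are assumptions.

```python
from typing import Optional


def normalize_subject_key(display_name: Optional[str]) -> str:
    """Build a cross-tenant subject key from a display name.

    Hypothetical helper: the name and the None handling are assumptions.
    """
    if display_name is None:
        # A missing name yields an empty key, which is not eligible for drift evaluation.
        return ""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with one space trims and collapses.
    return " ".join(display_name.split()).lower()
```

For example, `normalize_subject_key("  Windows   Baseline\tPolicy ")` yields `"windows baseline policy"`.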

Notes:
- Workspace-owned snapshot items MUST NOT persist tenant identifiers. In Spec 118 flows:
  - `baseline_snapshot_items.subject_external_id` is treated as an opaque, workspace-safe **subject id** derived from `policy_type + subject_key` (e.g. `sha256(policy_type|subject_key)`), solely to satisfy existing uniqueness/lookup needs.
  - Tenant-specific external IDs remain tenant-scoped and live only in tenant-owned tables (`policies`, `inventory_items`, `policy_versions`) and in tenant-scoped `operation_runs.context`.
- `meta_jsonb` stored on snapshot items MUST be baseline-safe (no tenant external IDs, no operation run IDs, no policy version IDs). It should include only cross-tenant metadata like `display_name`, `policy_type`, and a fidelity indicator (`content` vs `meta`).
- Duplicate/ambiguous `subject_key` values within the same policy type are treated as evidence gaps and are not evaluated for drift.
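
The notes above can be sketched as two small helpers: one derives the opaque subject id from `policy_type` and `subject_key` using the `sha256(policy_type|subject_key)` form given as an example, and one separates unique subjects from ambiguous ones. The function names and the tuple-based input shape are assumptions made for illustration.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Subject = Tuple[str, str]  # (policy_type, subject_key); shape is an assumption


def derive_subject_id(policy_type: str, subject_key: str) -> str:
    """Opaque, workspace-safe id: sha256 over 'policy_type|subject_key'."""
    return hashlib.sha256(f"{policy_type}|{subject_key}".encode("utf-8")).hexdigest()


def split_ambiguous(subjects: Iterable[Subject]) -> Tuple[Dict[Subject, str], List[Subject]]:
    """Return (unique subjects mapped to their subject id, ambiguous subjects).

    A subject is ambiguous when the same subject_key occurs more than once
    within the same policy type; such subjects are evidence gaps and are
    excluded from drift evaluation.
    """
    counts = Counter(subjects)
    unique = {pair: derive_subject_id(*pair) for pair, n in counts.items() if n == 1}
    ambiguous = sorted(pair for pair, n in counts.items() if n > 1)
    return unique, ambiguous
```

Because the id is derived only from cross-tenant inputs, the same subject produces the same id in every tenant, and no tenant external ID is ever hashed or stored.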

### 3) PolicyVersion: purpose tagging + traceability

**Add columns** (all nullable except `capture_purpose`):

- `policy_versions.capture_purpose` (string)
  - Allowed: `backup | baseline_capture | baseline_compare`
  - Default for existing rows: `backup` (or null → treated as `backup` at read time; exact backfill strategy documented in migration plan)
- `policy_versions.operation_run_id` (unsigned bigint, nullable) → FK to `operation_runs.id`
- `policy_versions.baseline_profile_id` (unsigned bigint, nullable) → FK to `baseline_profiles.id`
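
If the read-time fallback is chosen over an immediate backfill, resolving the effective purpose might look like the sketch below. The helper name `effective_capture_purpose` is hypothetical, and rejecting unknown values with an exception is an assumption rather than documented behavior.

```python
from typing import Optional

ALLOWED_CAPTURE_PURPOSES = frozenset({"backup", "baseline_capture", "baseline_compare"})


def effective_capture_purpose(stored: Optional[str]) -> str:
    """Resolve the purpose of a policy_versions row at read time."""
    if stored is None:
        # Rows written before the column existed are treated as backups.
        return "backup"
    if stored not in ALLOWED_CAPTURE_PURPOSES:
        # Assumption: unknown values are a data error, not silently coerced.
        raise ValueError(f"unknown capture_purpose: {stored!r}")
    return stored
```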

**Indexes** (for audit/debug + idempotency checks):

- `(tenant_id, policy_id, capture_purpose, captured_at desc)`
- `(tenant_id, capture_purpose, operation_run_id)`
- `(tenant_id, capture_purpose, baseline_profile_id)`

Retention:
- Baseline-purpose evidence is eligible for shorter retention (configurable) than long-term backup evidence.

### 4) OperationRun context: baseline capture/compare contract

Baseline runs should populate `operation_runs.context` with stable, operator-facing keys:

```json
{
  "target_scope": {
    "entra_tenant_id": "...",
    "entra_tenant_name": "...",
    "directory_context_id": "..."
  },
  "baseline_profile_id": 123,
  "baseline_snapshot_id": 456,
  "capture_mode": "full_content",
  "effective_scope": {
    "policy_types": ["..."],
    "foundation_types": ["..."],
    "all_types": ["..."]
  },
  "baseline_capture": {
    "subjects_total": 500,
    "evidence_capture": {
      "requested": 200,
      "succeeded": 180,
      "skipped": 10,
      "failed": 10,
      "throttled": 0
    },
    "gaps": {
      "count": 25,
      "top_reasons": ["forbidden", "throttled", "ambiguous_match"]
    },
    "resume_token": "opaque_token_string"
  },
  "baseline_compare": {
    "inventory_sync_run_id": 999,
    "since": "2026-03-03T09:00:00Z",
    "coverage": {
      "proof": true,
      "effective_types": ["..."],
      "covered_types": ["..."],
      "uncovered_types": ["..."]
    },
    "fidelity": "content|meta|mixed",
    "evidence_capture": {
      "requested": 200,
      "succeeded": 180,
      "skipped": 10,
      "failed": 10,
      "throttled": 0
    },
    "evidence_gaps": {
      "missing_current": 20,
      "ambiguous_match": 3
    },
    "reason_code": "no_subjects_in_scope|coverage_unproven|evidence_capture_incomplete|rollout_disabled|no_drift_detected|..."
  }
}
```

Notes:
- `target_scope` is required by the Monitoring UI (“Target” display).
- Rich diagnostics remain in `context`; `summary_counts` stays within the numeric key whitelist.
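
One way to keep `summary_counts` within its whitelist while leaving the full diagnostics in `context` is to project the numeric counters through an allow-list. The sketch below is illustrative only: the function name and the example key set are assumptions, since this document does not list the actual whitelisted keys.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical allow-list for illustration; the real whitelist lives in the repo.
EXAMPLE_ALLOWED_KEYS: FrozenSet[str] = frozenset({"succeeded", "failed", "skipped"})


def build_summary_counts(counters: Dict[str, Any], allowed_keys: FrozenSet[str]) -> Dict[str, int]:
    """Keep only whitelisted keys whose values are integers.

    bool is a subclass of int in Python, so booleans are excluded explicitly.
    """
    return {
        key: value
        for key, value in counters.items()
        if key in allowed_keys and isinstance(value, int) and not isinstance(value, bool)
    }
```

Applied to the `evidence_capture` object from the example above, this keeps `succeeded`, `skipped`, and `failed` and drops `requested` and `throttled`, which stay available in `context`.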

## Migration strategy

1) Add `baseline_profiles.capture_mode`.
2) Add `baseline_snapshot_items.subject_key` + index.
3) Add `policy_versions.capture_purpose`, `operation_run_id`, `baseline_profile_id` + indexes.
4) Backfill strategy:
   - Existing `policy_versions` rows: set `capture_purpose = backup` (or treat null as backup in code until backfill finishes).
   - Existing baseline snapshot items: set `subject_key` from the stored `meta_jsonb.display_name` when available; otherwise leave it empty, which the new logic treats as an evidence gap.

## Validation rules

- `capture_mode` must be one of: `meta_only`, `opportunistic`, `full_content`.
- `subject_key` must be non-empty to be eligible for drift evaluation.
- For full-content capture mode:
  - Capture/compare runs must record evidence capture stats and gaps.
  - Compare must not emit “missing policy” findings for uncovered policy types.
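
The rules above can be sketched as two checks: a capture mode validator, and a filter that yields "missing policy" candidates only for subjects that are eligible and whose policy type is covered. All function names and input shapes here are assumptions made for illustration, not the compare engine's actual interface.

```python
from typing import Dict, Iterable, List, Set, Tuple

CAPTURE_MODES = ("meta_only", "opportunistic", "full_content")


def validate_capture_mode(value: str) -> str:
    """Reject anything outside the known capture modes."""
    if value not in CAPTURE_MODES:
        raise ValueError(f"invalid capture_mode: {value!r}")
    return value


def missing_policy_candidates(
    baseline_items: Iterable[Dict[str, str]],
    current_keys: Set[Tuple[str, str]],
    covered_types: Set[str],
) -> List[Tuple[str, str]]:
    """Return (policy_type, subject_key) pairs that may be reported as missing."""
    candidates = []
    for item in baseline_items:
        policy_type, subject_key = item["policy_type"], item["subject_key"]
        if not subject_key:
            continue  # empty key: evidence gap, not eligible for drift evaluation
        if policy_type not in covered_types:
            continue  # uncovered type: absence cannot be proven, so no finding
        if (policy_type, subject_key) not in current_keys:
            candidates.append((policy_type, subject_key))
    return candidates
```

The coverage check comes before the absence check so that an incomplete inventory sync can never be mistaken for a deleted policy.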