TenantAtlas/specs/040-inventory-core/spec.md
ahmido 8ae7a7234e feat/040-inventory-core (#43)
Summary

Implements Inventory Core (Spec 040): a tenant-scoped, mutable “last observed” inventory catalog + sync run logging, with deterministic selection hashing and safe derived “missing” semantics.

This establishes the foundation for Inventory UI (041), Dependencies Graph (042), Compare/Promotion (043), and Drift (044).

What’s included
	•	DB schema
	•	inventory_items (unique: tenant_id + policy_type + external_id; indexes; last_seen_at, last_seen_run_id)
	•	inventory_sync_runs (tenant_id, selection_hash/payload, status, started/finished, counts, error_codes, correlation_id)
	•	Selection hashing
	•	Deterministic selection_hash via canonical JSON (sorted keys + sorted arrays) + sha256
	•	Sync semantics
	•	Idempotent upsert (no duplicates)
	•	Updates last_seen_* when observed
	•	Enforces tenant scoping for all reads/writes
	•	Guardrail: inventory sync does not create snapshots/backups
	•	Missing semantics (derived)
	•	“missing” computed relative to latest completed run for same (tenant_id, selection_hash)
	•	Low confidence when latest run is partial/failed or had_errors=true
	•	Selection isolation (runs for other selections don’t affect missing)
	•	deleted is reserved (not produced here)
	•	Safety
	•	meta_jsonb whitelist enforced (unknown keys dropped; never fail sync)
	•	Safe error persistence (no bearer tokens / secrets)
	•	Locking to prevent overlapping runs for same tenant+selection
	•	Concurrency limiter (global + per-tenant) and throttling resilience (429/503 backoff + jitter)

Tests

Added Pest coverage for:
	•	selection_hash determinism (array order invariant)
	•	upsert idempotency + last_seen updates
	•	missing derived semantics + selection isolation
	•	low confidence missing on partial/had_errors
	•	meta whitelist drop (no exception)
	•	lock prevents overlapping runs
	•	no snapshots/backups side effects
	•	safe error persistence (no bearer tokens)

Non-goals
	•	Inventory UI pages/resources (Spec 041)
	•	Dependency graph hydration (Spec 042)
	•	Cross-tenant compare/promotion flows (Spec 043)
	•	Drift analysis dashboards (Spec 044)

Review focus
	•	Data model correctness + indexes/constraints
	•	Selection hash canonicalization (determinism)
	•	Missing semantics (latest completed run + confidence rule)
	•	Guardrails (no snapshot/backups side effects)
	•	Safety: error_code taxonomy + safe persistence/logging

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #43
2026-01-07 14:54:24 +00:00

# Feature Specification: Inventory Core (Sync + Catalog)
**Feature Branch**: `feat/040-inventory-core`
**Created**: 2026-01-07
**Status**: Draft
## Overview
TenantPilot needs a reliable, tenant-scoped inventory catalog that represents what the system last observed in Microsoft Intune. This inventory is used as the primary substrate for analysis, reporting, monitoring, and UI visibility.
**Key intent:** Inventory is a “last observed” catalog (TenantPilot's truth), not an absolute truth about Intune completeness.
**Non-goal:** A sync MUST NOT create snapshots or backups automatically.
## User Scenarios & Testing *(mandatory)*
### User Story 1 — Run Inventory Sync for a Tenant (Priority: P1)
A tenant admin (or scheduled automation) runs an inventory sync for a tenant to populate/update the inventory catalog.
**Why this priority**: Everything else depends on having a stable, queryable inventory catalog.
**Independent Test**: Run a sync for a tenant and verify inventory items are upserted, tenant-scoped, and last-observed fields update without producing snapshots/backups.
**Acceptance Scenarios**:
1. **Given** a tenant and a configured selection of policy types/categories, **When** a sync completes, **Then** inventory items are upserted for each observed object with correct `tenant_id`, `policy_type`, `external_id`, and `last_seen_at`.
2. **Given** an existing inventory item, **When** the same object is observed again, **Then** the existing record is updated (not duplicated) and `last_seen_at` and `last_seen_run_id` are updated.
3. **Given** a sync selection that excludes some policy types/categories, **When** the sync completes, **Then** only objects within that selection are observed/updated.
4. **Given** a successful sync, **When** the sync finishes, **Then** no policy snapshots/backups are created as a side effect.
---
### User Story 2 — Observe Completeness/Confidence of a Sync (Priority: P1)
A tenant admin views whether missing items are likely “not seen” due to partial/failed sync vs confidently missing in a clean run.
**Why this priority**: Prevents misleading conclusions (e.g., “deleted”) when Graph errors or permissions issues occur.
**Independent Test**: Mark a run as partial/failed and verify missing items are presented as low confidence (derived at query/UI time) and do not imply deletion.
**Acceptance Scenarios**:
1. **Given** a `latestRun` for a tenant+selection that has `status != success` or `had_errors = true`, **When** inventory is queried for missing items relative to that run, **Then** missing is presented as low confidence (and no stronger claim is made).
2. **Given** a `latestRun` for a tenant+selection that is `status = success` and `had_errors = false`, **When** an item was not observed in that run, **Then** the UI can show it as “not seen in latest run” (higher confidence) without implying deletion.
---
### User Story 3 — Monitor Sync Runs (Priority: P2)
A tenant admin (and platform admin) can see sync run history and quickly diagnose failures using stable error codes and counts.
**Why this priority**: Makes automation observable and supportable at MSP scale.
**Independent Test**: Create sync runs with different statuses and verify run records include counts and stable error codes.
**Acceptance Scenarios**:
1. **Given** multiple sync runs, **When** a user views run history, **Then** each run shows status, started/finished timestamps, and counts (observed/updated/errors).
2. **Given** a throttling event, **When** a sync run records it, **Then** the run captures a stable error code (e.g., “graph_throttled”) and does not fail silently.
---
### Edge Cases
- Sync is triggered twice for the same tenant+selection while the first is still running.
- Sync completes with partial results due to transient Graph errors.
- A tenant's permissions change between runs, causing objects to become invisible.
- Selection payload is equivalent but arrays are ordered differently.
## Requirements *(mandatory)*
### Functional Requirements
- **FR-001**: System MUST maintain an Inventory Catalog that represents TenantPilot's last observed state of Intune objects.
- **FR-002**: System MUST upsert inventory items by a stable identity key that prevents duplicates.
- **FR-003**: System MUST record Sync Runs with status, timestamps, counts, and stable error codes.
- **FR-004**: System MUST ensure tenant isolation for all inventory and run queries.
- **FR-005**: System MUST support deterministic selection scoping via `selection_hash` for sync runs.
- **FR-006**: System MUST NOT create snapshots/backups during inventory sync (sync is not backup).
- **FR-007**: System MUST derive “missing” as a computed state relative to the latest completed run for the same tenant+selection.
- **FR-008**: System MUST enforce `meta_jsonb` key whitelisting by dropping unknown keys without failing the sync.
- **FR-009**: System MUST implement safe automation behavior: locking, idempotency, and observable failures.
### Non-Functional Requirements
- **NFR-001 (Concurrency limits)**: Sync automation MUST enforce two limits: a global concurrency limit (across tenants) and a per-tenant concurrency limit.
- **NFR-002 (Throttling resilience)**: Sync MUST handle throttling/transient failures (e.g., 429/503) using backoff + jitter.
- **NFR-003 (Deterministic behavior)**: Selection hashing and capability derivation MUST be deterministic and testable.
- **NFR-004 (Data minimization)**: Inventory MUST store metadata and whitelisted meta only; payload-heavy content belongs to snapshots/backups.
- **NFR-005 (Safe logging)**: Logs MUST not contain secrets/tokens; monitoring MUST rely on run records + error codes.
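
As an illustration of NFR-002, the following is a minimal sketch of exponential backoff with full jitter. It is written in Python purely for illustration and is not the project's implementation. The names `backoff_delay` and `call_with_retry`, the base delay, the cap, and the attempt limit are all assumptions; only the retryable statuses (429/503) come from this spec.

```python
import random
import time

# Statuses named in NFR-002 as throttling/transient.
RETRYABLE_STATUSES = {429, 503}


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)].

    base and cap are hypothetical defaults, not values from the spec.
    """
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(request, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call request() until it returns a non-retryable status or attempts run out.

    request is any zero-argument callable returning (status_code, body).
    The last response is returned even if it is still retryable, so the
    caller can record a stable error code instead of failing silently.
    """
    for attempt in range(max_attempts):
        status, body = request()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, rng=rng))
    return status, body
```

Injecting `sleep` and `rng` keeps the behavior deterministic under test, which fits NFR-003. A real client would likely also honor a `Retry-After` response header when one is present; that is omitted here.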
### Key Entities *(include if feature involves data)*
- **Inventory Item**: A tenant-scoped record representing a single Intune object as last observed (type, external identity, display name/metadata, last observed fields, whitelisted meta).
- **Sync Run**: A tenant-scoped record representing an inventory sync execution for a specific selection (selection_hash, status, timestamps, counts, stable error codes).
- **Selection Payload**: The normalized representation of the run scope used to compute selection_hash.
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: For a given tenant, inventory sync can be executed repeatedly without creating duplicate inventory items.
- **SC-002**: A sync run always produces a run record with status, timestamps, and counts.
- **SC-003**: Missing is computed relative to latest completed run for the same tenant+selection; runs with different selection hashes do not affect each other.
- **SC-004**: Unknown meta keys never break sync and are not persisted.
- **SC-005**: Operators can distinguish “not seen” from “deleted” (deleted is reserved and not produced in this feature).
## Spec Appendix: Deterministic Selection + Missing Semantics (copy/paste-ready)
### Definition: “completed” and “latestRun”
- **Definition:** `completed` means `status ∈ {success, partial, failed, skipped}` and `finished_at != null` (or the equivalent field used by the run model).
- **Definition:** `latestRun` is the latest completed Sync Run for `(tenant_id, selection_hash)`.
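
The two definitions above can be sketched as follows. This is an illustrative Python sketch, not the project's run model: the `SyncRun` dataclass and its field names are hypothetical, and ordering "latest" by `finished_at` is an assumption, since the spec does not state which timestamp decides recency.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Statuses that count as "completed" per the definition above.
COMPLETED_STATUSES = {"success", "partial", "failed", "skipped"}


@dataclass(frozen=True)
class SyncRun:
    """Hypothetical stand-in for the run model."""
    id: int
    tenant_id: str
    selection_hash: str
    status: str
    finished_at: Optional[datetime]
    had_errors: bool = False


def is_completed(run):
    """A run is completed when it has a terminal status and a finish time."""
    return run.status in COMPLETED_STATUSES and run.finished_at is not None


def latest_run(runs, tenant_id, selection_hash):
    """Return the most recently finished completed run for the pair, or None.

    Assumption: recency is decided by finished_at.
    """
    candidates = [
        r for r in runs
        if r.tenant_id == tenant_id
        and r.selection_hash == selection_hash
        and is_completed(r)
    ]
    return max(candidates, key=lambda r: r.finished_at, default=None)
```

Note that a `skipped` run counts as completed, so it can become `latestRun` and push missing results to low confidence.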
### Selection Hash
- `selection_payload` includes only fields that influence run scope:
  - `policy_types[]`, `categories[]`, `include_foundations` (bool), `include_dependencies` (bool)
- `canonical_json(payload)` is a canonical JSON serialization with:
  - sorted object keys
  - sorted arrays for `policy_types` and `categories`
  - no whitespace / pretty formatting
- `selection_hash = sha256(canonical_json(selection_payload))`
- **AC:** Identical selection payload ⇒ identical selection_hash (independent of array ordering).
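
A minimal sketch of the hashing rules above, in Python for illustration only. The function names mirror the spec's pseudo-notation. Treating absent fields as empty arrays or `false`, encoding as UTF-8, and returning a lowercase hex digest are assumptions the spec does not state.

```python
import hashlib
import json


def canonical_json(payload):
    """Serialize only the scope-relevant fields in canonical form."""
    normalized = {
        # Arrays are sorted so that input ordering cannot change the hash.
        "policy_types": sorted(payload.get("policy_types", [])),
        "categories": sorted(payload.get("categories", [])),
        # Assumption: a missing flag is treated as False.
        "include_foundations": bool(payload.get("include_foundations", False)),
        "include_dependencies": bool(payload.get("include_dependencies", False)),
    }
    # sort_keys gives sorted object keys; compact separators remove whitespace.
    return json.dumps(normalized, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def selection_hash(payload):
    """sha256 over the canonical JSON, as a hex string (assumed encoding: UTF-8)."""
    return hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()
```

Because the payload is rebuilt from a fixed set of fields, any extra keys a caller passes in are ignored and cannot alter the hash.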
### Missing is derived (not persisted)
- **Definition:** Missing is a derived state computed at query/UI time relative to `latestRun(tenant_id, selection_hash)`.
- **AC:** Runs with different `selection_hash` do not affect missing computation for other selections.
- If `latestRun.status != success` or `latestRun.had_errors = true`, items not observed in that run are presented as `missing (low confidence)`.
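
The derivation above might look like the sketch below (Python, illustrative only). It assumes the caller has already restricted `items` to the tenant and to objects within the selection's scope, and that runs and items are plain dictionaries. The `"low"` and `"high"` labels and the function name `derive_missing` are hypothetical.

```python
def derive_missing(items, run):
    """Compute missing entries at query time; nothing is persisted.

    items: dicts with 'external_id' and 'last_seen_run_id', pre-filtered
           to the tenant and selection scope (assumption).
    run:   the latestRun dict with 'id', 'status', 'had_errors', or None.
    """
    if run is None:
        # No completed run for this selection: nothing can be derived.
        return []
    low_confidence = run["status"] != "success" or run["had_errors"]
    confidence = "low" if low_confidence else "high"
    return [
        {"external_id": item["external_id"], "state": "missing", "confidence": confidence}
        for item in items
        if item["last_seen_run_id"] != run["id"]
    ]
```

The result never contains `deleted`; an item that was simply not observed is reported as missing with a confidence label and nothing stronger.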
### Deleted is reserved
- `deleted` is reserved and MUST NOT be produced by this feature.
- Only a later lifecycle feature may set `deleted` with strict verification rules.
### Meta Whitelist (Fail-safe)
- `meta_jsonb` has a documented whitelist of allowed keys.
- **AC:** Unknown `meta_jsonb` keys are dropped (not persisted) and MUST NOT cause sync to fail.
#### Initial `meta_jsonb` whitelist (v1)
Allowed keys (all optional; if not applicable for a type, omit):
- `odata_type`: string (copied from Graph `@odata.type`)
- `etag`: string|null (Graph etag if available; never treated as a secret)
- `scope_tag_ids`: array<string> (IDs only; no display names required)
- `assignment_target_count`: int|null (count only; no target details)
- `warnings`: array<string> (bounded, human-readable, no secrets)
**AC:** Any other key is dropped silently (not persisted) and MUST NOT fail sync.
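
A minimal Python sketch of the fail-safe filter, for illustration. The key set is the v1 whitelist listed above. The function name `filter_meta` and the choice to return an empty mapping for non-dictionary input are assumptions.

```python
# v1 whitelist, copied from the list above.
ALLOWED_META_KEYS = frozenset({
    "odata_type",
    "etag",
    "scope_tag_ids",
    "assignment_target_count",
    "warnings",
})


def filter_meta(meta):
    """Keep whitelisted keys only; never raise on unknown or malformed input."""
    if not isinstance(meta, dict):
        # Assumption: malformed meta is treated as empty rather than an error.
        return {}
    return {key: value for key, value in meta.items() if key in ALLOWED_META_KEYS}
```

This sketch filters keys only; validating value types (for example, that `scope_tag_ids` is an array of strings) would be a separate step.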
### Observed Run
- `inventory_items.last_seen_run_id` and `inventory_items.last_seen_at` are updated when an item is observed.
- `last_seen_run_id` implies the selection via `sync_runs.selection_hash`; no per-item selection hash is required for core.
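
The idempotent upsert and the last-observed update can be sketched with an in-memory SQLite database (Python, illustrative only). The unique key `(tenant_id, policy_type, external_id)` and the `last_seen_*` columns come from this feature; the `display_name` column, the helper names, and the use of SQLite are assumptions, and the production schema may differ.

```python
import sqlite3


def open_catalog():
    """Create an in-memory catalog with the identity key as a unique constraint."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE inventory_items (
            tenant_id        TEXT NOT NULL,
            policy_type      TEXT NOT NULL,
            external_id      TEXT NOT NULL,
            display_name     TEXT,
            last_seen_at     TEXT NOT NULL,
            last_seen_run_id INTEGER NOT NULL,
            UNIQUE (tenant_id, policy_type, external_id)
        )
    """)
    return db


def upsert_item(db, tenant_id, policy_type, external_id, display_name, seen_at, run_id):
    """Insert the item, or update the existing row when the identity key matches."""
    db.execute(
        """
        INSERT INTO inventory_items
            (tenant_id, policy_type, external_id, display_name, last_seen_at, last_seen_run_id)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (tenant_id, policy_type, external_id) DO UPDATE SET
            display_name     = excluded.display_name,
            last_seen_at     = excluded.last_seen_at,
            last_seen_run_id = excluded.last_seen_run_id
        """,
        (tenant_id, policy_type, external_id, display_name, seen_at, run_id),
    )
```

Observing the same object in a later run rewrites `last_seen_at` and `last_seen_run_id` on the existing row, while the same `external_id` under a different tenant remains a separate row.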
### Run Error Codes (taxonomy)
Sync runs record:
- `status`: one of `success|partial|failed|skipped`
- `had_errors`: bool (true if any non-ideal condition occurred)
- `error_codes[]`: array of stable machine-readable codes (no secrets)
Minimal taxonomy (3–8 codes):
- `lock_contended` (a run could not start because the per-tenant+selection lock is held)
- `concurrency_limit_global` (global concurrency limit reached; run skipped)
- `concurrency_limit_tenant` (per-tenant concurrency limit reached; run skipped)
- `graph_throttled` (429 encountered; run partial/failed depending on recovery)
- `graph_transient` (503/timeout/other transient errors)
- `graph_forbidden` (403/insufficient permission)
- `unexpected_exception` (unexpected failure; message must be safe/redacted)
**Rule:** Run records MUST store codes (and safe, bounded context) rather than raw exception dumps or tokens.
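
One possible shape for the rule above, sketched in Python for illustration. The status-to-code mapping follows the taxonomy list; mapping every other failure to `unexpected_exception`, the bearer-token pattern, the 200-character bound, and both function names are assumptions.

```python
import re

# Matches "Bearer <token>" case-insensitively; the token alphabet is an assumption.
_BEARER = re.compile(r"(?i)bearer\s+[A-Za-z0-9\-._~+/]+=*")


def classify_graph_failure(status_code=None, timed_out=False):
    """Map a failed Graph call to a stable, machine-readable error code."""
    if status_code == 429:
        return "graph_throttled"
    if status_code == 403:
        return "graph_forbidden"
    if status_code == 503 or timed_out:
        return "graph_transient"
    # Assumption: anything else falls back to the generic code.
    return "unexpected_exception"


def safe_context(message, max_len=200):
    """Redact bearer tokens and bound the length before persisting context."""
    redacted = _BEARER.sub("Bearer [REDACTED]", message)
    return redacted[:max_len]
```

A run record would then store the code in `error_codes[]` and, at most, the output of `safe_context` as bounded context, never the raw exception text.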
### Concurrency Limits (source, defaults, behavior)
**Source:** Config (recommended keys):
- `tenantpilot.inventory_sync.concurrency.global_max`
- `tenantpilot.inventory_sync.concurrency.per_tenant_max`
**Defaults (if not configured):**
- global_max = 2
- per_tenant_max = 1
**Behavior when limits are hit:**
- The system MUST create a Sync Run record with:
  - `status = skipped`
  - `had_errors = true` (so missing stays low-confidence for that selection)
  - `error_codes[]` includes `concurrency_limit_global` or `concurrency_limit_tenant`
  - `started_at`/`finished_at` set (observable)
- No inventory items are mutated in a skipped run.
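
The limit check and the resulting skipped run can be sketched as follows (Python, illustrative only). The defaults match the values above. The function name `admit_or_skip`, the representation of running syncs as a list of tenant IDs, checking the global limit before the per-tenant limit, and setting `started_at` equal to `finished_at` are all assumptions.

```python
from datetime import datetime, timezone

# Defaults from this section, used when config does not override them.
DEFAULT_GLOBAL_MAX = 2
DEFAULT_PER_TENANT_MAX = 1


def admit_or_skip(tenant_id, selection_hash, running, config=None, now=None):
    """Return None if the run may start, else a skipped run record.

    running: list of tenant IDs, one entry per sync currently in progress.
    config:  optional dict with 'global_max' / 'per_tenant_max' overrides.
    """
    config = config or {}
    global_max = config.get("global_max", DEFAULT_GLOBAL_MAX)
    per_tenant_max = config.get("per_tenant_max", DEFAULT_PER_TENANT_MAX)
    now = now or datetime.now(timezone.utc)

    if len(running) >= global_max:
        code = "concurrency_limit_global"
    elif running.count(tenant_id) >= per_tenant_max:
        code = "concurrency_limit_tenant"
    else:
        return None  # admitted; the caller starts the real run

    # Skipped runs are still recorded so the condition is observable.
    return {
        "tenant_id": tenant_id,
        "selection_hash": selection_hash,
        "status": "skipped",
        "had_errors": True,
        "error_codes": [code],
        "started_at": now,
        "finished_at": now,
    }
```

The function only builds the run record; it touches no inventory items, which matches the rule that a skipped run mutates nothing.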
## Testing Guidance (non-implementation)
These are test cases expressed in behavior terms (not code).
### Test Cases — Sync and Upsert
- **TC-001**: Sync creates or updates inventory items and sets `last_seen_at`.
- **TC-002**: Re-running sync for the same tenant+selection updates existing records and does not create duplicates.
- **TC-003**: Inventory queries scoped to Tenant A never return Tenant B's items.
- **TC-004**: Inventory sync does not create or modify snapshot/backup records (e.g., no new rows in `policy_versions`, `backup_sets`, `backup_items`, `backup_schedules`, `backup_schedule_runs`).
### Test Cases — Selection Hash Determinism
- **TC-010**: Same selection payload with arrays in different order yields the same selection_hash.
- **TC-011**: Different selection payload yields a different selection_hash.
### Test Cases — Missing Semantics
- **TC-020**: Missing is derived relative to latest completed run for the same tenant+selection.
- **TC-021**: A run for selection Y does not affect missing computation for selection X.
- **TC-022**: If latestRun is partial/failed or had_errors, missing is shown as low confidence.
### Test Cases — Meta Whitelist
- **TC-030**: Unknown meta keys are not persisted and do not fail sync.
### Test Cases — Automation Safety
- **TC-040**: Concurrent sync triggers for the same tenant+selection do not result in overlapping runs (lock behavior).
- **TC-041**: A throttling event results in a visible, stable error code and a non-silent failure signal.
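
For orientation only, the lock behavior behind TC-040 could be modeled as a non-blocking lock keyed by tenant and selection. The Python sketch below is in-process and purely illustrative: `selection_lock` is a hypothetical helper, and a multi-worker deployment would need a shared lock (for example, in a cache or database) rather than `threading.Lock`.

```python
import threading
from contextlib import contextmanager

_registry_guard = threading.Lock()  # protects the lock registry itself
_locks = {}                         # (tenant_id, selection_hash) -> Lock


@contextmanager
def selection_lock(tenant_id, selection_hash):
    """Yield True if the lock was acquired, False if another run holds it.

    The attempt never blocks, so a contended trigger can record
    `lock_contended` immediately instead of queueing behind the first run.
    """
    key = (tenant_id, selection_hash)
    with _registry_guard:
        lock = _locks.setdefault(key, threading.Lock())
    acquired = lock.acquire(blocking=False)
    try:
        yield acquired
    finally:
        if acquired:
            lock.release()
```

A second trigger for the same tenant+selection sees `False` while the first run is active, whereas a trigger for a different selection of the same tenant is unaffected by this lock.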