TenantAtlas/specs/040-inventory-core/spec.md
ahmido 8ae7a7234e feat/040-inventory-core (#43)
Summary

Implements Inventory Core (Spec 040): a tenant-scoped, mutable “last observed” inventory catalog + sync run logging, with deterministic selection hashing and safe derived “missing” semantics.

This establishes the foundation for Inventory UI (041), Dependencies Graph (042), Compare/Promotion (043), and Drift (044).

What’s included
	•	DB schema
		•	inventory_items (unique: tenant_id + policy_type + external_id; indexes; last_seen_at, last_seen_run_id)
		•	inventory_sync_runs (tenant_id, selection_hash/payload, status, started/finished, counts, error_codes, correlation_id)
	•	Selection hashing
		•	Deterministic selection_hash via canonical JSON (sorted keys + sorted arrays) + sha256
	•	Sync semantics
		•	Idempotent upsert (no duplicates)
		•	Updates last_seen_* when observed
		•	Enforces tenant scoping for all reads/writes
		•	Guardrail: inventory sync does not create snapshots/backups
	•	Missing semantics (derived)
		•	“missing” computed relative to latest completed run for same (tenant_id, selection_hash)
		•	Low confidence when latest run is partial/failed or had_errors=true
		•	Selection isolation (runs for other selections don’t affect missing)
		•	deleted is reserved (not produced here)
	•	Safety
		•	meta_jsonb whitelist enforced (unknown keys dropped; never fail sync)
		•	Safe error persistence (no bearer tokens / secrets)
		•	Locking to prevent overlapping runs for same tenant+selection
		•	Concurrency limiter (global + per-tenant) and throttling resilience (429/503 backoff + jitter)

Tests

Added Pest coverage for:
	•	selection_hash determinism (array order invariant)
	•	upsert idempotency + last_seen updates
	•	missing derived semantics + selection isolation
	•	low confidence missing on partial/had_errors
	•	meta whitelist drop (no exception)
	•	lock prevents overlapping runs
	•	no snapshots/backups side effects
	•	safe error persistence (no bearer tokens)

Non-goals
	•	Inventory UI pages/resources (Spec 041)
	•	Dependency graph hydration (Spec 042)
	•	Cross-tenant compare/promotion flows (Spec 043)
	•	Drift analysis dashboards (Spec 044)

Review focus
	•	Data model correctness + indexes/constraints
	•	Selection hash canonicalization (determinism)
	•	Missing semantics (latest completed run + confidence rule)
	•	Guardrails (no snapshot/backups side effects)
	•	Safety: error_code taxonomy + safe persistence/logging

Co-authored-by: Ahmed Darrazi <ahmeddarrazi@adsmac.local>
Reviewed-on: #43
2026-01-07 14:54:24 +00:00


Feature Specification: Inventory Core (Sync + Catalog)

Feature Branch: feat/040-inventory-core
Created: 2026-01-07
Status: Draft

Overview

TenantPilot needs a reliable, tenant-scoped inventory catalog that represents what the system last observed in Microsoft Intune. This inventory is used as the primary substrate for analysis, reporting, monitoring, and UI visibility.

Key intent: Inventory is a “last observed” catalog (TenantPilot’s truth), not an absolute truth about Intune completeness.

Non-goal: A sync MUST NOT create snapshots or backups automatically.

User Scenarios & Testing (mandatory)

User Story 1 — Run Inventory Sync for a Tenant (Priority: P1)

A tenant admin (or scheduled automation) runs an inventory sync for a tenant to populate/update the inventory catalog.

Why this priority: Everything else depends on having a stable, queryable inventory catalog.

Independent Test: Run a sync for a tenant and verify inventory items are upserted, tenant-scoped, and last-observed fields update without producing snapshots/backups.

Acceptance Scenarios:

  1. Given a tenant and a configured selection of policy types/categories, When a sync completes, Then inventory items are upserted for each observed object with correct tenant_id, policy_type, external_id, and last_seen_at.
  2. Given an existing inventory item, When the same object is observed again, Then the existing record is updated (not duplicated) and last_seen_at and last_seen_run_id are updated.
  3. Given a sync selection that excludes some policy types/categories, When the sync completes, Then only objects within that selection are observed/updated.
  4. Given a successful sync, When the sync finishes, Then no policy snapshots/backups are created as a side effect.

User Story 2 — Observe Completeness/Confidence of a Sync (Priority: P1)

A tenant admin can see whether an item counts as missing only because the latest sync was partial or failed (low confidence) or because a clean run did not observe it (higher confidence).

Why this priority: Prevents misleading conclusions (e.g., “deleted”) when Graph errors or permissions issues occur.

Independent Test: Mark a run as partial/failed and verify missing items are presented as low confidence (derived at query/UI time) and do not imply deletion.

Acceptance Scenarios:

  1. Given a latestRun for a tenant+selection that has status != success or had_errors = true, When inventory is queried for missing items relative to that run, Then missing is presented as low confidence (and no stronger claim is made).
  2. Given a latestRun for a tenant+selection that is status = success and had_errors = false, When an item was not observed in that run, Then the UI can show it as “not seen in latest run” (higher confidence) without implying deletion.

User Story 3 — Monitor Sync Runs (Priority: P2)

A tenant admin (and platform admin) can see sync run history and quickly diagnose failures using stable error codes and counts.

Why this priority: Makes automation observable and supportable at MSP scale.

Independent Test: Create sync runs with different statuses and verify run records include counts and stable error codes.

Acceptance Scenarios:

  1. Given multiple sync runs, When a user views run history, Then each run shows status, started/finished timestamps, and counts (observed/updated/errors).
  2. Given a throttling event, When a sync run records it, Then the run captures a stable error code (e.g., “graph_throttled”) and does not fail silently.

Edge Cases

  • Sync is triggered twice for the same tenant+selection while the first is still running.
  • Sync completes with partial results due to transient Graph errors.
  • A tenant’s permissions change between runs, causing previously visible objects to become invisible.
  • Selection payload is equivalent but arrays are ordered differently.

Requirements (mandatory)

Functional Requirements

  • FR-001: System MUST maintain an Inventory Catalog that represents TenantPilot’s last observed state of Intune objects.
  • FR-002: System MUST upsert inventory items by a stable identity key that prevents duplicates.
  • FR-003: System MUST record Sync Runs with status, timestamps, counts, and stable error codes.
  • FR-004: System MUST ensure tenant isolation for all inventory and run queries.
  • FR-005: System MUST support deterministic selection scoping via selection_hash for sync runs.
  • FR-006: System MUST NOT create snapshots/backups during inventory sync (sync is not backup).
  • FR-007: System MUST derive “missing” as a computed state relative to the latest completed run for the same tenant+selection.
  • FR-008: System MUST enforce meta_jsonb key whitelisting by dropping unknown keys without failing the sync.
  • FR-009: System MUST implement safe automation behavior: locking, idempotency, and observable failures.

Non-Functional Requirements

  • NFR-001 (Concurrency limits): Sync automation MUST enforce two limits: a global concurrency limit (across tenants) and a per-tenant concurrency limit.
  • NFR-002 (Throttling resilience): Sync MUST handle throttling/transient failures (e.g., 429/503) using backoff + jitter.
  • NFR-003 (Deterministic behavior): Selection hashing and capability derivation MUST be deterministic and testable.
  • NFR-004 (Data minimization): Inventory MUST store metadata and whitelisted meta only; payload-heavy content belongs to snapshots/backups.
  • NFR-005 (Safe logging): Logs MUST not contain secrets/tokens; monitoring MUST rely on run records + error codes.
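
As an illustration of NFR-002, the following is a minimal Python sketch of exponential backoff with full jitter. It is not the project's implementation: the function names, the `base` and `cap` defaults, and the choice to honor a server-supplied `Retry-After` value as a lower bound are all assumptions made for this example.

```python
import random
from typing import Optional

# Status codes the spec names as throttling/transient examples.
RETRYABLE_STATUSES = {429, 503}


def is_retryable(status_code: int) -> bool:
    """True if the response status should trigger a retry with backoff."""
    return status_code in RETRYABLE_STATUSES


def backoff_delay(
    attempt: int,
    base: float = 1.0,          # hypothetical default, in seconds
    cap: float = 60.0,          # hypothetical upper bound, in seconds
    retry_after: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles per attempt up to `cap`; the actual delay is drawn
    uniformly from [0, ceiling] ("full jitter") so that parallel workers
    do not retry in lockstep.
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    delay = source.uniform(0.0, ceiling)
    if retry_after is not None:
        # Assumption: never retry earlier than the server asked us to.
        delay = max(delay, retry_after)
    return delay
```

Passing a seeded `random.Random` as `rng` keeps the delay sequence reproducible in tests, which fits the determinism goal in NFR-003.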

Key Entities (include if feature involves data)

  • Inventory Item: A tenant-scoped record representing a single Intune object as last observed (type, external identity, display name/metadata, last observed fields, whitelisted meta).
  • Sync Run: A tenant-scoped record representing an inventory sync execution for a specific selection (selection_hash, status, timestamps, counts, stable error codes).
  • Selection Payload: The normalized representation of the run scope used to compute selection_hash.

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: For a given tenant, inventory sync can be executed repeatedly without creating duplicate inventory items.
  • SC-002: A sync run always produces a run record with status, timestamps, and counts.
  • SC-003: Missing is computed relative to latest completed run for the same tenant+selection; runs with different selection hashes do not affect each other.
  • SC-004: Unknown meta keys never break sync and are not persisted.
  • SC-005: Operators can distinguish “not seen” from “deleted” (deleted is reserved and not produced in this feature).

Spec Appendix: Deterministic Selection + Missing Semantics (copy/paste-ready)

Definition: “completed” and “latestRun”

  • Definition: completed means status ∈ {success, partial, failed, skipped} and finished_at != null (or the equivalent field used by the run model).
  • Definition: latestRun is the latest completed Sync Run for (tenant_id, selection_hash).

Selection Hash

  • selection_payload includes only fields that influence run scope:
    • policy_types[], categories[], include_foundations (bool), include_dependencies (bool)
  • canonical_json(payload) is a canonical JSON serialization with:
    • sorted object keys
    • sorted arrays for policy_types and categories
    • no whitespace / pretty formatting
  • selection_hash = sha256(canonical_json(selection_payload))
  • AC: Identical selection payload ⇒ identical selection_hash (independent of array ordering).
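
The hashing rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names mirror the spec's notation, but the defaults applied to absent fields (empty lists, `False`) are assumptions, and the output is not guaranteed to be byte-identical to what the project's own serializer produces.

```python
import hashlib
import json


def canonical_json(payload: dict) -> str:
    """Serialize only the scope-relevant fields in a canonical form."""
    normalized = {
        # Arrays are sorted so caller-side ordering cannot change the hash.
        "policy_types": sorted(payload.get("policy_types", [])),
        "categories": sorted(payload.get("categories", [])),
        # Assumption: absent booleans are treated as False.
        "include_foundations": bool(payload.get("include_foundations", False)),
        "include_dependencies": bool(payload.get("include_dependencies", False)),
    }
    # sort_keys gives sorted object keys; compact separators remove whitespace.
    return json.dumps(
        normalized, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


def selection_hash(payload: dict) -> str:
    """sha256 over the canonical JSON, as a lowercase hex digest."""
    return hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()
```

Because the payload is rebuilt from the four known fields, any extra key a caller passes is ignored and cannot influence the hash.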

Missing is derived (not persisted)

  • Definition: Missing is a derived state computed at query/UI time relative to latestRun(tenant_id, selection_hash).
  • AC: Runs with different selection_hash do not affect missing computation for other selections.
  • If latestRun.status != success or latestRun.had_errors = true, items not observed in that run are presented as missing (low confidence).
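
A minimal Python sketch of the derivation, using in-memory records instead of database queries. The dataclasses, the function names, and the two state labels (`missing_low_confidence`, `not_seen_in_latest_run`) are hypothetical. The sketch also assumes the caller passes only items that belong to the selection being evaluated.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional

COMPLETED_STATUSES = {"success", "partial", "failed", "skipped"}


@dataclass
class SyncRun:
    id: int
    tenant_id: str
    selection_hash: str
    status: str
    had_errors: bool
    finished_at: Optional[datetime]


@dataclass
class InventoryItem:
    tenant_id: str
    external_id: str
    last_seen_run_id: Optional[int]


def latest_run(
    runs: Iterable[SyncRun], tenant_id: str, selection_hash: str
) -> Optional[SyncRun]:
    """Latest completed run for exactly this (tenant_id, selection_hash)."""
    candidates = [
        r for r in runs
        if r.tenant_id == tenant_id
        and r.selection_hash == selection_hash
        and r.status in COMPLETED_STATUSES
        and r.finished_at is not None
    ]
    return max(candidates, key=lambda r: r.finished_at, default=None)


def missing_state(item: InventoryItem, run: Optional[SyncRun]) -> Optional[str]:
    """Derive the missing state at query time; nothing is persisted."""
    if run is None or item.last_seen_run_id == run.id:
        return None  # no baseline yet, or the item was observed in latestRun
    if run.status != "success" or run.had_errors:
        return "missing_low_confidence"
    return "not_seen_in_latest_run"  # higher confidence, still not "deleted"
```

Filtering by `selection_hash` inside `latest_run` is what gives selection isolation: a later run for another selection is never chosen as the baseline.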

Deleted is reserved

  • deleted is reserved and MUST NOT be produced by this feature.
  • Only a later lifecycle feature may set deleted with strict verification rules.

Meta Whitelist (Fail-safe)

  • meta_jsonb has a documented whitelist of allowed keys.
  • AC: Unknown meta_jsonb keys are dropped (not persisted) and MUST NOT cause sync to fail.

Initial meta_jsonb whitelist (v1)

Allowed keys (all optional; if not applicable for a type, omit):

  • odata_type: string (copied from Graph @odata.type)
  • etag: string|null (Graph etag if available; never treated as a secret)
  • scope_tag_ids: array (IDs only; no display names required)
  • assignment_target_count: int|null (count only; no target details)
  • warnings: array (bounded, human-readable, no secrets)

AC: Any other key is dropped silently (not persisted) and MUST NOT fail sync.
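
The fail-safe behavior can be sketched as a pure filter. This Python example is illustrative only: `filter_meta` and `ALLOWED_META_KEYS` are hypothetical names, and the sketch filters keys without validating value types, which a real implementation may also want to do.

```python
ALLOWED_META_KEYS = frozenset({
    "odata_type",
    "etag",
    "scope_tag_ids",
    "assignment_target_count",
    "warnings",
})


def filter_meta(raw) -> dict:
    """Keep only whitelisted keys; never raise on unexpected input."""
    if not isinstance(raw, dict):
        # Assumption: malformed meta is treated as empty rather than an error.
        return {}
    return {key: value for key, value in raw.items() if key in ALLOWED_META_KEYS}
```

For example, `filter_meta({"odata_type": "#microsoft.graph.x", "secret": "s"})` keeps `odata_type` and drops `secret` without raising.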

Observed Run

  • inventory_items.last_seen_run_id and inventory_items.last_seen_at are updated when an item is observed.
  • last_seen_run_id implies the selection via sync_runs.selection_hash; no per-item selection hash is required for core.
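
To illustrate the idempotent upsert behind these fields, here is a small in-memory Python sketch keyed by (tenant_id, policy_type, external_id). It is an assumption-laden stand-in: a real implementation would presumably use a database upsert backed by the unique constraint, and the `upsert_item` name and dict-based catalog are hypothetical.

```python
from datetime import datetime
from typing import Optional


def upsert_item(
    catalog: dict,
    tenant_id: str,
    policy_type: str,
    external_id: str,
    run_id: int,
    observed_at: datetime,
    display_name: Optional[str] = None,
) -> dict:
    """Create or update the single record for this identity key."""
    key = (tenant_id, policy_type, external_id)
    item = catalog.get(key)
    if item is None:
        item = {
            "tenant_id": tenant_id,
            "policy_type": policy_type,
            "external_id": external_id,
        }
        catalog[key] = item
    # Observed again: refresh mutable fields and the last-observed markers.
    item["display_name"] = display_name
    item["last_seen_at"] = observed_at
    item["last_seen_run_id"] = run_id
    return item
```

Observing the same object in two runs leaves one record whose `last_seen_run_id` points at the later run, and the tenant_id in the key keeps tenants from colliding.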

Run Error Codes (taxonomy)

Sync runs record:

  • status: one of success|partial|failed|skipped
  • had_errors: bool (true if any non-ideal condition occurred)
  • error_codes[]: array of stable machine-readable codes (no secrets)

Minimal taxonomy (7 codes):

  • lock_contended (a run could not start because the per-tenant+selection lock is held)
  • concurrency_limit_global (global concurrency limit reached; run skipped)
  • concurrency_limit_tenant (per-tenant concurrency limit reached; run skipped)
  • graph_throttled (429 encountered; run partial/failed depending on recovery)
  • graph_transient (503/timeout/other transient errors)
  • graph_forbidden (403/insufficient permission)
  • unexpected_exception (unexpected failure; message must be safe/redacted)

Rule: Run records MUST store codes (and safe, bounded context) rather than raw exception dumps or tokens.
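
One way to follow this rule is to map failures onto the taxonomy and redact anything token-shaped before it is stored. The Python sketch below is illustrative: the function names, the 200-character bound, and the redaction pattern are assumptions, as is the fallback of mapping every unlisted HTTP status to `unexpected_exception`.

```python
import re

ERROR_CODES = frozenset({
    "lock_contended",
    "concurrency_limit_global",
    "concurrency_limit_tenant",
    "graph_throttled",
    "graph_transient",
    "graph_forbidden",
    "unexpected_exception",
})

# Matches "Bearer <token>" case-insensitively; the token alphabet is an assumption.
_BEARER = re.compile(r"(?i)\bbearer\s+[\w.~+/-]+=*")
MAX_CONTEXT_LEN = 200  # hypothetical bound on persisted context


def code_for_status(status_code: int) -> str:
    """Map an HTTP status from Graph onto a stable error code."""
    if status_code == 429:
        return "graph_throttled"
    if status_code == 403:
        return "graph_forbidden"
    if status_code == 503:
        return "graph_transient"
    return "unexpected_exception"


def safe_context(message: str) -> str:
    """Redact bearer tokens, then truncate to a bounded length."""
    redacted = _BEARER.sub("Bearer [REDACTED]", message)
    return redacted[:MAX_CONTEXT_LEN]
```

Redaction runs before truncation so that a token straddling the length limit cannot survive as a partial value.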

Concurrency Limits (source, defaults, behavior)

Source: Config (recommended keys):

  • tenantpilot.inventory_sync.concurrency.global_max
  • tenantpilot.inventory_sync.concurrency.per_tenant_max

Defaults (if not configured):

  • global_max = 2
  • per_tenant_max = 1

Behavior when limits are hit:

  • The system MUST create a Sync Run record with:
    • status = skipped
    • had_errors = true (so missing stays low-confidence for that selection)
    • error_codes[] includes concurrency_limit_global or concurrency_limit_tenant
    • started_at/finished_at set (observable)
  • No inventory items are mutated in a skipped run.
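
The limit check and the resulting skipped run can be sketched as follows. This Python example is a sketch under stated assumptions: the function names, the list-of-tenant-ids representation of running syncs, and the order of checks (global before per-tenant) are all hypothetical; only the defaults and the record fields come from the rules above.

```python
from datetime import datetime, timezone
from typing import List, Optional

DEFAULT_GLOBAL_MAX = 2
DEFAULT_PER_TENANT_MAX = 1


def limit_violation(
    running_tenant_ids: List[str],
    tenant_id: str,
    global_max: int = DEFAULT_GLOBAL_MAX,
    per_tenant_max: int = DEFAULT_PER_TENANT_MAX,
) -> Optional[str]:
    """Return the error code for the limit that blocks a new run, if any."""
    if len(running_tenant_ids) >= global_max:
        return "concurrency_limit_global"
    if running_tenant_ids.count(tenant_id) >= per_tenant_max:
        return "concurrency_limit_tenant"
    return None


def skipped_run(
    tenant_id: str,
    selection_hash: str,
    error_code: str,
    now: Optional[datetime] = None,
) -> dict:
    """Build the observable run record for a run that never started work."""
    moment = now if now is not None else datetime.now(timezone.utc)
    return {
        "tenant_id": tenant_id,
        "selection_hash": selection_hash,
        "status": "skipped",
        "had_errors": True,   # keeps missing low-confidence for this selection
        "error_codes": [error_code],
        "started_at": moment,
        "finished_at": moment,
    }
```

The sketch only builds the run record; it never touches inventory items, matching the rule that a skipped run mutates nothing.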

Testing Guidance (non-implementation)

These are test cases expressed in behavior terms (not code).

Test Cases — Sync and Upsert

  • TC-001: Sync creates or updates inventory items and sets last_seen_at.
  • TC-002: Re-running sync for the same tenant+selection updates existing records and does not create duplicates.
  • TC-003: Inventory queries scoped to Tenant A never return Tenant B’s items.
  • TC-004: Inventory sync does not create or modify snapshot/backup records (e.g., no new rows in policy_versions, backup_sets, backup_items, backup_schedules, backup_schedule_runs).

Test Cases — Selection Hash Determinism

  • TC-010: Same selection payload with arrays in different order yields the same selection_hash.
  • TC-011: Different selection payload yields a different selection_hash.

Test Cases — Missing Semantics

  • TC-020: Missing is derived relative to latest completed run for the same tenant+selection.
  • TC-021: A run for selection Y does not affect missing computation for selection X.
  • TC-022: If latestRun is partial/failed or had_errors, missing is shown as low confidence.

Test Cases — Meta Whitelist

  • TC-030: Unknown meta keys are not persisted and do not fail sync.

Test Cases — Automation Safety

  • TC-040: Concurrent sync triggers for the same tenant+selection do not result in overlapping runs (lock behavior).
  • TC-041: A throttling event results in a visible, stable error code and a non-silent failure signal.