# Feature Specification: Inventory Core (Sync + Catalog)
**Feature Branch**: `feat/040-inventory-core`
**Created**: 2026-01-07
**Status**: Draft
## Overview
TenantPilot needs a reliable, tenant-scoped inventory catalog that represents what the system last observed in Microsoft Intune. This inventory is used as the primary substrate for analysis, reporting, monitoring, and UI visibility.
**Key intent:** Inventory is a “last observed” catalog (TenantPilot's truth), not an absolute truth about Intune completeness.
**Non-goal:** A sync MUST NOT create snapshots or backups automatically.
## User Scenarios & Testing *(mandatory)*
### User Story 1 — Run Inventory Sync for a Tenant (Priority: P1)
A tenant admin (or scheduled automation) runs an inventory sync for a tenant to populate/update the inventory catalog.
**Why this priority**: Everything else depends on having a stable, queryable inventory catalog.
**Independent Test**: Run a sync for a tenant and verify inventory items are upserted, tenant-scoped, and last-observed fields update without producing snapshots/backups.
**Acceptance Scenarios**:
1. **Given** a tenant and a configured selection of policy types/categories, **When** a sync completes, **Then** inventory items are upserted for each observed object with correct `tenant_id`, `policy_type`, `external_id`, and `last_seen_at`.
2. **Given** an existing inventory item, **When** the same object is observed again, **Then** the existing record is updated (not duplicated) and `last_seen_at` and `last_seen_run_id` are updated.
3. **Given** a sync selection that excludes some policy types/categories, **When** the sync completes, **Then** only objects within that selection are observed/updated.
4. **Given** a successful sync, **When** the sync finishes, **Then** no policy snapshots/backups are created as a side effect.
---
### User Story 2 — Observe Completeness/Confidence of a Sync (Priority: P1)
A tenant admin can tell whether an item is missing only because the latest sync was partial or failed (low confidence), or because it was not observed in a clean run (higher confidence).
**Why this priority**: Prevents misleading conclusions (e.g., “deleted”) when Graph errors or permissions issues occur.
**Independent Test**: Mark a run as partial/failed and verify missing items are presented as low confidence (derived at query/UI time) and do not imply deletion.
**Acceptance Scenarios**:
1. **Given** a `latestRun` for a tenant+selection that has `status != success` or `had_errors = true`, **When** inventory is queried for missing items relative to that run, **Then** missing is presented as low confidence (and no stronger claim is made).
2. **Given** a `latestRun` for a tenant+selection that is `status = success` and `had_errors = false`, **When** an item was not observed in that run, **Then** the UI can show it as “not seen in latest run” (higher confidence) without implying deletion.
---
### User Story 3 — Monitor Sync Runs (Priority: P2)
A tenant admin (and platform admin) can see sync run history and quickly diagnose failures using stable error codes and counts.
**Why this priority**: Makes automation observable and supportable at MSP scale.
**Independent Test**: Create sync runs with different statuses and verify run records include counts and stable error codes.
**Acceptance Scenarios**:
1. **Given** multiple sync runs, **When** a user views run history, **Then** each run shows status, started/finished timestamps, and counts (observed/updated/errors).
2. **Given** a throttling event, **When** a sync run records it, **Then** the run captures a stable error code (e.g., `graph_throttled`) and does not fail silently.
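
A minimal sketch of how a run record might accumulate counts and stable error codes. Only `graph_throttled` comes from this spec; the other codes, the `RunCounts` class, and the HTTP-status mapping are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

# Only "graph_throttled" is named in the spec; the other codes are assumptions.
ERROR_CODES = {
    429: "graph_throttled",
    403: "graph_forbidden",
    503: "graph_unavailable",
}
FALLBACK_CODE = "graph_unknown_error"  # hypothetical catch-all


def error_code_for_status(http_status: int) -> str:
    """Map an HTTP status to a stable, log-safe error code."""
    return ERROR_CODES.get(http_status, FALLBACK_CODE)


@dataclass
class RunCounts:
    """Counters a sync run could persist on its run record."""
    observed: int = 0
    updated: int = 0
    errors: int = 0
    error_codes: Counter = field(default_factory=Counter)

    def record_error(self, http_status: int) -> str:
        """Count the failure and remember its stable code; never swallow it."""
        code = error_code_for_status(http_status)
        self.errors += 1
        self.error_codes[code] += 1
        return code

    @property
    def had_errors(self) -> bool:
        return self.errors > 0
```

Keeping the codes in one table means run history can be grouped and alerted on without parsing free-form error messages.
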
---
### Edge Cases
- Sync is triggered twice for the same tenant+selection while the first is still running.
- Sync completes with partial results due to transient Graph errors.
- A tenant's permissions change between runs causing objects to be invisible.
- Selection payload is equivalent but arrays are ordered differently.
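
The first edge case (a second trigger while a run is still active) could be handled with a lock keyed on tenant and selection. The following is an in-process sketch only; `RunLocks` is a hypothetical name, and a real deployment would presumably need a lock shared across workers (for example in the database or cache), which this spec does not prescribe.

```python
import threading


class RunLocks:
    """Tracks which (tenant_id, selection_hash) pairs have a sync in flight."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._active: set[tuple[str, str]] = set()

    def try_acquire(self, tenant_id: str, selection_hash: str) -> bool:
        """Return True if the caller may start a run, False if one is active."""
        key = (tenant_id, selection_hash)
        with self._guard:
            if key in self._active:
                return False
            self._active.add(key)
            return True

    def release(self, tenant_id: str, selection_hash: str) -> None:
        """Mark the run as finished so a later trigger can proceed."""
        with self._guard:
            self._active.discard((tenant_id, selection_hash))
```

A trigger that fails `try_acquire` could, for instance, be recorded as a `skipped` run so that the rejection stays observable; that handling is an assumption, not something the spec mandates.
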
## Requirements *(mandatory)*
### Functional Requirements
- **FR-001**: System MUST maintain an Inventory Catalog that represents TenantPilot's last observed state of Intune objects.
- **FR-002**: System MUST upsert inventory items by a stable identity key that prevents duplicates.
- **FR-003**: System MUST record Sync Runs with status, timestamps, counts, and stable error codes.
- **FR-004**: System MUST ensure tenant isolation for all inventory and run queries.
- **FR-005**: System MUST support deterministic selection scoping via `selection_hash` for sync runs.
- **FR-006**: System MUST NOT create snapshots/backups during inventory sync (sync is not backup).
- **FR-007**: System MUST derive “missing” as a computed state relative to the latest completed run for the same tenant+selection.
- **FR-008**: System MUST enforce `meta_jsonb` key whitelisting by dropping unknown keys without failing the sync.
- **FR-009**: System MUST implement safe automation behavior: locking, idempotency, and observable failures.
### Non-Functional Requirements
- **NFR-001 (Concurrency limits)**: Sync automation MUST enforce two limits: a global concurrency limit (across tenants) and a per-tenant concurrency limit.
- **NFR-002 (Throttling resilience)**: Sync MUST handle throttling/transient failures (e.g., 429/503) using backoff + jitter.
- **NFR-003 (Deterministic behavior)**: Selection hashing and capability derivation MUST be deterministic and testable.
- **NFR-004 (Data minimization)**: Inventory MUST store metadata and whitelisted meta only; payload-heavy content belongs to snapshots/backups.
- **NFR-005 (Safe logging)**: Logs MUST not contain secrets/tokens; monitoring MUST rely on run records + error codes.
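
To illustrate NFR-002, here is a sketch of exponential backoff with full jitter. The function names, the base and cap values, and the choice to honor a server-provided `Retry-After` value verbatim are all assumptions for illustration.

```python
import random
from typing import Optional

RETRYABLE_STATUSES = {429, 503}  # the examples named in NFR-002


def is_retryable(http_status: int) -> bool:
    return http_status in RETRYABLE_STATUSES


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    if rng is None:
        rng = random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def next_delay(attempt: int, retry_after: Optional[float] = None,
               rng: Optional[random.Random] = None) -> float:
    """Prefer the server's Retry-After hint when present, else jittered backoff."""
    if retry_after is not None:
        return float(retry_after)
    return backoff_delay(attempt, rng=rng)
```

Passing a seeded `random.Random` as `rng` keeps the delay sequence reproducible in tests, which fits the determinism goal in NFR-003.
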
### Key Entities *(include if feature involves data)*
- **Inventory Item**: A tenant-scoped record representing a single Intune object as last observed (type, external identity, display name/metadata, last observed fields, whitelisted meta).
- **Sync Run**: A tenant-scoped record representing an inventory sync execution for a specific selection (selection_hash, status, timestamps, counts, stable error codes).
- **Selection Payload**: The normalized representation of the run scope used to compute selection_hash.
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: For a given tenant, inventory sync can be executed repeatedly without creating duplicate inventory items.
- **SC-002**: A sync run always produces a run record with status, timestamps, and counts.
- **SC-003**: Missing is computed relative to latest completed run for the same tenant+selection; runs with different selection hashes do not affect each other.
- **SC-004**: Unknown meta keys never break sync and are not persisted.
- **SC-005**: Operators can distinguish “not seen” from “deleted” (deleted is reserved and not produced in this feature).
## Spec Appendix: Deterministic Selection + Missing Semantics (copy/paste-ready)
### Definition: “completed” and “latestRun”
- **Definition:** `completed` means `status ∈ {success, partial, failed, skipped}` and `finished_at != null` (or the equivalent field used by the run model).
- **Definition:** `latestRun` is the latest completed Sync Run for `(tenant_id, selection_hash)`.
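
The two definitions above can be sketched as follows. The `SyncRun` dataclass is a hypothetical stand-in for the run model, and ordering "latest" by `finished_at` is an assumption; the spec does not say which timestamp decides recency.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional

COMPLETED_STATUSES = {"success", "partial", "failed", "skipped"}


@dataclass(frozen=True)
class SyncRun:
    id: int
    tenant_id: str
    selection_hash: str
    status: str
    finished_at: Optional[datetime]
    had_errors: bool = False


def is_completed(run: SyncRun) -> bool:
    """A run is completed when it has a terminal status and a finish time."""
    return run.status in COMPLETED_STATUSES and run.finished_at is not None


def latest_run(runs: Iterable[SyncRun], tenant_id: str,
               selection_hash: str) -> Optional[SyncRun]:
    """Latest completed run for (tenant_id, selection_hash), or None."""
    candidates = [
        r for r in runs
        if r.tenant_id == tenant_id
        and r.selection_hash == selection_hash
        and is_completed(r)
    ]
    # Assumption: "latest" means the greatest finished_at.
    return max(candidates, key=lambda r: r.finished_at, default=None)
```
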
### Selection Hash
- `selection_payload` includes only fields that influence run scope:
  - `policy_types[]`, `categories[]`, `include_foundations` (bool), `include_dependencies` (bool)
- `canonical_json(payload)` is a canonical JSON serialization with:
  - sorted object keys
  - sorted arrays for `policy_types` and `categories`
  - no whitespace / pretty formatting
- `selection_hash = sha256(canonical_json(selection_payload))`
- **AC:** Identical selection payload ⇒ identical selection_hash (independent of array ordering).
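
The hashing rules above can be sketched with the Python standard library. `build_selection_payload` is a hypothetical helper; the field names come from this appendix, while the hex-digest output format and UTF-8 encoding are assumptions.

```python
import hashlib
import json
from typing import Iterable


def build_selection_payload(policy_types: Iterable[str], categories: Iterable[str],
                            include_foundations: bool,
                            include_dependencies: bool) -> dict:
    """Keep only scope-relevant fields and sort the order-insensitive arrays."""
    return {
        "policy_types": sorted(policy_types),
        "categories": sorted(categories),
        "include_foundations": bool(include_foundations),
        "include_dependencies": bool(include_dependencies),
    }


def canonical_json(payload: dict) -> str:
    """Sorted object keys and no whitespace between tokens."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def selection_hash(payload: dict) -> str:
    """SHA-256 over the canonical JSON, returned as a hex string."""
    return hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()
```

With this sketch, `build_selection_payload(["b", "a"], ["y", "x"], True, False)` and `build_selection_payload(["a", "b"], ["x", "y"], True, False)` serialize to the same string and therefore share one `selection_hash`.
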
### Missing is derived (not persisted)
- **Definition:** Missing is a derived state computed at query/UI time relative to `latestRun(tenant_id, selection_hash)`.
- **AC:** Runs with different `selection_hash` do not affect missing computation for other selections.
- If `latestRun.status != success` or `latestRun.had_errors = true`, items not observed in that run are presented as `missing (low confidence)`.
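
A sketch of deriving the missing state at query time. It assumes the caller has already resolved `latestRun` for the item's tenant and selection and that the item belongs to that selection's scope; `LatestRun` and `missing_state` are hypothetical names, and the two labels are taken from this spec's wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LatestRun:
    id: int
    status: str
    had_errors: bool


def missing_state(item_last_seen_run_id: Optional[int],
                  latest_run: LatestRun) -> Optional[str]:
    """Return None when the item was observed in latest_run, else a label.

    Nothing is persisted here: the result is computed on every query.
    """
    if item_last_seen_run_id == latest_run.id:
        return None  # observed in the latest run, so not missing
    if latest_run.status != "success" or latest_run.had_errors:
        return "missing (low confidence)"
    return "not seen in latest run"
```

Note that neither branch returns `deleted`; that state stays reserved for a later lifecycle feature.
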
### Deleted is reserved
- `deleted` is reserved and MUST NOT be produced by this feature.
- Only a later lifecycle feature may set `deleted` with strict verification rules.
### Meta Whitelist (Fail-safe)
- `meta_jsonb` has a documented whitelist of allowed keys.
- **AC:** Unknown `meta_jsonb` keys are dropped (not persisted) and MUST NOT cause sync to fail.
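
A fail-safe filter along these lines might look as follows. The whitelist contents (`platform`, `odata_type`) are placeholders, since the spec does not list the allowed keys, and returning the dropped key names for diagnostics is an assumption.

```python
from typing import Any, Dict, FrozenSet, List, Tuple

# Placeholder keys: the real whitelist is documented elsewhere.
ALLOWED_META_KEYS: FrozenSet[str] = frozenset({"platform", "odata_type"})


def filter_meta(meta: Dict[str, Any],
                allowed: FrozenSet[str] = ALLOWED_META_KEYS
                ) -> Tuple[Dict[str, Any], List[str]]:
    """Split meta into (kept, dropped_key_names) without ever raising.

    Unknown keys are silently excluded from what gets persisted; their names
    (not their values) are returned so a caller can count or log them safely.
    """
    kept = {key: value for key, value in meta.items() if key in allowed}
    dropped = sorted(key for key in meta if key not in allowed)
    return kept, dropped
```
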
### Observed Run
- `inventory_items.last_seen_run_id` and `inventory_items.last_seen_at` are updated when an item is observed.
- `last_seen_run_id` implies the selection via `sync_runs.selection_hash`; no per-item selection hash is required for core.
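
The observed-run update can be sketched as a single upsert using SQLite from the Python standard library. The table layout and the identity key `(tenant_id, policy_type, external_id)` are assumptions based on the fields named in User Story 1; the spec itself only requires "a stable identity key".

```python
import sqlite3

SCHEMA = """
CREATE TABLE inventory_items (
    tenant_id        TEXT NOT NULL,
    policy_type      TEXT NOT NULL,
    external_id      TEXT NOT NULL,
    display_name     TEXT,
    last_seen_at     TEXT NOT NULL,
    last_seen_run_id INTEGER NOT NULL,
    UNIQUE (tenant_id, policy_type, external_id)  -- assumed identity key
)
"""

UPSERT = """
INSERT INTO inventory_items
    (tenant_id, policy_type, external_id, display_name, last_seen_at, last_seen_run_id)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (tenant_id, policy_type, external_id) DO UPDATE SET
    display_name     = excluded.display_name,
    last_seen_at     = excluded.last_seen_at,
    last_seen_run_id = excluded.last_seen_run_id
"""


def open_db() -> sqlite3.Connection:
    """In-memory database with the sketch schema, for experimentation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def observe(conn: sqlite3.Connection, run_id: int, tenant_id: str, policy_type: str,
            external_id: str, display_name: str, seen_at: str) -> None:
    """Insert the item or refresh its last-observed fields; never duplicates."""
    conn.execute(UPSERT, (tenant_id, policy_type, external_id,
                          display_name, seen_at, run_id))
    conn.commit()
```

Calling `observe` twice for the same identity key leaves one row whose `last_seen_run_id` points at the second run.
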
## Testing Guidance (non-implementation)
These are test cases expressed in behavior terms (not code).
### Test Cases — Sync and Upsert
- **TC-001**: Sync creates or updates inventory items and sets `last_seen_at`.
- **TC-002**: Re-running sync for the same tenant+selection updates existing records and does not create duplicates.
- **TC-003**: Inventory queries scoped to Tenant A never return Tenant B's items.
### Test Cases — Selection Hash Determinism
- **TC-010**: Same selection payload with arrays in different order yields the same selection_hash.
- **TC-011**: Different selection payload yields a different selection_hash.
### Test Cases — Missing Semantics
- **TC-020**: Missing is derived relative to latest completed run for the same tenant+selection.
- **TC-021**: A run for selection Y does not affect missing computation for selection X.
- **TC-022**: If latestRun is partial/failed or had_errors, missing is shown as low confidence.
### Test Cases — Meta Whitelist
- **TC-030**: Unknown meta keys are not persisted and do not fail sync.
### Test Cases — Automation Safety
- **TC-040**: Concurrent sync triggers for the same tenant+selection do not result in overlapping runs (lock behavior).
- **TC-041**: A throttling event results in a visible, stable error code and a non-silent failure signal.