# Feature Specification: Inventory Core (Sync + Catalog)

**Feature Branch**: `feat/040-inventory-core`  
**Created**: 2026-01-07  
**Status**: Draft

## Overview

TenantPilot needs a reliable, tenant-scoped inventory catalog that represents what the system last observed in Microsoft Intune. This inventory is used as the primary substrate for analysis, reporting, monitoring, and UI visibility.

**Key intent:** Inventory is a “last observed” catalog (TenantPilot’s truth), not an absolute truth about Intune completeness.

**Non-goal:** A sync MUST NOT create snapshots or backups automatically.

## User Scenarios & Testing *(mandatory)*

### User Story 1 — Run Inventory Sync for a Tenant (Priority: P1)

A tenant admin (or scheduled automation) runs an inventory sync for a tenant to populate/update the inventory catalog.

**Why this priority**: Everything else depends on having a stable, queryable inventory catalog.

**Independent Test**: Run a sync for a tenant and verify inventory items are upserted, tenant-scoped, and last-observed fields update without producing snapshots/backups.

**Acceptance Scenarios**:

1. **Given** a tenant and a configured selection of policy types/categories, **When** a sync completes, **Then** inventory items are upserted for each observed object with correct `tenant_id`, `policy_type`, `external_id`, and `last_seen_at`.
2. **Given** an existing inventory item, **When** the same object is observed again, **Then** the existing record is updated (not duplicated) and `last_seen_at` and `last_seen_run_id` are updated.
3. **Given** a sync selection that excludes some policy types/categories, **When** the sync completes, **Then** only objects within that selection are observed/updated.
4. **Given** a successful sync, **When** the sync finishes, **Then** no policy snapshots/backups are created as a side effect.

---

### User Story 2 — Observe Completeness/Confidence of a Sync (Priority: P1)

A tenant admin can tell whether an item is missing only because the latest sync was partial or failed (“not seen”, low confidence) or because it was not observed in a clean run (higher confidence).

**Why this priority**: Prevents misleading conclusions (e.g., “deleted”) when Graph errors or permissions issues occur.

**Independent Test**: Mark a run as partial/failed and verify missing items are presented as low confidence (derived at query/UI time) and do not imply deletion.

**Acceptance Scenarios**:

1. **Given** a `latestRun` for a tenant+selection that has `status != success` or `had_errors = true`, **When** inventory is queried for missing items relative to that run, **Then** missing is presented as low confidence (and no stronger claim is made).
2. **Given** a `latestRun` for a tenant+selection that is `status = success` and `had_errors = false`, **When** an item was not observed in that run, **Then** the UI can show it as “not seen in latest run” (higher confidence) without implying deletion.

---

### User Story 3 — Monitor Sync Runs (Priority: P2)

A tenant admin (and platform admin) can see sync run history and quickly diagnose failures using stable error codes and counts.

**Why this priority**: Makes automation observable and supportable at MSP scale.

**Independent Test**: Create sync runs with different statuses and verify run records include counts and stable error codes.

**Acceptance Scenarios**:

1. **Given** multiple sync runs, **When** a user views run history, **Then** each run shows status, started/finished timestamps, and counts (observed/updated/errors).
2. **Given** a throttling event, **When** a sync run records it, **Then** the run captures a stable error code (e.g., `graph_throttled`) and does not fail silently.

---

### Edge Cases

- Sync is triggered twice for the same tenant+selection while the first is still running.
- Sync completes with partial results due to transient Graph errors.
- A tenant’s permissions change between runs causing objects to be invisible.
- Selection payload is equivalent but arrays are ordered differently.

## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: System MUST maintain an Inventory Catalog that represents TenantPilot’s last observed state of Intune objects.
- **FR-002**: System MUST upsert inventory items by a stable identity key that prevents duplicates.
- **FR-003**: System MUST record Sync Runs with status, timestamps, counts, and stable error codes.
- **FR-004**: System MUST ensure tenant isolation for all inventory and run queries.
- **FR-005**: System MUST support deterministic selection scoping via `selection_hash` for sync runs.
- **FR-006**: System MUST NOT create snapshots/backups during inventory sync (sync is not backup).
- **FR-007**: System MUST derive “missing” as a computed state relative to the latest completed run for the same tenant+selection.
- **FR-008**: System MUST enforce `meta_jsonb` key whitelisting by dropping unknown keys without failing the sync.
- **FR-009**: System MUST implement safe automation behavior: locking, idempotency, and observable failures.

### Non-Functional Requirements

- **NFR-001 (Concurrency limits)**: Sync automation MUST enforce two limits: a global concurrency limit (across tenants) and a per-tenant concurrency limit.
- **NFR-002 (Throttling resilience)**: Sync MUST handle throttling/transient failures (e.g., 429/503) using backoff + jitter.
- **NFR-003 (Deterministic behavior)**: Selection hashing and capability derivation MUST be deterministic and testable.
- **NFR-004 (Data minimization)**: Inventory MUST store metadata and whitelisted meta only; payload-heavy content belongs to snapshots/backups.
- **NFR-005 (Safe logging)**: Logs MUST NOT contain secrets/tokens; monitoring MUST rely on run records + error codes.

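As an illustration of NFR-002, the following Python sketch shows one common way to apply exponential backoff with full jitter. It is not part of the requirement: `TransientGraphError`, `backoff_delay`, `call_with_backoff`, and the `base`, `cap`, and `max_attempts` values are hypothetical names and defaults chosen for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientGraphError(Exception):
    """Hypothetical exception for retryable responses such as 429 or 503."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"transient Graph error: HTTP {status_code}")
        self.status_code = status_code


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: scale the capped exponential delay by a random factor."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(operation: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying transient failures with backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientGraphError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; the caller records the error code
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` and `rng` parameters are injectable so the behavior stays deterministic in tests, in line with NFR-003. Honoring a server-provided `Retry-After` value is left out of this sketch.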
### Key Entities *(include if feature involves data)*

- **Inventory Item**: A tenant-scoped record representing a single Intune object as last observed (type, external identity, display name/metadata, last observed fields, whitelisted meta).
- **Sync Run**: A tenant-scoped record representing an inventory sync execution for a specific selection (`selection_hash`, status, timestamps, counts, stable error codes).
- **Selection Payload**: The normalized representation of the run scope used to compute `selection_hash`.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: For a given tenant, inventory sync can be executed repeatedly without creating duplicate inventory items.
- **SC-002**: A sync run always produces a run record with status, timestamps, and counts.
- **SC-003**: Missing is computed relative to the latest completed run for the same tenant+selection; runs with different selection hashes do not affect each other.
- **SC-004**: Unknown meta keys never break sync and are not persisted.
- **SC-005**: Operators can distinguish “not seen” from “deleted” (`deleted` is reserved and not produced in this feature).

## Spec Appendix: Deterministic Selection + Missing Semantics (copy/paste-ready)

### Definition: “completed” and “latestRun”

- **Definition:** `completed` means `status ∈ {success, partial, failed, skipped}` and `finished_at != null` (or the equivalent field used by the run model).
- **Definition:** `latestRun` is the latest completed Sync Run for `(tenant_id, selection_hash)`.

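The two definitions above can be sketched in Python as follows. This is an illustration only: the `SyncRun` dataclass and the choice to order runs by `finished_at` are assumptions (the spec does not say which field defines “latest”), and the `running` status used in the usage example is a hypothetical in-progress value.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional

COMPLETED_STATUSES = {"success", "partial", "failed", "skipped"}


@dataclass(frozen=True)
class SyncRun:
    # Hypothetical in-memory shape of a run record, for illustration only.
    id: int
    tenant_id: str
    selection_hash: str
    status: str
    finished_at: Optional[datetime]
    had_errors: bool = False


def is_completed(run: SyncRun) -> bool:
    """A run is completed when it has a terminal status and a finish time."""
    return run.status in COMPLETED_STATUSES and run.finished_at is not None


def latest_run(runs: Iterable[SyncRun], tenant_id: str,
               selection_hash: str) -> Optional[SyncRun]:
    """Latest completed run for (tenant_id, selection_hash), or None."""
    candidates = [
        run for run in runs
        if run.tenant_id == tenant_id
        and run.selection_hash == selection_hash
        and is_completed(run)
    ]
    # Assumption: "latest" is ordered by finished_at.
    return max(candidates, key=lambda run: run.finished_at, default=None)
```

For example, given a `success` run and a later `partial` run for the same tenant+selection plus a still-`running` one, `latest_run` returns the `partial` run, which is what keeps missing at low confidence for that selection.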
### Selection Hash

- `selection_payload` includes only fields that influence run scope:
  - `policy_types[]`, `categories[]`, `include_foundations` (bool), `include_dependencies` (bool)
- `canonical_json(payload)` is a canonical JSON serialization with:
  - sorted object keys
  - sorted arrays for `policy_types` and `categories`
  - no whitespace / pretty formatting
- `selection_hash = sha256(canonical_json(selection_payload))`
- **AC:** Identical selection payload ⇒ identical `selection_hash` (independent of array ordering).

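A minimal Python sketch of the hashing rules above, using only the standard library. The defaults for absent fields (empty arrays, `false` flags), UTF-8 encoding, and hex output are assumptions made for the example; the spec does not fix them.

```python
import hashlib
import json


def canonical_json(payload: dict) -> str:
    """Serialize a selection payload: sorted keys, sorted arrays, no whitespace."""
    normalized = {
        # Sort the two scope arrays so input ordering never changes the hash.
        "policy_types": sorted(payload.get("policy_types", [])),
        "categories": sorted(payload.get("categories", [])),
        # Assumption: absent flags default to False.
        "include_foundations": bool(payload.get("include_foundations", False)),
        "include_dependencies": bool(payload.get("include_dependencies", False)),
    }
    # Fields outside the four scope fields are ignored by construction.
    return json.dumps(normalized, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


def selection_hash(payload: dict) -> str:
    """Hex-encoded SHA-256 of the canonical JSON (assumes UTF-8 encoding)."""
    return hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()
```

Building the normalized dict from the four known fields, rather than serializing the payload as received, keeps unrelated request fields from leaking into the hash. Whether duplicate array entries should be collapsed before hashing is an open choice that this sketch does not make.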
### Missing is derived (not persisted)

- **Definition:** Missing is a derived state computed at query/UI time relative to `latestRun(tenant_id, selection_hash)`.
- **AC:** Runs with different `selection_hash` do not affect missing computation for other selections.
- If `latestRun.status != success` or `latestRun.had_errors = true`, items not observed in that run are presented as `missing (low confidence)`.

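The derivation can be read as a small decision table. The Python sketch below is one way to express it; the state labels (`observed`, `missing_low_confidence`, `not_seen_in_latest_run`, `unknown`) and the `LatestRun` shape are hypothetical, and how a caller decides that an item was observed in the latest run is deliberately left outside the function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LatestRun:
    # Only the two fields the rule depends on (hypothetical shape).
    status: str
    had_errors: bool


def missing_state(observed_in_latest_run: bool,
                  latest_run: Optional[LatestRun]) -> str:
    """Derive a display state at query time; nothing is persisted."""
    if latest_run is None:
        # Assumption: with no completed run, no claim about missing is made.
        return "unknown"
    if observed_in_latest_run:
        return "observed"
    if latest_run.status != "success" or latest_run.had_errors:
        return "missing_low_confidence"
    # Clean run that did not observe the item: higher confidence,
    # but still never reported as "deleted".
    return "not_seen_in_latest_run"
```

Note that a `skipped` run also falls into the low-confidence branch, because its status is not `success`.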
### Deleted is reserved

- `deleted` is reserved and MUST NOT be produced by this feature.
- Only a later lifecycle feature may set `deleted` with strict verification rules.

### Meta Whitelist (Fail-safe)

- `meta_jsonb` has a documented whitelist of allowed keys.
- **AC:** Unknown `meta_jsonb` keys are dropped (not persisted) and MUST NOT cause sync to fail.

#### Initial `meta_jsonb` whitelist (v1)

Allowed keys (all optional; if not applicable for a type, omit):

- `odata_type`: `string` (copied from Graph `@odata.type`)
- `etag`: `string|null` (Graph etag if available; never treated as a secret)
- `scope_tag_ids`: `array<string>` (IDs only; no display names required)
- `assignment_target_count`: `int|null` (count only; no target details)
- `warnings`: `array<string>` (bounded, human-readable, no secrets)

**AC:** Any other key is dropped silently (not persisted) and MUST NOT fail sync.

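A minimal sketch of the fail-safe filter in Python, assuming the v1 whitelist above. `ALLOWED_META_KEYS` and `filter_meta` are hypothetical names; the sketch filters by key only and does not validate value types or bound the `warnings` array.

```python
ALLOWED_META_KEYS = frozenset({
    "odata_type",
    "etag",
    "scope_tag_ids",
    "assignment_target_count",
    "warnings",
})


def filter_meta(raw_meta: dict) -> dict:
    """Keep whitelisted keys only; unknown keys are dropped without raising."""
    return {key: value for key, value in raw_meta.items()
            if key in ALLOWED_META_KEYS}
```

For example, `filter_meta({"odata_type": "#microsoft.graph.deviceConfiguration", "raw_payload": {...}})` keeps `odata_type` and drops `raw_payload` without raising.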
### Observed Run

- `inventory_items.last_seen_run_id` and `inventory_items.last_seen_at` are updated when an item is observed.
- `last_seen_run_id` implies the selection via `sync_runs.selection_hash`; no per-item selection hash is required for core.

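The upsert behavior behind these fields can be sketched with an in-memory store. This is an illustration under assumptions: the identity key `(tenant_id, policy_type, external_id)` is inferred from the acceptance scenarios rather than stated as the key, and `InventoryItem`, `upsert_observed`, and the dict-based store are hypothetical stand-ins for the real persistence layer.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Tuple

# Assumed identity key: (tenant_id, policy_type, external_id).
IdentityKey = Tuple[str, str, str]


@dataclass
class InventoryItem:
    tenant_id: str
    policy_type: str
    external_id: str
    display_name: str
    last_seen_at: datetime
    last_seen_run_id: int


def upsert_observed(store: Dict[IdentityKey, InventoryItem], tenant_id: str,
                    policy_type: str, external_id: str, display_name: str,
                    run_id: int, observed_at: datetime) -> InventoryItem:
    """Insert the item if new, otherwise refresh its last-observed fields."""
    key = (tenant_id, policy_type, external_id)
    item = store.get(key)
    if item is None:
        item = InventoryItem(tenant_id, policy_type, external_id,
                             display_name, observed_at, run_id)
        store[key] = item
    else:
        # Same identity observed again: update in place, never duplicate.
        item.display_name = display_name
        item.last_seen_at = observed_at
        item.last_seen_run_id = run_id
    return item
```

Because `tenant_id` is part of the key, the same `external_id` observed in two tenants yields two separate records, which matches the tenant-isolation requirement.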
### Run Error Codes (taxonomy)

Sync runs record:

- `status`: one of `success|partial|failed|skipped`
- `had_errors`: bool (true if any non-ideal condition occurred)
- `error_codes[]`: array of stable machine-readable codes (no secrets)

Minimal taxonomy (3–8 codes):

- `lock_contended` (a run could not start because the per-tenant+selection lock is held)
- `concurrency_limit_global` (global concurrency limit reached; run skipped)
- `concurrency_limit_tenant` (per-tenant concurrency limit reached; run skipped)
- `graph_throttled` (429 encountered; run partial/failed depending on recovery)
- `graph_transient` (503/timeout/other transient errors)
- `graph_forbidden` (403/insufficient permission)
- `unexpected_exception` (unexpected failure; message must be safe/redacted)

**Rule:** Run records MUST store codes (and safe, bounded context) rather than raw exception dumps or tokens.

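The taxonomy above maps naturally onto an enumeration. The Python sketch below is illustrative: `RunErrorCode` and `code_for_http_status` are hypothetical names, and only the three HTTP statuses named in the taxonomy are mapped; timeouts and other transient conditions would be classified by the caller.

```python
from enum import Enum
from typing import Optional


class RunErrorCode(str, Enum):
    """Stable, machine-readable codes from the minimal taxonomy."""
    LOCK_CONTENDED = "lock_contended"
    CONCURRENCY_LIMIT_GLOBAL = "concurrency_limit_global"
    CONCURRENCY_LIMIT_TENANT = "concurrency_limit_tenant"
    GRAPH_THROTTLED = "graph_throttled"
    GRAPH_TRANSIENT = "graph_transient"
    GRAPH_FORBIDDEN = "graph_forbidden"
    UNEXPECTED_EXCEPTION = "unexpected_exception"


def code_for_http_status(status_code: int) -> Optional[RunErrorCode]:
    """Map an HTTP status to a code; None means no code applies here."""
    if status_code == 429:
        return RunErrorCode.GRAPH_THROTTLED
    if status_code == 503:
        return RunErrorCode.GRAPH_TRANSIENT
    if status_code == 403:
        return RunErrorCode.GRAPH_FORBIDDEN
    return None
```

Storing `RunErrorCode.GRAPH_THROTTLED.value` (the string `graph_throttled`) in `error_codes[]` keeps run records free of raw exception text, in line with the rule above.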
### Concurrency Limits (source, defaults, behavior)

**Source:** Config (recommended keys):

- `tenantpilot.inventory_sync.concurrency.global_max`
- `tenantpilot.inventory_sync.concurrency.per_tenant_max`

**Defaults (if not configured):**

- global_max = 2
- per_tenant_max = 1

**Behavior when limits are hit:**

- The system MUST create a Sync Run record with:
  - `status = skipped`
  - `had_errors = true` (so missing stays low-confidence for that selection)
  - `error_codes[]` includes `concurrency_limit_global` or `concurrency_limit_tenant`
  - `started_at`/`finished_at` set (observable)
- No inventory items are mutated in a skipped run.

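A short Python sketch of the skip behavior, using the defaults above. The function names, the dict-shaped run record, and the order of the checks (global before per-tenant) are assumptions made for the example; the spec only requires that one of the two codes is recorded.

```python
from datetime import datetime, timezone
from typing import Optional

DEFAULT_GLOBAL_MAX = 2
DEFAULT_PER_TENANT_MAX = 1


def limit_hit(running_global: int, running_for_tenant: int,
              global_max: int = DEFAULT_GLOBAL_MAX,
              per_tenant_max: int = DEFAULT_PER_TENANT_MAX) -> Optional[str]:
    """Return the error code for the limit that blocks a new run, if any."""
    if running_global >= global_max:
        return "concurrency_limit_global"
    if running_for_tenant >= per_tenant_max:
        return "concurrency_limit_tenant"
    return None


def skipped_run_record(tenant_id: str, selection_hash: str,
                       error_code: str) -> dict:
    """Build the observable record for a run that never started syncing."""
    now = datetime.now(timezone.utc)
    return {
        "tenant_id": tenant_id,
        "selection_hash": selection_hash,
        "status": "skipped",
        "had_errors": True,  # keeps missing at low confidence
        "error_codes": [error_code],
        "started_at": now,
        "finished_at": now,
    }
```

Since the skipped record has `finished_at` set, it counts as completed and can become `latestRun` for its selection, which is why `had_errors = true` matters here.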
## Testing Guidance (non-implementation)

These are test cases expressed in behavior terms (not code).

### Test Cases — Sync and Upsert

- **TC-001**: Sync creates or updates inventory items and sets `last_seen_at`.
- **TC-002**: Re-running sync for the same tenant+selection updates existing records and does not create duplicates.
- **TC-003**: Inventory queries scoped to Tenant A never return Tenant B’s items.
- **TC-004**: Inventory sync does not create or modify snapshot/backup records (e.g., no new rows in `policy_versions`, `backup_sets`, `backup_items`, `backup_schedules`, `backup_schedule_runs`).

### Test Cases — Selection Hash Determinism

- **TC-010**: Same selection payload with arrays in different order yields the same `selection_hash`.
- **TC-011**: Different selection payload yields a different `selection_hash`.

### Test Cases — Missing Semantics

- **TC-020**: Missing is derived relative to the latest completed run for the same tenant+selection.
- **TC-021**: A run for selection Y does not affect missing computation for selection X.
- **TC-022**: If `latestRun.status` is not `success` (partial, failed, or skipped) or `latestRun.had_errors` is true, missing is shown as low confidence.

### Test Cases — Meta Whitelist

- **TC-030**: Unknown meta keys are not persisted and do not fail sync.

### Test Cases — Automation Safety

- **TC-040**: Concurrent sync triggers for the same tenant+selection do not result in overlapping runs (lock behavior).
- **TC-041**: A throttling event results in a visible, stable error code and a non-silent failure signal.