
# Feature Specification: Entra Group Directory Cache (Groups v1)

**Feature Branch**: `051-entra-group-directory-cache`
**Created**: 2026-01-11
**Status**: Draft
**Input**: User description: "Tenant-scoped Entra ID Groups cache (read-only), populated by queued sync runs, used for name-resolution across the suite; UI must render from DB-only (no live directory calls)."

## Clarifications

### Session 2026-01-11

- Q: What is the scope of the Groups v1 sync source? → A: All groups in the tenant.
- Q: How is the Groups sync started (MVP)? → A: Manual + scheduled/periodic.
- Q: What happens in the cache when a group is not returned on the next full sync? → A: Retain for 90 days after last seen, then purge.
- Q: In which auth mode does TenantAtlas read groups for sync runs? → A: App-only / service principal.
- Q: What data is cached in Groups v1? → A: Group metadata only (no membership/owners).
- Q: What timezone/clock semantics apply for staleness/retention comparisons? → A: UTC everywhere.
- Q: How are sync run statuses defined (partial vs failed)? → A: Partial only if some pages processed and at least one upsert occurred; otherwise failed.
- Q: What paging “safety stop” bounds apply to full-tenant group enumeration (v1)? → A: Max 200 pages, max 10 minutes, abort on retry exhaustion; record safety-stop reason + counters.

## Pinned Decisions (v1 defaults)

These defaults are intentionally “hard” for Groups v1 to avoid ambiguity during planning/implementation.

- **Schedule cadence default**: Scheduled sync runs daily at **02:00 UTC** (environment default). Manual sync is always available.
- **Auth mode (required)**: App-only (service principal). UI MUST NOT require delegated tokens.
- **Required Graph permission (application)**: `Group.Read.All`.
- **Graph API family**: Must work with Microsoft Graph **v1.0** semantics (no beta-only features required).
- **Paging strategy (v1)**: Full listing with `@odata.nextLink` paging. Delta sync is explicitly deferred to a future version.
- **Staleness default**: A group is “stale” if `last_seen_at < now() - 30 days` (computed in **UTC**, configurable per environment).
- **Retention/purge default**: Retain unseen groups for **90 days after last_seen_at** (computed in **UTC**), then purge.
- **Scope boundary**: No group membership/owners caching; no cross-tenant compare/promotion work inside this feature.
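
The staleness and retention defaults above can be expressed as two UTC comparisons. The following is a minimal sketch, not the implementation: the function names are hypothetical, and the strict `<` boundary and the rejection of naive timestamps are assumptions that this spec does not pin down.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)   # staleness default from this spec
RETAIN_FOR = timedelta(days=90)    # retention default from this spec


def _as_utc(ts: datetime) -> datetime:
    """Reject naive timestamps so every comparison is explicitly UTC (assumption)."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return ts.astimezone(timezone.utc)


def is_stale(last_seen_at: datetime, now: datetime) -> bool:
    """True when the group has not been observed for more than 30 days."""
    return _as_utc(last_seen_at) < _as_utc(now) - STALE_AFTER


def is_purgeable(last_seen_at: datetime, now: datetime) -> bool:
    """True when the 90-day retention window after last_seen_at has elapsed."""
    return _as_utc(last_seen_at) < _as_utc(now) - RETAIN_FOR
```

Under these defaults a group last seen 31 days ago is stale but still retained, so it can keep resolving labels until the retention window elapses.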

## Authorization & Access (Groups v1)

TenantPilot roles are tenant-scoped. Unless stated otherwise, all access below is limited to the active tenant context.

- **Browse cached groups (Directory → Groups)**: allowed for roles `Owner`, `Manager`, `Operator`, `Readonly`.
- **View group sync runs**: allowed for roles `Owner`, `Manager`, `Operator`, `Readonly`.
- **Start manual sync**: allowed for roles `Owner`, `Manager`, `Operator`.
- **Cross-tenant access**: forbidden (no browsing or resolution of another tenant’s cached groups).

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Sync groups into a tenant-scoped cache (Priority: P1)

As a tenant admin, I can trigger a background sync that stores the latest observed Entra groups into a tenant-scoped cache, so the suite can display stable, human-friendly group names without relying on live directory lookups.

**Why this priority**: Without a cache, most assignment-heavy workflows are hard to use and hard to troubleshoot because the UI can show only raw group IDs.

**Independent Test**: Trigger a sync for a tenant and verify that a run record exists and that group rows become available for that tenant.

**Acceptance Scenarios**:

1. **Given** I am in a tenant workspace, **When** I start a “Sync Groups” operation, **Then** a run is created and the sync executes asynchronously (request returns without waiting).
2. **Given** I start a groups sync for a tenant, **When** the sync completes successfully, **Then** the cache reflects the full set of groups in that tenant as of that run.
3. **Given** a sync run completes successfully, **When** I browse groups for that tenant, **Then** I see group entries with display names and stable identifiers.
4. **Given** a sync run completes partially or fails, **When** I view the run record, **Then** I can see status and a safe error summary that helps triage.
5. **Given** scheduled sync is enabled, **When** the schedule triggers, **Then** a run is created and executed without manual intervention and is visible to operators.
6. **Given** a groups sync is already running for the same tenant and selection, **When** I start another sync, **Then** the request is deduplicated (no second concurrent run is created) and I can identify the already-active run.
7. **Given** a scheduled sync run is created, **When** I view the run record, **Then** it is clearly identified as “scheduled/system-initiated” (no interactive user session required).
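
Scenario 6 (deduplication of a concurrent start) can be illustrated with an in-memory stand-in for the run table. This is a sketch under stated assumptions: `Run`, `RunStore`, and `start_sync` are hypothetical names, and the selection key `groups-v1:all` is the one fixed by FR-004. A real implementation would likely need a database-level lock or unique constraint to make the check race-free.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

SELECTION_KEY = "groups-v1:all"          # stable selection identifier (FR-004)
ACTIVE_STATUSES = {"pending", "running"}


@dataclass
class Run:
    run_id: int
    tenant_id: str
    selection_key: str
    status: str = "pending"


@dataclass
class RunStore:
    """In-memory stand-in for the sync run table (hypothetical)."""
    runs: List[Run] = field(default_factory=list)

    def start_sync(self, tenant_id: str) -> Tuple[Run, bool]:
        """Return (run, created); reuse the active run for this tenant if one exists."""
        for run in self.runs:
            if (run.tenant_id == tenant_id
                    and run.selection_key == SELECTION_KEY
                    and run.status in ACTIVE_STATUSES):
                return run, False
        run = Run(run_id=len(self.runs) + 1, tenant_id=tenant_id,
                  selection_key=SELECTION_KEY)
        self.runs.append(run)
        return run, True
```

Returning the existing run alongside `created=False` lets the caller point the operator at the already-active run instead of failing silently.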

---

### User Story 2 - Browse groups (Priority: P2)

As a tenant admin, I can browse, search, and filter cached groups so I can quickly resolve group IDs to names and validate whether expected groups exist in the tenant.

**Why this priority**: Operators need a direct “source of truth (as last seen)” surface to debug restore mapping, dependencies, and drift findings.

**Independent Test**: After a sync run, open the groups list and verify search/filter/detail views work using only cached data.

**Acceptance Scenarios**:

1. **Given** groups have been synced, **When** I open “Directory → Groups”, **Then** I can search by display name and open a group detail view.
2. **Given** some groups were not observed recently, **When** I filter for “stale” groups, **Then** I see only groups whose last-seen timestamp is older than the staleness threshold.
3. **Given** no groups have been synced yet, **When** I open “Directory → Groups”, **Then** I see a clear empty-state that explains the cache is empty and offers a “Sync Groups” action.

---

### User Story 3 - Name resolution across the suite (Priority: P3)

As an operator, when other pages reference group IDs (dependencies, restore mapping, drift, compare), the UI shows a friendly label if the group exists in the cache; otherwise it shows a clear “unresolved” fallback.

**Why this priority**: This turns the cache into a foundational building block used across modules while keeping rendering safe and predictable.

**Independent Test**: Load a page that includes a group GUID reference and verify it renders with a name if present, and with a fallback if not present—without making any live directory calls during rendering.

**Acceptance Scenarios**:

1. **Given** a page references a group GUID that exists in the cache, **When** I view the page, **Then** I see `Group: <display name> (…last8)` (or equivalent) derived from cached data.
2. **Given** a page references a group GUID that does not exist in the cache, **When** I view the page, **Then** I see `Group (unresolved): …last8` (or equivalent) and the page still renders.
3. **Given** I view any page that renders group labels, **When** the page renders, **Then** it MUST NOT make any live directory calls (no Graph requests during render-time), and automated tests MUST fail hard if a Graph client is invoked.
4. **Given** no groups have been synced yet, **When** a page renders a group GUID reference, **Then** it still renders with the unresolved fallback (and does not attempt live directory lookup during render).
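
The label formats in scenarios 1 and 2 can be sketched as a pure function over cached data. The names `short_id` and `group_label` are hypothetical, and the cache is modeled as a plain mapping from group ID to display name; the spec allows an equivalent format, so the exact strings are illustrative.

```python
from typing import Mapping, Optional


def short_id(group_id: str) -> str:
    """Last 8 characters of the group GUID, prefixed with an ellipsis."""
    return "…" + group_id[-8:]


def group_label(group_id: str, cached_names: Mapping[str, str]) -> str:
    """Build a label from cached data only; never performs a live lookup."""
    name: Optional[str] = cached_names.get(group_id)
    if name:
        return f"Group: {name} ({short_id(group_id)})"
    return f"Group (unresolved): {short_id(group_id)}"
```

Because the function takes the cache as its only data source, an empty cache (scenario 4) falls through to the unresolved fallback without any special handling.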

---

### Edge Cases

- **Throttling / transient failures (retry policy)**:
	- Retryable conditions for Graph reads: HTTP `429`, `503`, and network timeouts.
	- Non-retryable (fail-fast): HTTP `403` (permission), and other non-2xx responses unless explicitly categorized as retryable.
	- Backoff strategy: exponential backoff with jitter (full jitter), capped.
	- Max retries (total per run): 8 (aligned with safety stop `retry_exhausted`).
	- If retries are exhausted: run MUST abort with `safety_stop_triggered=true`, `safety_stop_reason=retry_exhausted`, and status MUST follow FR-002a / CR-002a (partial if any upserts, else failed).
	- Operator-visible: run record MUST show `error_category=throttling` (or `transient` for timeouts), and a safe summary including retry count and last HTTP status (if available).

- **Permission missing (forbidden)**:
	- If Graph returns HTTP `403` on group list, the run MUST stop immediately (no retries), with `error_category=permission`, and status MUST be `failed`.
	- Operator-visible guidance MUST be explicit: `Group.Read.All` (application permission) missing and/or admin consent missing.

- **Groups disappear / reappear**:
	- If a group is not observed, it is retained per FR-005a and may still resolve labels until purged.
	- If the same group ID reappears within the retention window, the record MUST be updated and `last_seen_at` refreshed.

- **Large tenants**:
	- Runs MUST respect CR-002a safety-stop bounds (max pages / max runtime) to prevent runaway cost and load.
	- UI browse/search MUST operate on cached DB data only and rely on indexes (no live Graph during render-time).
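
The retry policy above can be sketched as a retryability check plus a full-jitter delay calculation. This is an illustration, not the implementation: the function names are hypothetical, the 60-second cap comes from FR-011b, and the 1-second base delay is an assumption because this spec does not fix one.

```python
import random
from typing import Callable, Optional

RETRYABLE_STATUSES = {429, 503}   # retryable HTTP statuses from this spec
MAX_RETRIES_PER_RUN = 8           # total retries per run
MAX_DELAY_SECONDS = 60.0          # per-delay cap (FR-011b)
BASE_DELAY_SECONDS = 1.0          # assumption: base delay is not specified


def is_retryable(status: Optional[int], timed_out: bool = False) -> bool:
    """Network timeouts and HTTP 429/503 are retryable; everything else fails fast."""
    return timed_out or status in RETRYABLE_STATUSES


def full_jitter_delay(attempt: int,
                      rng: Callable[[], float] = random.random) -> float:
    """Delay before retry `attempt` (1-based): rng() * min(cap, base * 2**attempt)."""
    ceiling = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * (2 ** attempt))
    return rng() * ceiling
```

Injecting `rng` keeps the delay deterministic in tests while production code uses the default random source.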

## Out of Scope (Groups v1)

- Caching group **membership** or **owners**.
- Any UI behavior that requires delegated Graph tokens for group name resolution.
- Cross-tenant compare/promotion of groups.
- Delta-sync based directory change tracking.

## Requirements *(mandatory)*

**Constitution alignment (required):** If this feature introduces any external directory calls or any write/change behavior,
the spec MUST describe contract registry updates, safety gates (preview/confirmation/audit), tenant isolation, and tests.

### Assumptions & Dependencies

- The application already has a tenant context concept; this cache is scoped strictly to the active tenant.
- Background processing infrastructure exists (queue worker) so sync can run asynchronously.
- The tenant has (or can be granted) sufficient directory read permissions to list groups.
- Directory reads for group sync run using app-only (service principal) permissions.
- Required Graph permission for Groups v1 is `Group.Read.All` (application permission).
- Groups v1 does not require Graph beta-only capabilities.
- Other modules that display group references can integrate via a shared “group name resolution” capability.
- Groups v1 sync scope is all groups in the tenant (not only groups already referenced by TenantAtlas).
- The system can execute scheduled background work (e.g., cron/scheduler) to run periodic group sync.
- The system can enforce retention/purge for cached groups that have not been observed for a configured period.
- Groups v1 cache stores group metadata only and does not store group membership or owners.

### Functional Requirements

- **FR-001 (Tenant-scoped cache)**: System MUST store a tenant-scoped cache of Entra groups in the application, and it MUST be read-only with respect to Entra.
- **FR-001a (Cached fields; no membership caching)**: Groups v1 caching MUST store only group metadata needed for name resolution and browsing, and MUST NOT store group membership or owners. Cached fields for v1 MUST include: `id`, `displayName`, `groupTypes`, `securityEnabled`, `mailEnabled`, and `last_seen_at`.
- **FR-002 (Sync runs)**: System MUST create an append-only “sync run” record for each sync attempt, including lifecycle status and basic counters (observed/upserted/errors).
- **FR-002a (Status semantics: partial vs failed)**: The run MUST use explicit statuses: `pending`, `running`, `succeeded`, `failed`, `partial`. Criteria:
	- `succeeded`: all pages processed to completion.
	- `partial`: at least one page was processed AND `items_upserted_count > 0`, but the run did not complete successfully (e.g., aborted due to repeated throttling, transient faults, or other non-fatal error conditions).
	- `failed`: zero progress (no processed pages) OR `items_upserted_count = 0` due to a fatal condition (e.g., missing permission) or immediate abort.
	The run record MUST include `error_category` (see FR-011) and a safe `error_summary` when status is `failed` or `partial`.
- **FR-002b (Run record: paging + safety stop fields)**: The run record MUST include `pages_fetched`, `items_observed_count`, `items_upserted_count`, and (when applicable) `safety_stop_triggered` plus `safety_stop_reason`.
- **FR-003 (Async only)**: Starting a groups sync MUST dispatch background work and MUST NOT perform full sync work in the initiating HTTP request.
- **FR-004 (Idempotent selection)**: System MUST support a deterministic “selection identifier” for the v1 groups sync scope so repeated requests with the same selection can be recognized and deduplicated. For Groups v1, the selection identifier MUST be the stable string `groups-v1:all`.
- **FR-004a (Full tenant scope)**: For Groups v1, the sync MUST attempt to enumerate all groups visible in the tenant scope (subject to permissions and provider limits).
- **FR-004b (Start modes + cadence default)**: The system MUST support starting groups sync both manually (operator-initiated) and on a periodic schedule. The default scheduled cadence for Groups v1 MUST be daily at 02:00 UTC (configurable per environment).
- **FR-005 (Staleness)**: System MUST track when a group was last observed and MUST allow identifying “stale” groups as “not observed for N days”; for v1, default N = 30 and configurable per environment. All comparisons MUST be computed in **UTC**.
- **FR-005a (Retention & purge)**: If a group is not observed in subsequent full-tenant syncs, the system MUST retain the cached record for 90 days after its last observed timestamp (`last_seen_at`), and it MUST purge the record after that retention window. All comparisons MUST be computed in **UTC**.
- **FR-005b (Disappear / reappear semantics)**: The cache MUST behave deterministically when a group disappears and later reappears:
	- If a group is not observed in a run, its cached row MUST remain unchanged (including `last_seen_at`) until either it is observed again or it is purged per FR-005a.
	- If the same group `id` is observed again **within** the retention window, the system MUST update the existing row (refresh `displayName`, flags/types as returned, and set `last_seen_at` to the run’s observation time).
	- If the group was already purged (retention window elapsed) and later reappears, the system MUST create a new cached row for that group `id` on observation.
- **FR-006 (UI safety + guard test)**: UI rendering for directory groups and for name resolution MUST use cached data only (no live directory calls at render time). The feature MUST include a test that fails hard if the Graph client is invoked during render.
- **FR-006a (Definition: render-time)**: “Render-time” means the synchronous request lifecycle that produces UI output (Filament pages, Livewire component renders, and any server-side code executed to build the response). During render-time, the system MUST NOT call Microsoft Graph for group data. Background work (queued jobs, scheduled commands) MAY call Graph.
- **FR-007 (Search & filters)**: Users MUST be able to search cached groups by name and filter by at least “stale vs. fresh” and “group type” (security vs. M365 group) when that info is available.
- **FR-008 (Cross-module resolution)**: When other modules reference a group ID, the system MUST resolve it to a friendly label from the cache when available, and MUST show a clear unresolved fallback when not available.
- **FR-009 (Audit & observability)**: Starting a sync and completing/failing a sync MUST be auditable (who initiated, when, status), and the operator MUST be able to view the run record.
- **FR-009a (Audit minimum fields + visibility)**: Each sync attempt MUST produce (1) a sync run record visible in the admin UI, and (2) audit entries for start and finish/failure visible in the tenant audit log. Minimum audit metadata for these entries: `tenant`, `action`, `status`, `run_id`, `selection_key`, initiator identity (user id or “system”), and on completion: `observed_count`, `upserted_count`, `error_count`, `error_category` (if any).
- **FR-010 (Tenant isolation)**: All group cache data and sync run data MUST be strictly tenant-scoped; cross-tenant access MUST be prevented by authorization.
- **FR-011 (Error hygiene + categories)**: Failure details stored for runs MUST be safe to display (no secrets), and SHOULD be summarized into stable categories. For v1, the supported categories are: `permission`, `throttling`, `transient`, `unknown`.
- **FR-011a (Auth mode + required permission)**: Directory reads for groups sync MUST use app-only (service principal) authorization and MUST NOT depend on an interactive user session. The required Graph permission is `Group.Read.All` (application).
- **FR-011b (Retry & backoff policy; v1)**: The sync implementation MUST apply a consistent retry policy for Graph group listing:
	- Retryable: HTTP `429`, `503`, and network timeouts.
	- Backoff: exponential backoff with jitter (full jitter), capped at 60 seconds per delay.
	- Max retries: 8 total per run.
	- On retry exhaustion: abort per CR-002a (`retry_exhausted`) and set status per FR-002a.
- **FR-011c (Permission-missing operator UX)**: When a run fails due to missing permissions (HTTP `403`), the operator-facing UI MUST display a stable error code and guidance.
	- Error code: `graph_forbidden`.
	- Guidance: “Grant `Group.Read.All` (application permission) and admin consent for the tenant, then retry.”
- **FR-011d (Throttling/transient operator UX)**: When a run aborts due to throttling/transient retry exhaustion, the operator-facing UI MUST display a stable warning/error code and a safe summary.
	- Error/warning code: `graph_throttled` for repeated `429/503`; `graph_timeout` for timeouts.
	- Summary MUST include: retry count (up to 8) and whether the safety stop was triggered.
- **FR-012 (Tests)**: The feature MUST include automated tests covering tenant isolation, basic sync-run lifecycle persistence, and “UI pages render without live directory calls,” including a fail-hard guard test that asserts the Graph client is not invoked during render.
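
The fail-hard guard required by FR-006 and FR-012 can be sketched as a test double that raises on any method call. The sketch is language-neutral in intent and written in Python for illustration only; `FailHardGraphClient`, `GraphCalledDuringRender`, and `render_group_row` are hypothetical names, and the actual test would bind an equivalent double in place of the application's Graph client.

```python
class GraphCalledDuringRender(AssertionError):
    """Raised when render-time code touches the Graph client."""


class FailHardGraphClient:
    """Test double: calling any method on it fails the test immediately."""

    def __getattr__(self, name: str):
        def _fail(*args, **kwargs):
            raise GraphCalledDuringRender(
                f"Graph client method '{name}' invoked during render"
            )
        return _fail


def render_group_row(group_id: str, cached_names: dict, graph_client) -> str:
    """Hypothetical render function: reads the cache and never uses the client."""
    name = cached_names.get(group_id)
    return name if name is not None else "unresolved"
```

A render path that accidentally calls the client raises an assertion error, so the test cannot pass silently with a live lookup in place.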

### Scheduled Sync Semantics

- **SS-001 (Initiator identity)**: Scheduled sync runs MUST be recorded as system-initiated (no user initiator).
- **SS-002 (Visibility)**: Operators MUST be able to distinguish scheduled vs manual runs when viewing run records.
- **SS-003 (Schedule dedupe)**: Scheduled dispatch MUST NOT create duplicate runs for the same tenant + selection in the same schedule slot. If a run for the current slot already exists (or an active run is in progress), the dispatcher MUST skip creating a second run.
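
SS-003 can be sketched as a slot key plus a dispatch check. This is an illustration under assumptions: `slot_key` and `should_dispatch` are hypothetical names, and equating one schedule slot with one UTC calendar date follows from the daily cadence default but is not stated by this spec.

```python
from datetime import datetime, timezone
from typing import Set, Tuple

SELECTION_KEY = "groups-v1:all"   # stable selection identifier (FR-004)

SlotKey = Tuple[str, str, str]    # (tenant_id, selection_key, UTC date)


def slot_key(tenant_id: str, scheduled_for: datetime) -> SlotKey:
    """Assumption: with a daily cadence, one slot equals one UTC calendar date."""
    slot = scheduled_for.astimezone(timezone.utc).date().isoformat()
    return (tenant_id, SELECTION_KEY, slot)


def should_dispatch(tenant_id: str, scheduled_for: datetime,
                    existing_slots: Set[SlotKey],
                    has_active_run: bool) -> bool:
    """Skip when the slot already has a run or a run is still in progress."""
    if has_active_run:
        return False
    return slot_key(tenant_id, scheduled_for) not in existing_slots
```

Storing the slot key on the run record would let a unique constraint enforce the same rule at the database level.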

### Contract Requirements

- **CR-001 (Graph contract registry)**: The feature MUST register the Groups v1 directory read contract in the Graph contract registry.
- **CR-002 (List endpoint + select fields)**: Sync MUST read groups via `GET /groups` with `$select=id,displayName,groupTypes,securityEnabled,mailEnabled` and MUST page via `@odata.nextLink` until completion (or a documented safety stop).
- **CR-002a (Safety stop: bounds + abort criteria)**: Sync MUST enforce safety-stop bounds for Groups v1 to prevent runaway runs:
	- **Max pages**: 200 pages per run.
	- **Max runtime**: 10 minutes per run.
	- **Abort criteria (immediate)**: stop the run if runtime exceeds max runtime, pages exceed max pages, or the Graph client exceeds 8 total retries for retryable throttling/transient conditions (e.g., repeated 429/503) (“retry exhausted”).
	- **Abort status**: if `items_upserted_count > 0`, mark run as `partial`; otherwise mark run as `failed`.
	- **Run record**: set `safety_stop_triggered=true` and set `safety_stop_reason` to one of: `max_pages`, `max_runtime`, `retry_exhausted`.
- **CR-003 (Delta strategy deferred)**: Delta endpoints/strategies MUST NOT be used in Groups v1.
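
The paging loop and safety-stop bounds in CR-002 and CR-002a can be sketched as follows. This is a minimal sketch, not the implementation: `fetch_page`, `RetryExhausted`, `RunResult`, and `sync_groups` are hypothetical names, a page is modeled as `(items, next_link)` where `next_link` stands in for `@odata.nextLink`, and the retry layer is assumed to raise `RetryExhausted` once the retry budget is spent.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_PAGES = 200
MAX_RUNTIME_SECONDS = 600.0   # 10 minutes

Page = Tuple[List[dict], Optional[str]]   # (items, next_link)


class RetryExhausted(Exception):
    """Hypothetical signal from the retry layer that the retry budget is spent."""


@dataclass
class RunResult:
    status: str = "running"
    pages_fetched: int = 0
    items_observed_count: int = 0
    items_upserted_count: int = 0
    safety_stop_triggered: bool = False
    safety_stop_reason: Optional[str] = None


def _abort(result: RunResult, reason: str) -> RunResult:
    """Record the safety stop; partial if anything was upserted, else failed."""
    result.safety_stop_triggered = True
    result.safety_stop_reason = reason
    result.status = "partial" if result.items_upserted_count > 0 else "failed"
    return result


def sync_groups(fetch_page: Callable[[Optional[str]], Page],
                upsert: Callable[[dict], None],
                clock: Callable[[], float] = time.monotonic,
                max_pages: int = MAX_PAGES,
                max_runtime: float = MAX_RUNTIME_SECONDS) -> RunResult:
    """Follow next links until completion or until a safety-stop bound is hit."""
    result = RunResult()
    started = clock()
    link: Optional[str] = None   # None requests the first page
    while True:
        if clock() - started > max_runtime:
            return _abort(result, "max_runtime")
        if result.pages_fetched >= max_pages:
            return _abort(result, "max_pages")
        try:
            items, link = fetch_page(link)
        except RetryExhausted:
            return _abort(result, "retry_exhausted")
        result.pages_fetched += 1
        for item in items:
            result.items_observed_count += 1
            upsert(item)
            result.items_upserted_count += 1
        if link is None:
            result.status = "succeeded"
            return result
```

Checking the bounds before each fetch means a tenant with exactly `max_pages` pages still succeeds, while one more page triggers the `max_pages` stop without issuing the extra request.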

### Key Entities *(include if feature involves data)*

- **EntraGroup**: A tenant-scoped cached record representing an Entra group (external ID, display name, group type/flags when available, last observed timestamp). Groups v1 stores metadata only.
- **EntraGroupSyncRun**: A tenant-scoped run record representing one attempt to sync groups (status, timestamps, counters, safe error summary, initiator).
- **GroupReference**: A cross-module reference to a group by ID that can be resolved (or not) via the cache.
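
The two persisted entities can be sketched as plain data structures. Field names follow the requirements where they are specified (FR-001a, FR-002a, FR-002b, FR-011); `tenant_id`, `initiator`, and the snake_case attribute names are assumptions, and the storage schema may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class EntraGroup:
    tenant_id: str                            # tenant scope (name is an assumption)
    id: str                                   # Entra group object ID
    display_name: str
    group_types: List[str] = field(default_factory=list)
    security_enabled: Optional[bool] = None
    mail_enabled: Optional[bool] = None
    last_seen_at: Optional[datetime] = None   # stored in UTC


@dataclass
class EntraGroupSyncRun:
    tenant_id: str
    selection_key: str = "groups-v1:all"
    status: str = "pending"       # pending | running | succeeded | failed | partial
    initiator: str = "system"     # user id, or "system" for scheduled runs
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    pages_fetched: int = 0
    items_observed_count: int = 0
    items_upserted_count: int = 0
    error_count: int = 0
    error_category: Optional[str] = None   # permission | throttling | transient | unknown
    error_summary: Optional[str] = None
    safety_stop_triggered: bool = False
    safety_stop_reason: Optional[str] = None

    @property
    def duration_seconds(self) -> Optional[float]:
        """Computed duration; None while the run has not finished."""
        if self.started_at is None or self.finished_at is None:
            return None
        return (self.finished_at - self.started_at).total_seconds()


run = EntraGroupSyncRun(
    tenant_id="tenant-a",
    started_at=datetime(2026, 1, 11, 2, 0, 0, tzinfo=timezone.utc),
    finished_at=datetime(2026, 1, 11, 2, 0, 42, tzinfo=timezone.utc),
)
```

`GroupReference` is omitted because it is a cross-module reference by ID rather than a stored record.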

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001 (Resolve time)**: For a tenant with cached groups, operators can resolve a group ID to a human-friendly label in under 30 seconds.
	- **Measured from**: UI workflow time (manual stopwatch) for either (a) Directory → Groups search + open detail OR (b) in-context label rendering.
	- **Scope/window**: 95th percentile over 20 attempts on a representative tenant with cached groups.
	- **Pass/fail reporting**: recorded as a QA note for the release gate.

- **SC-002 (Render resilience)**: Directory/Groups pages render successfully even when the external directory API is unavailable.
	- **Measured from**: server-side request completion (HTTP 200) while the sync job’s Graph calls are failing/blocked (e.g., simulated outage), demonstrating the UI does not depend on live Graph during render-time.
	- **Scope/window**: 20 page loads across Directory → Groups list and detail.
	- **Pass/fail reporting**: QA note; must remain true for all future pages integrating the label resolver.

- **SC-003 (Label resolution rate)**: After a successful sync, at least 95% of group GUID references on supported pages resolve to a friendly label.
	- **Measured from**: page output inspection against the cached DB state (resolved label present vs unresolved fallback) for a representative tenant.
	- **Scope/window**: sample of supported pages + a set of group GUID references observed in the last successful run; target is 95% resolved.
	- **Pass/fail reporting**: QA note with sample size and tenant used.

- **SC-004 (End-to-end operator workflow time)**: Operators can complete “Sync Groups → Verify group exists → Use group in mapping” in under 3 minutes.
	- **Measured from**: UI workflow time (manual stopwatch) plus sync run duration from DB.
	- **Scope/window**: 95th percentile over the last 20 runs per tenant + selection key (`groups-v1:all`).
	- **Reporting requirement**: run detail UI MUST show `started_at`, `finished_at`, computed `duration_seconds`, and counters (`items_observed_count`, `items_upserted_count`, `error_count`).
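
The 95th-percentile windows in SC-001 and SC-004 can be computed with a nearest-rank percentile. The spec does not name a percentile method, so nearest-rank is an assumption, and `p95_nearest_rank` is a hypothetical helper.

```python
from typing import Sequence


def p95_nearest_rank(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n), 1-based."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]
```

With 20 samples the rank is 19, so the criterion passes when the 19th-slowest measurement is under the target.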