TenantAtlas/specs/096-ops-polish-assignment-dedupe-system-tracking/spec.md
ahmido 03127a670b Spec 096: Ops polish (assignment summaries + dedupe + reconcile tracking + seed DX) (#115)
Implements Spec 096 ops polish bundle:

- Persist durable OperationRun.summary_counts for assignment fetch/restore (final attempt wins)
- Server-side dedupe for assignment jobs (15-minute cooldown + non-canonical skip)
- Track ReconcileAdapterRunsJob via workspace-scoped OperationRun + stable failure codes + overlap prevention
- Seed DX: ensure seeded tenants use UUID v4 external_id and seed satisfies workspace_id NOT NULL constraints

Verification (local / evidence-based):
- `vendor/bin/sail artisan test --compact tests/Feature/Operations/AssignmentRunSummaryCountsTest.php tests/Feature/Operations/AssignmentJobDedupeTest.php tests/Feature/Operations/ReconcileAdapterRunsJobTrackingTest.php tests/Feature/Seed/PoliciesSeederExternalIdTest.php`
- `vendor/bin/sail bin pint --dirty`

Spec artifacts included under `specs/096-ops-polish-assignment-dedupe-system-tracking/` (spec/plan/tasks/checklists).

Co-authored-by: Ahmed Darrazi <ahmed.darrazi@live.de>
Reviewed-on: #115
2026-02-15 20:49:38 +00:00

# Feature Specification: 096 — Ops Polish Bundle (Assignment job summaries + job dedupe + system job tracking + seeder DX)
**Feature Branch**: `096-ops-polish-assignment-dedupe-system-tracking`

**Created**: 2026-02-15

**Status**: Draft

**Input**: User description: "096 — Ops Polish Bundle (Assignment job summaries + job dedupe + system job tracking + seeder DX)"

## Spec Scope Fields *(mandatory)*
- **Scope**: tenant
- **Primary Routes**: None (background operations only)
- **Data Ownership**: tenant-owned operational records (run tracking + run outcomes) and tenant seed data
- **RBAC**: No new permissions; no changes to user-facing authorization behavior (existing gates/policies remain the source of truth for starting operations)
## Clarifications
### Session 2026-02-15
- Q: What is the dedupe identity rule for assignment jobs? → A: Use `operation_run_id` when available; otherwise dedupe by `tenant_id + job_type + stable input fingerprint` (no secrets).
- Q: If an assignment job retries, how should `OperationRun` summary counters be persisted? → A: On completion (success or terminal failure), write/overwrite the run's counters so they reflect the final attempt.
- Q: What deduplication duration should we enforce for assignment jobs? → A: 15 minutes.
- Q: For seeded tenants, what format should `tenants.external_id` use? → A: UUID string (v4).
- Q: What `OperationRun.type` should we use for `ReconcileAdapterRunsJob` tracking? → A: `ops.reconcile_adapter_runs`.
## User Scenarios & Testing *(mandatory)*
### User Story 1 - Assignment runs show durable summaries (Priority: P2)
As an operator, I want assignment-related background runs to persist consistent summary counters so that run outcomes can be audited reliably (especially under retries).

**Why this priority**: These jobs already run in production; missing summaries reduce observability and make incident triage slower.

**Independent Test**: Dispatch one assignment job run, then verify that the recorded run summary includes total/processed/failed counters and remains correct after a simulated retry.

**Acceptance Scenarios**:

1. **Given** an assignment fetch run completes, **When** its run record is inspected, **Then** it includes non-null summary counters for total, processed, and failed.
2. **Given** an assignment restore run completes, **When** its run record is inspected, **Then** it includes non-null summary counters for total, processed, and failed.
3. **Given** an assignment run retries due to transient failure, **When** it ultimately completes, **Then** its summary counters do not double-count work across attempts.
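
Scenario 3 holds if completion *overwrites* the persisted counters instead of adding to them: an increment-based update would report `total=20` for a 10-item run whose counters were written twice. The Python sketch below illustrates the overwrite rule only; `RunLedger`, `SummaryCounts`, and `record_completion` are hypothetical stand-ins for the `OperationRun` store and its summary counters, not the project's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryCounts:
    """Counters computed from scratch by a single attempt (hypothetical shape)."""

    total: int
    processed: int
    failed: int


class RunLedger:
    """Hypothetical in-memory stand-in for the durable OperationRun store."""

    def __init__(self) -> None:
        self.summary_counts: dict[int, dict[str, int]] = {}

    def record_completion(self, run_id: int, counts: SummaryCounts) -> None:
        # Overwrite, never increment: whichever attempt completes last
        # (success or terminal failure) defines the persisted counters.
        self.summary_counts[run_id] = {
            "total": counts.total,
            "processed": counts.processed,
            "failed": counts.failed,
        }


ledger = RunLedger()
# An earlier attempt persisted counters, then the job was redelivered
# (for example, the worker died before acknowledging the message).
ledger.record_completion(42, SummaryCounts(total=10, processed=4, failed=6))
# The final attempt recomputes its counts and replaces the earlier values.
ledger.record_completion(42, SummaryCounts(total=10, processed=9, failed=1))
print(ledger.summary_counts[42])  # {'total': 10, 'processed': 9, 'failed': 1}
```
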
---
### User Story 2 - Duplicate dispatches do not overlap (Priority: P2)
As an operator, I want assignment jobs to be deduplicated by identity so accidental double dispatch (or queue redelivery) does not cause concurrent overlapping work.

**Why this priority**: Duplicate concurrency increases the risk of conflicting writes, rate limiting, and confusing operational outcomes.

**Independent Test**: Attempt to dispatch the same job identity twice and assert only one execution proceeds while the other is deduped/skipped.

**Acceptance Scenarios**:

1. **Given** an assignment job with a stable identity is dispatched twice in a short window, **When** workers attempt to execute both, **Then** only one execution proceeds (no concurrent overlap) for that identity.
2. **Given** an assignment job with a stable identity completed recently, **When** the same identity is dispatched again within the deduplication duration, **Then** the new execution is skipped/deduped.
3. **Given** the dedupe duration has elapsed since the last terminal completion for an identity, **When** the same identity is legitimately run again, **Then** the new execution is allowed to proceed.
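
Read together with FR-005 and FR-006, these scenarios describe one execution-time check. The Python sketch below assumes an in-memory store and an injectable clock; `AssignmentJobDeduper`, `try_start`, `finish`, and the `run:123` identity are hypothetical names. A real deployment would need a store shared by all workers with an atomic check-and-set, not a process-local lock. In this sketch the in-flight marker also expires after the window, which is why FR-005 requires the duration to cover worst-case runtime: a run that outlives the window could be overlapped by a duplicate.

```python
from __future__ import annotations

import threading
import time
from collections.abc import Callable

DEDUPE_WINDOW_SECONDS = 15 * 60  # FR-005a: 15 minutes


class AssignmentJobDeduper:
    """Execution-time dedupe keyed by a stable job identity (sketch)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._lock = threading.Lock()  # process-local; illustration only
        self._started_at: dict[str, float] = {}
        self._completed_at: dict[str, float] = {}

    def try_start(self, identity: str) -> bool:
        """Return True if this execution may proceed, False if it is deduped."""
        with self._lock:
            now = self._clock()
            started = self._started_at.get(identity)
            # In flight: skip. The marker expires after the window so a
            # crashed worker cannot block the identity forever.
            if started is not None and now - started < DEDUPE_WINDOW_SECONDS:
                return False
            finished = self._completed_at.get(identity)
            # Completed recently and still inside the window: skip.
            if finished is not None and now - finished < DEDUPE_WINDOW_SECONDS:
                return False
            self._started_at[identity] = now
            return True

    def finish(self, identity: str) -> None:
        """Record terminal completion (success or failure) for the identity."""
        with self._lock:
            self._started_at.pop(identity, None)
            self._completed_at[identity] = self._clock()


fake_now = [0.0]
deduper = AssignmentJobDeduper(clock=lambda: fake_now[0])
print(deduper.try_start("run:123"))  # True: first dispatch proceeds
print(deduper.try_start("run:123"))  # False: duplicate while in flight
deduper.finish("run:123")
fake_now[0] = 5 * 60
print(deduper.try_start("run:123"))  # False: 5 minutes after completion
fake_now[0] = 16 * 60
print(deduper.try_start("run:123"))  # True: the window has elapsed
```
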
---
### User Story 3 - Housekeeping runs are tracked like everything else (Priority: P3)
As an operator, I want housekeeping/system jobs to produce the same run tracking as other operations so that the operations ledger is complete.

**Why this priority**: This is an operational-completeness fix: it is not ship-blocking, but it improves consistency and removes blind spots from the operations ledger.

**Independent Test**: Execute a housekeeping run and verify that a run record is created/updated with success/failure outcome details.

**Acceptance Scenarios**:

1. **Given** the reconcile-adapter-runs job executes successfully, **When** the operations ledger is inspected, **Then** a run record exists with a success outcome.
2. **Given** the reconcile-adapter-runs job fails, **When** the operations ledger is inspected, **Then** the run record includes a stable reason code and a sanitized error message suitable for operators.
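
One way to meet FR-008 is to derive the reason code from the failure's type rather than its text, and to scrub the message before it is stored; a code chosen by type stays stable for searching and alerting even when message wording changes. The Python sketch below is illustrative only: the spec does not list the actual codes or redaction rules, so `REASON_CODES`, the `reconcile.*` code strings, the redaction patterns, `describe_failure`, and the 200-character limit are all assumptions.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical reason codes; the real code list is not defined in this spec.
REASON_CODES: dict[type[BaseException], str] = {
    TimeoutError: "reconcile.timeout",
    ConnectionError: "reconcile.upstream_unavailable",
}
FALLBACK_CODE = "reconcile.unexpected_error"
MAX_MESSAGE_LENGTH = 200

# Hypothetical redaction rules: bearer tokens and key=value secrets.
REDACTIONS = [
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [redacted]"),
    (re.compile(r"(?i)\b(password|secret|token|api[_-]?key)=\S+"), r"\1=[redacted]"),
]


@dataclass(frozen=True)
class RunFailure:
    reason_code: str  # stable and searchable; never built from message text
    message: str  # sanitized, operator-facing


def describe_failure(exc: BaseException) -> RunFailure:
    code = next(
        (c for cls, c in REASON_CODES.items() if isinstance(exc, cls)),
        FALLBACK_CODE,
    )
    message = str(exc)
    for pattern, replacement in REDACTIONS:
        message = pattern.sub(replacement, message)
    # Collapse whitespace and cap the length so messages stay compact.
    message = " ".join(message.split())[:MAX_MESSAGE_LENGTH]
    return RunFailure(reason_code=code, message=message)


failure = describe_failure(
    ConnectionError("GET https://api.example.test/sync failed: Bearer abc.def token=xyz")
)
print(failure.reason_code)  # reconcile.upstream_unavailable
print(failure.message)  # GET https://api.example.test/sync failed: Bearer [redacted] token=[redacted]
```
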
---
### User Story 4 - Fresh seed flows work without manual intervention (Priority: P3)
As a developer, I want local and CI seed flows to run successfully so that onboarding and test environments are reproducible.

**Why this priority**: Broken seeding slows development and increases setup variability.

**Independent Test**: Run a clean database reset with seeding and confirm it completes without constraint errors.

**Acceptance Scenarios**:

1. **Given** a clean database, **When** a full reset-and-seed workflow is executed, **Then** it succeeds without requiring manual edits.
2. **Given** seeded tenants exist, **When** their records are validated, **Then** required identifiers are populated, and `external_id` is a UUID v4 string.
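
FR-009a can be generated and checked with Python's standard `uuid` module, shown here as a language-neutral sketch. The helper names `new_external_id` and `is_uuid_v4` are hypothetical, and requiring the canonical lowercase, hyphenated form is an extra assumption beyond the spec's "UUID string (v4)".

```python
from __future__ import annotations

import uuid


def new_external_id() -> str:
    """Generate a seeded tenant external_id as a canonical UUID v4 string."""
    return str(uuid.uuid4())


def is_uuid_v4(value: str) -> bool:
    """True if value is a canonical (lowercase, hyphenated) UUID v4 string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, TypeError, AttributeError):
        return False
    # UUID.version is None unless the variant is RFC 4122, so this
    # single comparison checks both the variant and the version.
    return parsed.version == 4 and str(parsed) == value


seeded_ids = [new_external_id() for _ in range(3)]
print(all(is_uuid_v4(v) for v in seeded_ids))  # True
print(is_uuid_v4("tenant-1"))  # False: not a UUID
print(is_uuid_v4(str(uuid.NAMESPACE_DNS)))  # False: a version-1 UUID
```
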
### Edge Cases
- Retries: If a job attempt fails and retries, summary counters must remain consistent (no double counting).
- Partial failure: If some items fail, the run must record failures without obscuring successful processing.
- Deduplication window: Dedupe should block concurrent overlap but must not prevent legitimate future runs after the window.
- Error reporting: Failure messages must be sanitized and stable enough to support searching and alerting.
## Requirements *(mandatory)*
**Constitution alignment (required):** This feature touches queued/background work and run observability. It must:
- maintain tenant isolation (no cross-tenant leakage in run tracking),
- ensure run observability is complete for the jobs in scope,
- and add automated tests for the changed operational behavior.

**Constitution alignment (RBAC-UX):** No new UI surfaces are added and no authorization behavior is changed. Any existing authorization checks for starting operations remain server-side and unchanged.

**Constitution alignment (Filament Action Surfaces):** Not applicable (no Filament Resources/Pages/RelationManagers are added or modified).
### Functional Requirements
- **FR-001**: The system MUST persist summary counters (total, processed, failed) for assignment fetch runs upon completion.
- **FR-002**: The system MUST persist summary counters (total, processed, failed) for assignment restore runs upon completion.
- **FR-003**: Summary counters MUST be retry-safe: on completion (success or terminal failure), the run's summary counters MUST be written, overwriting any earlier values, so that they reflect the final attempt; retries MUST NOT double-count totals for the same run identity.
- **FR-004**: Assignment jobs MUST enforce server-side deduplication by a stable, non-secret job identity to prevent concurrent overlap.
- **FR-004a**: The job identity MUST be derived as: `operation_run_id` when available; otherwise `tenant_id + job_type + stable input fingerprint`.
- **FR-004b**: The stable input fingerprint MUST be deterministic and MUST NOT include secrets.
- **FR-005**: The deduplication duration MUST cover expected worst-case runtime while allowing legitimate future executions after it expires.
- **FR-005a**: The deduplication duration MUST be 15 minutes.
- **FR-006**: Deduplication MUST be enforced at execution time (not solely by UI gating or caller-side checks).
- **FR-007**: The reconcile-adapter-runs housekeeping job MUST create/update an operational run record each time it executes.
- **FR-007a**: The reconcile-adapter-runs housekeeping job MUST use `OperationRun.type = ops.reconcile_adapter_runs`.
- **FR-008**: On failure, the housekeeping run record MUST include a stable reason code and a sanitized operator-facing error message.
- **FR-009**: The seed workflow MUST populate required tenant identifiers so database constraints are satisfied.
- **FR-009a**: Seeded tenants MUST have `external_id` populated as a UUID string (v4).
- **FR-010**: A clean reset-and-seed workflow MUST succeed without manual intervention.
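
FR-004a and FR-004b together describe a small derivation. The Python sketch below is one deterministic way to compute it, assuming job inputs are a flat, JSON-serializable mapping; `job_identity`, `input_fingerprint`, `SECRET_KEYS`, the `run:` and colon-separated formats, and the `assignments.fetch` job type are illustrative names, not the project's.

```python
from __future__ import annotations

import hashlib
import json
from collections.abc import Mapping
from typing import Any

# Hypothetical denylist; the spec does not enumerate secret-bearing inputs.
SECRET_KEYS = frozenset({"access_token", "client_secret", "password"})


def input_fingerprint(inputs: Mapping[str, Any]) -> str:
    """Deterministic, secret-free SHA-256 fingerprint of job inputs (FR-004b)."""
    # Only top-level keys are filtered; nested secrets would need a deeper walk.
    safe = {k: v for k, v in inputs.items() if k not in SECRET_KEYS}
    # sort_keys and fixed separators yield the same bytes for any key order.
    canonical = json.dumps(safe, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def job_identity(
    job_type: str,
    tenant_id: int,
    inputs: Mapping[str, Any],
    operation_run_id: int | None = None,
) -> str:
    """Dedupe identity per FR-004a: the run id wins when one exists."""
    if operation_run_id is not None:
        return f"run:{operation_run_id}"
    return f"{tenant_id}:{job_type}:{input_fingerprint(inputs)}"


a = job_identity("assignments.fetch", 7, {"policy_id": "p1", "access_token": "s3cret"})
b = job_identity("assignments.fetch", 7, {"access_token": "other", "policy_id": "p1"})
print(a == b)  # True: key order and secret values do not change the identity
print(job_identity("assignments.fetch", 7, {}, operation_run_id=99))  # run:99
```

Because the fingerprint is a hash of already-filtered inputs, the resulting identity is safe to store and log, as the assumptions below require. An allowlist of known, non-secret input keys would be a stricter alternative to the denylist shown here.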
### Assumptions & Dependencies
- The system already has a durable operations ledger concept (run tracking) that can store outcomes and summary counters.
- Assignment fetch/restore jobs already produce item-level totals (or can derive them) without changing business behavior.
- Dedupe is evaluated based on a stable identity that is safe to store and log (no secrets).
- No new UI, routes, or end-user workflows are introduced by this work.
### Key Entities *(include if feature involves data)*
- **Operation Run**: A durable record of a background operation execution, its outcome, and its summary counters.
- **Job Identity**: A stable identifier used to deduplicate concurrent executions of the same logical work.
- **Tenant**: The scope boundary for operational records and seeded data.
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: 100% of completed assignment fetch and restore runs show persisted summary counters (total/processed/failed) in the operations ledger.
- **SC-002**: Duplicate dispatch attempts for the same assignment job identity result in at most one concurrent execution within the deduplication window.
- **SC-003**: 100% of reconcile-adapter-runs executions produce an operational run record with a success or failure outcome.
- **SC-004**: A clean reset-and-seed workflow completes successfully in CI and locally without database constraint failures.