TenantAtlas/specs/194-governance-friction-hardening/research.md

# Research: Governance Friction Hardening and Operator Vocabulary

## Decision: Introduce one narrow governance-action catalog instead of a new governance workflow framework

### Rationale

Spec 194 needs one project-wide, testable source for friction class, reason policy, danger expectation, and canonical vocabulary across actions that already exist on multiple surfaces. The repo already has several concrete governance families: exception decisions, review lifecycle, evidence lifecycle, run triage, finding lifecycle, and tenant lifecycle. That is enough real variance to justify one small derived catalog, but not a new runtime workflow engine.
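As a rough illustration of how small such a derived catalog could be, the sketch below models one entry per action with the four attributes named above. It is written in Python purely for brevity and is not the repo's implementation. Every name in it (`Friction`, `ReasonPolicy`, `GovernanceAction`, `CATALOG`, the action keys) is hypothetical, the `F2` and `F3` classes are assumed (this document only names F0 and F1), and the values assigned to each entry are placeholders rather than the classifications Spec 194 will settle on.

```python
from dataclasses import dataclass
from enum import Enum


class Friction(Enum):
    # F0 and F1 are named in this document; F2 and F3 are assumed here.
    F0 = 0
    F1 = 1
    F2 = 2
    F3 = 3


class ReasonPolicy(Enum):
    NONE = "none"
    OPTIONAL = "optional"
    REQUIRED = "required"


@dataclass(frozen=True)
class GovernanceAction:
    family: str
    label: str            # canonical operator-facing wording
    friction: Friction
    reason: ReasonPolicy
    dangerous: bool       # whether surfaces should use a danger treatment


# Placeholder entries only; the real classifications belong to Spec 194.
CATALOG = {
    "evidence.refresh": GovernanceAction(
        "evidence", "Refresh evidence", Friction.F1, ReasonPolicy.NONE, False
    ),
    "evidence.expire": GovernanceAction(
        "evidence", "Expire snapshot", Friction.F3, ReasonPolicy.REQUIRED, True
    ),
}


def describe(key: str) -> GovernanceAction:
    """Look up the shared semantics for one action key."""
    return CATALOG[key]
```

The catalog only describes actions. It executes nothing, which is what keeps it from turning into a second workflow engine.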

### Alternatives considered

- Keep all semantics page-local and document them only in the spec: rejected because local copy and modal logic would drift again and CI could not enforce the rules.
- Build a full governance action framework with custom builders, registries, and resolvers: rejected because the repo only needs shared semantics, not a second execution engine.

## Decision: Keep existing mutation services and audit loggers as owners of state change

### Rationale

The current services already own the actual lifecycle mutation and most audit logging:

- `FindingExceptionService` for approve, reject, renew, revoke
- `TenantReviewLifecycleService` for publish and archive
- `EvidenceSnapshotService` for refresh and expire
- `OperationRunTriageService` for retry, cancel, and mark investigated
- `FindingWorkflowService` for close and reopen
- `TenantResource` lifecycle helpers plus `WorkspaceAuditLogger` for archive and restore

The narrowest correct implementation is to align UI semantics and extend service inputs or audit metadata only where Spec 194 requires stronger reason propagation.
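One way to picture "extend service inputs or audit metadata" is shown below, in Python for illustration only. `ReviewLifecycleService`, `AuditLog`, `Review`, and the `review.archived` event name are hypothetical stand-ins and do not mirror the real signatures of `TenantReviewLifecycleService` or the repo's audit loggers. The point of the sketch is that the existing service keeps the mutation and simply accepts an optional reason that it forwards into the audit entry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AuditLog:
    """Hypothetical in-memory audit sink used only for this sketch."""
    entries: list = field(default_factory=list)

    def record(self, action: str, subject_id: int, metadata: dict) -> None:
        self.entries.append(
            {"action": action, "subject_id": subject_id, "metadata": metadata}
        )


@dataclass
class Review:
    id: int
    status: str = "draft"


class ReviewLifecycleService:
    """Hypothetical stand-in for an existing lifecycle service."""

    def __init__(self, audit: AuditLog) -> None:
        self.audit = audit

    def archive(self, review: Review, reason: Optional[str] = None) -> Review:
        # The service still owns the state change.
        review.status = "archived"
        # The reason travels with the audit entry instead of living only in the UI.
        metadata = {"reason": reason} if reason else {}
        self.audit.record("review.archived", review.id, metadata)
        return review
```

Because `reason` is optional here, existing callers would keep working, and only families whose policy demands a reason would need to start passing one.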

### Alternatives considered

- Move lifecycle mutations into a new shared governance service layer: rejected because it would duplicate working domain services and add coordination overhead without solving a new business problem.
- Keep reason capture only in UI and not in service-level inputs: rejected because Spec 194 requires reasons to remain audit-visible and not be purely presentational.

## Decision: Treat reason capture as a family contract, not a local modal choice

### Rationale

Current repo behavior is inconsistent:

- The exception family already captures a reason for all four actions (approve, reject, renew, revoke).
- Review publish and archive capture no reason.
- Evidence refresh and expire capture no reason.
- System run triage captures a reason only for `Mark investigated`, not for `Cancel`.
- Finding `Close` captures a reason, but `Reopen` does not.
- Tenant archive and restore capture no reason.

Spec 194 must therefore define reason policy by family and then drive the UI forms and service inputs from that rule.
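A minimal sketch of what "driving the UI forms and service inputs from one rule" could look like follows, again in Python for illustration. The table `REASON_POLICY`, its keys, and the helper names are hypothetical, and the policy values assigned to the run-triage actions are assumptions rather than decisions recorded in this document.

```python
from typing import Optional

# Hypothetical policy table keyed by (family, action); values are assumed.
REASON_POLICY = {
    ("run_triage", "retry"): "none",
    ("run_triage", "cancel"): "required",
    ("run_triage", "mark_investigated"): "required",
}


class ReasonRequired(ValueError):
    """Raised when a required reason is missing or blank."""


def form_needs_reason_field(family: str, action: str) -> bool:
    # The UI asks the policy whether to render a reason field at all.
    return REASON_POLICY[(family, action)] != "none"


def validate_reason(family: str, action: str, reason: Optional[str]) -> Optional[str]:
    # The service layer applies the same policy before mutating state.
    policy = REASON_POLICY[(family, action)]
    cleaned = (reason or "").strip()
    if policy == "required" and not cleaned:
        raise ReasonRequired(f"{family}.{action} requires a reason")
    if policy == "none":
        return None
    return cleaned or None
```

Both the form and the service read the same table, so a modal cannot quietly stop asking for a reason that the service still expects.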

### Alternatives considered

- Leave reason capture to each page owner: rejected because it produced the current inconsistency.
- Force a reason on every action: rejected because it would over-harden actions in the F0 and F1 friction classes and slow operators down without a safety benefit.

## Decision: Distinguish technical refresh from formal governance lifecycle

### Rationale

The repo already shows that similarly placed actions do not have equivalent business meaning:

- `Refresh evidence` is operational regeneration of data.
- `Expire snapshot` formally invalidates a governance artifact.
- `Refresh review` is operational recomputation.
- `Publish review` is a formal release step.
- `Retry` is follow-up work.
- `Cancel` is a stronger intervention.

Spec 194 should therefore classify by business impact, not by whether the action appears in a header or uses the same Filament primitive.
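The sketch below illustrates classifying by business impact only. It is written in Python for brevity; the `Impact` enum and `needs_confirmation` helper are hypothetical, and the rule that interventions and formal steps need confirmation is an assumption made for the example. The impact assigned to each label follows the list above.

```python
from enum import Enum


class Impact(Enum):
    OPERATIONAL = "operational"    # regeneration or recomputation of data
    FOLLOW_UP = "follow_up"        # follow-up work
    INTERVENTION = "intervention"  # stronger intervention
    FORMAL = "formal"              # formal governance lifecycle step


IMPACT_BY_ACTION = {
    "Refresh evidence": Impact.OPERATIONAL,
    "Expire snapshot": Impact.FORMAL,
    "Refresh review": Impact.OPERATIONAL,
    "Publish review": Impact.FORMAL,
    "Retry": Impact.FOLLOW_UP,
    "Cancel": Impact.INTERVENTION,
}


def needs_confirmation(action: str, surface: str) -> bool:
    # `surface` is accepted but deliberately ignored: where the button sits
    # must not change how the action is classified.
    # Assumed rule: interventions and formal steps require confirmation.
    return IMPACT_BY_ACTION[action] in {Impact.INTERVENTION, Impact.FORMAL}
```

Under this shape, `Expire snapshot` gets the same answer on a queue row as on a detail header, because the surface never enters the decision.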

### Alternatives considered

- Classify by surface location: rejected because the same family appears on queue, detail, workspace, and system pages.
- Classify by current button color: rejected because current color usage is part of the inconsistency.

## Decision: Use canonical operator vocabulary per family and prohibit casual synonyms

### Rationale

The same domain effect should not oscillate between verbs. The current repo already has stable families that can be hardened:

- `Approve / Reject`
- `Renew exception / Revoke exception`
- `Publish review / Archive review / Create next review`
- `Refresh evidence / Expire snapshot`
- `Close / Reopen`
- `Retry / Cancel / Mark investigated`
- `Archive / Restore`

Spec 194 should preserve those families and use them consistently in action labels, modal headings, notifications, and audit wording.
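A vocabulary rule like this can be checked mechanically. The Python sketch below is illustrative only: the labels come from the list above, while the family keys and the `non_canonical` helper are hypothetical names introduced for the example.

```python
from typing import Dict, List

# Labels are taken from the families listed above; the keys are assumed.
CANONICAL_LABELS: Dict[str, List[str]] = {
    "exception_decision": ["Approve", "Reject"],
    "exception_lifecycle": ["Renew exception", "Revoke exception"],
    "review": ["Publish review", "Archive review", "Create next review"],
    "evidence": ["Refresh evidence", "Expire snapshot"],
    "finding": ["Close", "Reopen"],
    "run_triage": ["Retry", "Cancel", "Mark investigated"],
    "tenant": ["Archive", "Restore"],
}


def non_canonical(family: str, labels_in_use: List[str]) -> List[str]:
    """Return the labels a surface uses that are not canonical for the family."""
    allowed = set(CANONICAL_LABELS[family])
    return [label for label in labels_in_use if label not in allowed]
```

A surface that labelled the expiry action "Invalidate snapshot" would be reported by `non_canonical("evidence", [...])`, which is the kind of drift the spec wants to prohibit.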

### Alternatives considered

- Allow page-specific synonyms where copy “reads better”: rejected because operator ambiguity is precisely the problem this spec is solving.
- Rename everything to one generic lifecycle lexicon: rejected because different domains still need domain-specific objects and verbs.

## Decision: Keep the new semantics derived and guardable, not persisted

### Rationale

The new friction classes and reason policies are product rules, not new domain records. They do not need their own table or long-lived artifact. A derived catalog plus tests is enough to make the rules explicit, reviewable, and regression-safe.
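To show what "derived catalog plus tests" might mean in practice, here is a small guard-style check in Python. It is a sketch under stated assumptions: the entry shape, the `F2` and `F3` class names, the reason-policy values, and the invariant that a dangerous action must require a reason are all hypothetical and would be replaced by whatever Spec 194 finally mandates.

```python
from typing import Dict, List

ALLOWED_FRICTION = {"F0", "F1", "F2", "F3"}  # F2 and F3 are assumed names
ALLOWED_REASON = {"none", "optional", "required"}  # assumed policy values


def catalog_violations(catalog: List[Dict[str, object]]) -> List[str]:
    """Return human-readable problems; an empty list means the catalog passes."""
    problems: List[str] = []
    seen = set()
    for entry in catalog:
        key = entry["key"]
        if key in seen:
            problems.append(f"{key}: duplicate key")
        seen.add(key)
        if entry["friction"] not in ALLOWED_FRICTION:
            problems.append(f"{key}: unknown friction class")
        if entry["reason"] not in ALLOWED_REASON:
            problems.append(f"{key}: unknown reason policy")
        # Assumed invariant: a dangerous action must require a reason.
        if entry["dangerous"] and entry["reason"] != "required":
            problems.append(f"{key}: dangerous action without a required reason")
    return problems
```

A guard test would assert that this list is empty for the shipped catalog, so a change that weakens a rule fails in CI rather than being noticed in review.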

### Alternatives considered

- Persist the matrix in the database or a user-editable admin screen: rejected because the semantics are part of product behavior, not tenant-owned configuration.
- Leave the matrix only in documentation: rejected because the repo needs an enforceable regression gate.

## Decision: Reuse the existing test layering already proven in this repo

### Rationale

The repo already has the right three layers for Spec 194:

- Guard tests for contract-level invariants
- Focused feature or RBAC tests around concrete surfaces and services
- Browser smoke tests for cross-surface operator flows

This gives durable coverage without overbuilding.

### Alternatives considered

- Browser-test every friction permutation: rejected because service and page tests already cover most of the logic more cheaply.
- Add only a unit test for the catalog: rejected because surface wiring and authorization semantics would remain unverified.

## Decision: Align the highest-risk families first

### Rationale

The strongest current inconsistencies and operator risks are concentrated in:

- Exception decision and lifecycle actions
- Review publication and archival
- Evidence expiry semantics
- System run triage

These should be aligned before lower-risk supporting families such as tenant restore or navigation-adjacent actions.

### Alternatives considered

- Start with the broadest surface rollout: rejected because it would spread effort without first hardening the most consequential actions.
- Start with tenant lifecycle only: rejected because exception, review, evidence, and run triage already carry higher governance importance.