tenantpilot/specs/005-backend-arch-pivot/plan.md

# Implementation Plan: Backend Architecture Pivot

**Feature Branch**: `005-backend-arch-pivot`
**Created**: 2025-12-09
**Spec**: [spec.md](./spec.md)
**Status**: Ready for Implementation

---

## Executive Summary

**Goal**: Migrate from the n8n low-code backend to a code-first TypeScript backend with a BullMQ job queue for policy synchronization.

**Impact**: Removes external n8n dependency, improves maintainability, enables AI-assisted refactoring, and provides foundation for future scheduled sync features.

**Complexity**: HIGH - Requires new infrastructure (Redis, BullMQ), worker process deployment, and careful data transformation logic porting.

---

## Technical Context

### Current Architecture (n8n-based)

```
User clicks "Sync Now"
    ↓
Server Action: triggerPolicySync()
    ↓
HTTP POST → n8n Webhook (N8N_SYNC_WEBHOOK_URL)
    ↓
n8n Workflow:
  1. Microsoft Graph Authentication
  2. Fetch Policies (4 endpoints with pagination)
  3. JavaScript Code Node: Deep Flattening Logic
  4. HTTP POST → TenantPilot Ingestion API
    ↓
API Route: /api/policy-settings (validates POLICY_API_SECRET)
    ↓
Drizzle ORM: Insert/Update policy_settings table
```

**Problems**:
- External dependency (n8n instance required)
- Complex transformation logic hidden in n8n Code Node
- Hard to test, version control, and refactor
- No AI assistance for n8n code
- Additional API security layer needed (POLICY_API_SECRET)

### Target Architecture (BullMQ-based)

```
User clicks "Sync Now"
    ↓
Server Action: triggerPolicySync()
    ↓
BullMQ: Add job to Redis queue "intune-sync-queue"
    ↓
Worker Process (TypeScript):
  1. Microsoft Graph Authentication (@azure/identity)
  2. Fetch Policies (4 endpoints with pagination)
  3. TypeScript: Deep Flattening Logic
  4. Drizzle ORM: Direct Insert/Update
    ↓
Database: policy_settings table
```

**Benefits**:
- No external workflow service (Redis is the only added infrastructure dependency)
- All logic in TypeScript (version-controlled, testable)
- AI-assisted refactoring possible
- Simpler security model (no API bridge)
- Foundation for scheduled syncs

---

## Constitution Check *(mandatory)*

### Compliance Verification

| Principle | Status | Notes |
|-----------|--------|-------|
| **I. Server-First Architecture** | ✅ COMPLIANT | Sync is triggered by a Server Action and executed by a background worker; no client-side fetches |
| **II. TypeScript Strict Mode** | ✅ COMPLIANT | All worker code in TypeScript strict mode, fully typed Graph API responses |
| **III. Drizzle ORM Integration** | ✅ COMPLIANT | Worker uses Drizzle for all DB operations, no raw SQL |
| **IV. Shadcn UI Components** | ✅ COMPLIANT | No UI changes (frontend only triggers job, uses existing components) |
| **V. Azure AD Multi-Tenancy** | ✅ COMPLIANT | Uses existing Azure AD Client Credentials for Graph API access |

### Risk Assessment

**HIGH RISK**: Worker deployment as separate process (requires Docker Compose update, PM2/Systemd config)

**MEDIUM RISK**: Graph API rate limiting handling (needs robust retry logic)

**LOW RISK**: BullMQ integration (well-documented library, standard Redis setup)
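
The retry logic flagged as medium risk above could look roughly like the following. This is a minimal sketch, not the planned `worker/utils/retry.ts`: the option names, the `retryable` marker, and the `retryAfterMs` field are assumptions made for illustration. It doubles the wait after each failed attempt, caps it, and defers to a longer server-provided hint (Microsoft Graph sends a `Retry-After` header with throttled responses).

```typescript
// Sketch only: option names, defaults, and the `retryable` marker are assumptions.
interface RetryOptions {
  maxAttempts: number; // total tries, including the first call
  baseDelayMs: number; // wait before the first retry
  maxDelayMs: number; // cap for any single wait
}

// Hypothetical marker the Graph client would attach to throttled (HTTP 429)
// or transient failures, optionally carrying the Retry-After hint.
interface RetryableFailure {
  retryable: true;
  retryAfterMs?: number;
}

function isRetryable(err: unknown): err is RetryableFailure {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { retryable?: unknown }).retryable === true
  );
}

// Wait before retry number `retry` (1-based): doubles each time, then caps.
function backoffDelayMs(retry: number, opts: RetryOptions): number {
  return Math.min(opts.baseDelayMs * Math.pow(2, retry - 1), opts.maxDelayMs);
}

async function withRetry<T>(
  operation: () => Promise<T>,
  opts: RetryOptions,
  // Injectable so tests can skip real waiting.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Give up on non-retryable errors or once the attempt budget is spent.
      if (!isRetryable(err) || attempt >= opts.maxAttempts) throw err;
      // Honor the server's hint when it asks for a longer wait than the backoff.
      await sleep(Math.max(err.retryAfterMs ?? 0, backoffDelayMs(attempt, opts)));
    }
  }
}
```

Adding random jitter to the delay would be a reasonable refinement if several workers are introduced later.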

### Justification

Architecture pivot necessary to:
1. Remove external n8n dependency (reduces operational complexity)
2. Enable AI-assisted development (TypeScript vs. n8n visual flows)
3. Improve testability (unit/integration tests for worker logic)
4. Prepare for Phase 2 features (scheduled syncs, multi-tenant parallel processing)

**Approved**: Constitution compliance verified, complexity justified by maintainability gains.

---

## File Tree & Changes

```
tenantpilot/
├── .env                             # [MODIFIED] Add REDIS_URL, remove POLICY_API_SECRET + N8N_SYNC_WEBHOOK_URL
├── (Redis provided by deployment)   # No `docker-compose.yml` required; ensure `REDIS_URL` is set by Dokploy
├── package.json                     # [MODIFIED] Add bullmq, ioredis, @azure/identity, tsx dependencies
│
├── lib/
│   ├── env.mjs                      # [MODIFIED] Add REDIS_URL validation, remove POLICY_API_SECRET + N8N_SYNC_WEBHOOK_URL
│   ├── queue/
│   │   ├── redis.ts                 # [NEW] Redis connection for BullMQ
│   │   └── syncQueue.ts             # [NEW] BullMQ Queue definition for "intune-sync-queue"
│   └── actions/
│       └── policySettings.ts        # [MODIFIED] Replace n8n webhook call with BullMQ job creation
│
├── worker/
│   ├── index.ts                     # [NEW] BullMQ Worker entry point
│   ├── jobs/
│   │   ├── syncPolicies.ts          # [NEW] Main sync orchestration logic
│   │   ├── graphAuth.ts             # [NEW] Azure AD token acquisition
│   │   ├── graphFetch.ts            # [NEW] Microsoft Graph API calls with pagination
│   │   ├── policyParser.ts          # [NEW] Deep flattening & transformation logic
│   │   └── dbUpsert.ts              # [NEW] Drizzle ORM upsert operations
│   └── utils/
│       ├── humanizer.ts             # [NEW] Setting ID humanization
│       └── retry.ts                 # [NEW] Exponential backoff retry logic
│
├── app/api/
│   ├── policy-settings/
│   │   └── route.ts                 # [DELETED] n8n ingestion API no longer needed
│   └── admin/
│       └── tenants/
│           └── route.ts             # [DELETED] n8n polling API no longer needed
│
└── specs/005-backend-arch-pivot/
    ├── spec.md                      # ✅ Complete
    ├── plan.md                      # 📝 This file
    ├── technical-notes.md           # ✅ Complete (implementation reference)
    └── tasks.md                     # 🔜 Generated next
```

---

## Phase Breakdown

### Phase 1: Setup & Infrastructure (T001-T008)

**Goal**: Prepare environment, install dependencies, and wire the app to the provisioned Redis instance

**Tasks**:
- T001: Confirm `REDIS_URL` is provided by Dokploy and obtain connection details
- T002-T004: Add `REDIS_URL` to local `.env` (for development) and to `lib/env.mjs` runtime validation
- T005: Install npm packages: `bullmq`, `ioredis`, `@azure/identity`, `tsx`
- T006-T007: Create Redis connection and BullMQ Queue
- T008: Test infrastructure (connect to provided Redis from local/dev environment)

**Deliverables**:
- Connection details for Redis from Dokploy documented
- Environment variables validated (local + deploy)
- Dependencies in `package.json`
- Queue operational using the provided Redis

**Exit Criteria**: `npm run dev` starts without env validation errors and the queue accepts jobs against the provided Redis

---

### Phase 2: Worker Process Skeleton (T009-T014)

**Goal**: Create the worker entry point and basic job processing infrastructure

**Tasks**:
- T009: Create `worker/index.ts` - BullMQ Worker entry point
- T010-T012: Add npm script, event handlers, structured logging
- T013: Create sync orchestration skeleton
- T014: Test worker startup

**Deliverables**:
- Worker process can be started via `npm run worker:start`
- Jobs flow from queue → worker
- Event logging operational

**Exit Criteria**: Worker logs "Processing job X" when job is added to queue

---

### Phase 3: Microsoft Graph Integration (T015-T023)

**Goal**: Implement Azure AD authentication and Microsoft Graph API data fetching with pagination

**Tasks**:
- T015-T017: Create `worker/jobs/graphAuth.ts` - Azure AD token acquisition
- T018-T021: Create `worker/jobs/graphFetch.ts` - Fetch from 4 endpoints with pagination
- T022: Create `worker/utils/retry.ts` - Exponential backoff
- T023: Test with real tenant data

**Deliverables**:
- `getGraphAccessToken()` returns valid token
- `fetchAllPolicies()` returns all policies from 4 endpoints
- Pagination handled correctly (follows `@odata.nextLink`)
- Rate limiting handled with retry

**Exit Criteria**: Worker successfully fetches >50 policies for test tenant
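
The pagination deliverable for this phase can be sketched as a loop that keeps requesting pages until the response no longer carries an `@odata.nextLink`. The helper name `fetchAllPages` and the injected `fetchPage` callback are assumptions for illustration; the planned `fetchAllPolicies()` may be structured differently.

```typescript
// Shape of one page of a Graph collection response (only the fields used here).
interface GraphPage<T> {
  value: T[];
  "@odata.nextLink"?: string;
}

// Hypothetical: the HTTP call is injected so the loop can run without network access.
type FetchPage<T> = (url: string) => Promise<GraphPage<T>>;

async function fetchAllPages<T>(
  firstUrl: string,
  fetchPage: FetchPage<T>,
): Promise<T[]> {
  const items: T[] = [];
  let url: string | undefined = firstUrl;
  while (url !== undefined) {
    const page: GraphPage<T> = await fetchPage(url);
    items.push(...page.value);
    // Graph omits @odata.nextLink on the last page, which ends the loop.
    url = page["@odata.nextLink"];
  }
  return items;
}
```

In the worker, `fetchPage` would presumably wrap an authenticated HTTP request, and the function would be called once per endpoint with the results concatenated.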

---

### Phase 4: Data Transformation (T024-T035)

**Goal**: Port n8n flattening logic to TypeScript

**Tasks**:
1. Create `worker/jobs/policyParser.ts` - Policy type detection & routing
2. Implement Settings Catalog parser (`settings[]` → flat key-value)
3. Implement OMA-URI parser (`omaSettings[]` → flat key-value)
4. Create `worker/utils/humanizer.ts` - Setting ID humanization
5. Handle empty policies (default placeholder setting)
6. Test: Parse sample policies, verify output structure

**Deliverables**:
- `parsePolicySettings()` converts Graph response → FlattenedSetting[]
- Humanizer converts technical IDs → readable names
- Empty policies get "(No settings configured)" entry

**Exit Criteria**: 95%+ of sample settings are correctly extracted and formatted
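
As a rough illustration of the flattening described in this phase, the sketch below handles the two policy shapes named in the tasks plus the empty-policy placeholder. The payload interfaces are simplified assumptions, the `FlattenedSetting` field names are hypothetical, and the actual n8n logic being ported may cover more value types (nested children, collections) than shown here.

```typescript
// Hypothetical output shape; the real FlattenedSetting may carry more fields.
interface FlattenedSetting {
  settingName: string;
  settingValue: string;
}

// Simplified, assumed shapes of the two Graph payloads handled in this phase.
interface SettingsCatalogPolicy {
  settings: Array<{
    settingInstance: {
      settingDefinitionId: string;
      simpleSettingValue?: { value: string | number | boolean };
      choiceSettingValue?: { value: string };
    };
  }>;
}

interface OmaUriPolicy {
  omaSettings: Array<{ displayName?: string; omaUri: string; value?: unknown }>;
}

type RawPolicy = Partial<SettingsCatalogPolicy> & Partial<OmaUriPolicy>;

function parsePolicySettings(policy: RawPolicy): FlattenedSetting[] {
  const out: FlattenedSetting[] = [];

  // Settings Catalog: one entry per setting instance.
  for (const { settingInstance: s } of policy.settings ?? []) {
    const value = s.simpleSettingValue?.value ?? s.choiceSettingValue?.value ?? "";
    out.push({ settingName: s.settingDefinitionId, settingValue: String(value) });
  }

  // OMA-URI: fall back to the URI when no display name is present.
  for (const oma of policy.omaSettings ?? []) {
    out.push({
      settingName: oma.displayName ?? oma.omaUri,
      settingValue: String(oma.value ?? ""),
    });
  }

  // Empty policies still get one row (assumed here to go in settingName).
  if (out.length === 0) {
    out.push({ settingName: "(No settings configured)", settingValue: "" });
  }
  return out;
}
```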

---

### Phase 5: Database Persistence (T036-T043)

**Goal**: Implement Drizzle ORM upsert logic

**Tasks**:
1. Create `worker/jobs/dbUpsert.ts` - Batch upsert with conflict resolution
2. Use existing `policy_settings` table schema
3. Leverage `policy_settings_upsert_unique` constraint (tenantId + graphPolicyId + settingName)
4. Update `lastSyncedAt` on every sync
5. Test: Run full sync, verify data in DB

**Deliverables**:
- `upsertPolicySettings()` inserts new & updates existing settings
- No duplicate settings created
- `lastSyncedAt` updated correctly

**Exit Criteria**: Full sync for test tenant completes successfully, data visible in DB
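
To make the intended upsert semantics concrete, here is a database-free model of them: rows are keyed by the three columns of the `policy_settings_upsert_unique` constraint, an existing key is overwritten rather than duplicated, and `lastSyncedAt` is refreshed on every write. This is only a model; the real `upsertPolicySettings()` would presumably express the same behavior through Drizzle's `onConflictDoUpdate` against PostgreSQL. The row shape and the helper names are assumptions.

```typescript
// Assumed subset of the policy_settings columns relevant to the upsert.
interface PolicySettingRow {
  tenantId: string;
  graphPolicyId: string;
  settingName: string;
  settingValue: string;
  lastSyncedAt: Date;
}

type IncomingSetting = Omit<PolicySettingRow, "lastSyncedAt">;

// Mirrors the unique constraint; JSON encoding keeps the composite key unambiguous.
function conflictKey(row: IncomingSetting): string {
  return JSON.stringify([row.tenantId, row.graphPolicyId, row.settingName]);
}

function upsertInMemory(
  table: Map<string, PolicySettingRow>,
  incoming: IncomingSetting[],
  now: Date,
): { inserted: number; updated: number } {
  let inserted = 0;
  let updated = 0;
  for (const row of incoming) {
    const key = conflictKey(row);
    if (table.has(key)) {
      updated++;
    } else {
      inserted++;
    }
    // Insert or overwrite; lastSyncedAt is refreshed in both cases.
    table.set(key, { ...row, lastSyncedAt: now });
  }
  return { inserted, updated };
}
```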

---

### Phase 6: Frontend Integration (T044-T051)

**Goal**: Replace n8n webhook with BullMQ job creation

**Tasks**:
1. Modify `lib/actions/policySettings.ts` → `triggerPolicySync()`
2. Remove n8n webhook call (`fetch(env.N8N_SYNC_WEBHOOK_URL)`)
3. Replace with BullMQ job creation (`syncQueue.add(...)`)
4. Return job ID to frontend
5. Test: Click "Sync Now", verify job created & processed

**Deliverables**:
- "Sync Now" button triggers BullMQ job
- User sees immediate feedback (no blocking)
- Worker processes job in background

**Exit Criteria**: End-to-end sync works from UI → Queue → Worker → DB
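
A possible shape for the reworked `triggerPolicySync()` is sketched below. To keep the example self-contained, the queue is passed in through a small interface (`SyncQueueLike`, a hypothetical name) that is intended to be structurally compatible with the `add` method of a BullMQ `Queue`; the actual Server Action would more likely import the queue from `lib/queue/syncQueue.ts`. The retry options and the result type are assumptions, while the job name `sync-tenant` is taken from the target flow diagram.

```typescript
// Minimal structural view of the queue, for illustration only.
interface SyncQueueLike {
  add(
    name: string,
    data: { tenantId: string },
    opts?: { attempts?: number; backoff?: { type: string; delay: number } },
  ): Promise<{ id?: string }>;
}

// Hypothetical result type returned to the UI.
type SyncResult =
  | { success: true; jobId: string | undefined }
  | { success: false; error: string };

async function triggerPolicySync(
  tenantId: string,
  queue: SyncQueueLike,
): Promise<SyncResult> {
  if (tenantId.trim() === "") {
    return { success: false, error: "Missing tenant ID" };
  }
  try {
    // Enqueue and return immediately; the worker does the actual sync later.
    const job = await queue.add(
      "sync-tenant",
      { tenantId },
      { attempts: 3, backoff: { type: "exponential", delay: 1000 } },
    );
    return { success: true, jobId: job.id };
  } catch (err) {
    const message = err instanceof Error ? err.message : "Failed to enqueue sync job";
    return { success: false, error: message };
  }
}
```

Because the function only enqueues, the UI can show its "Sync started" toast as soon as the job ID comes back.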

---

### Phase 7: Legacy Cleanup (T052-T056)

**Goal**: Remove all n8n-related code and configuration

**Tasks**:
1. Delete `app/api/policy-settings/route.ts` (n8n ingestion API)
2. Delete `app/api/admin/tenants/route.ts` (n8n polling API)
3. Remove `POLICY_API_SECRET` from `.env` and `lib/env.mjs`
4. Remove `N8N_SYNC_WEBHOOK_URL` from `.env` and `lib/env.mjs`
5. Grep search for remaining references (should be 0)
6. Update documentation (remove n8n setup instructions)

**Deliverables**:
- No n8n-related files in codebase
- No n8n-related env vars
- Clean grep search results

**Exit Criteria**: `grep -r --exclude-dir=specs "POLICY_API_SECRET\|N8N_SYNC_WEBHOOK_URL" .` returns 0 results

---

### Phase 8: Testing & Validation (T057-T061)

**Goal**: Comprehensive testing of new architecture

**Tasks**:
1. Unit tests for flattening logic
2. Integration tests for worker jobs
3. End-to-end test: UI → Queue → Worker → DB
4. Load test: 100+ policies sync
5. Error handling test: Graph API failures, Redis unavailable
6. Memory leak test: Worker runs 1+ hour with 10+ jobs

**Deliverables**:
- Test suite with >80% coverage for worker code
- All edge cases verified
- Performance benchmarks met (SC-001 to SC-008)

**Exit Criteria**: All tests pass, no regressions in existing features

---

### Phase 9: Deployment (T062-T066)

**Goal**: Deploy worker process to production

**Tasks**:
1. Ensure `REDIS_URL` is set in the production environment (provided by Dokploy; no Docker Compose Redis required)
2. Configure worker as background service (PM2, Systemd, or Docker)
3. Monitor worker logs for first production sync
4. Verify sync completes successfully
5. Document worker deployment process

**Deliverables**:
- Worker running as persistent service
- Redis accessible from worker
- Production sync successful

**Exit Criteria**: Production sync works end-to-end, no errors in logs

---

## Key Technical Decisions

### 1. BullMQ vs. Other Queue Libraries

**Decision**: Use BullMQ

**Rationale**:
- Modern, actively maintained (vs. Kue, Bull)
- TypeScript-first design
- Built-in retry, rate limiting, priority queues
- Excellent documentation
- Redis-based (simpler than RabbitMQ/Kafka)

**Alternatives Considered**:
- **Bee-Queue**: Lighter but fewer features
- **Agenda**: MongoDB-based (adds extra dependency)
- **AWS SQS**: Vendor lock-in, requires AWS setup

---

### 2. Worker Process Architecture

**Decision**: Single worker process, sequential job processing (concurrency: 1)

**Rationale**:
- Simpler implementation (no race conditions)
- Microsoft Graph rate limits per tenant
- Database upsert logic easier without concurrency
- Can scale later if needed (multiple workers)

**Alternatives Considered**:
- **Parallel Processing**: Higher complexity, potential conflicts
- **Lambda/Serverless**: Cold starts, harder debugging

---

### 3. Token Acquisition Strategy

**Decision**: Use `@azure/identity` ClientSecretCredential

**Rationale**:
- Official Microsoft library
- Handles token refresh automatically
- TypeScript support
- Simpler than manual OAuth flow

**Alternatives Considered**:
- **Manual fetch()**: More code, error-prone
- **MSAL Node**: Overkill for server-side client credentials

---

### 4. Flattening Algorithm

**Decision**: Port n8n logic 1:1 initially, refactor later

**Rationale**:
- Minimize risk (proven logic)
- Faster migration (no re-design needed)
- Can optimize in Phase 2 after validation

**Alternatives Considered**:
- **Re-design from scratch**: Higher risk, longer timeline

---

### 5. Database Schema Changes

**Decision**: No schema changes needed

**Rationale**:
- Existing `policy_settings` table has required fields
- UNIQUE constraint already supports upsert logic
- `lastSyncedAt` field exists for tracking

**Alternatives Considered**:
- **Add job tracking table**: Overkill for MVP (BullMQ handles this)

---

## Data Flow Diagrams

### Current Flow (n8n)

```mermaid
sequenceDiagram
    participant User
    participant UI as Next.js UI
    participant SA as Server Action
    participant n8n as n8n Webhook
    participant API as Ingestion API
    participant DB as PostgreSQL

    User->>UI: Click "Sync Now"
    UI->>SA: triggerPolicySync(tenantId)
    SA->>n8n: POST /webhook
    n8n->>n8n: Fetch Graph API
    n8n->>n8n: Transform Data
    n8n->>API: POST /api/policy-settings
    API->>API: Validate API Secret
    API->>DB: Insert/Update
    DB-->>API: Success
    API-->>n8n: 200 OK
    n8n-->>SA: 200 OK
    SA-->>UI: Success
    UI-->>User: Toast "Sync started"
```

### Target Flow (BullMQ)

```mermaid
sequenceDiagram
    participant User
    participant UI as Next.js UI
    participant SA as Server Action
    participant Queue as Redis Queue
    participant Worker as Worker Process
    participant Graph as MS Graph API
    participant DB as PostgreSQL

    User->>UI: Click "Sync Now"
    UI->>SA: triggerPolicySync(tenantId)
    SA->>Queue: Add job "sync-tenant"
    Queue-->>SA: Job ID
    SA-->>UI: Success (immediate)
    UI-->>User: Toast "Sync started"

    Note over Worker: Background Processing
    Worker->>Queue: Pick job
    Worker->>Graph: Fetch policies
    Graph-->>Worker: Policy data
    Worker->>Worker: Transform data
    Worker->>DB: Upsert settings
    DB-->>Worker: Success
    Worker->>Queue: Mark job complete
```

---

## Environment Variables

### Changes Required

**Add**:
```bash
REDIS_URL=redis://localhost:6379
```

**Remove**:
```bash
# Delete these lines:
POLICY_API_SECRET=...
N8N_SYNC_WEBHOOK_URL=...
```

### Updated `lib/env.mjs`

```typescript
export const env = createEnv({
  server: {
    DATABASE_URL: z.string().url(),
    NEXTAUTH_SECRET: z.string().min(1),
    NEXTAUTH_URL: z.string().url(),
    AZURE_AD_CLIENT_ID: z.string().min(1),
    AZURE_AD_CLIENT_SECRET: z.string().min(1),
    REDIS_URL: z.string().url(), // ADD THIS
    RESEND_API_KEY: z.string().optional(),
    STRIPE_SECRET_KEY: z.string().optional(),
    // ... other Stripe vars
    // REMOVE: POLICY_API_SECRET
    // REMOVE: N8N_SYNC_WEBHOOK_URL
  },
  client: {},
  runtimeEnv: {
    DATABASE_URL: process.env.DATABASE_URL,
    NEXTAUTH_SECRET: process.env.NEXTAUTH_SECRET,
    NEXTAUTH_URL: process.env.NEXTAUTH_URL,
    AZURE_AD_CLIENT_ID: process.env.AZURE_AD_CLIENT_ID,
    AZURE_AD_CLIENT_SECRET: process.env.AZURE_AD_CLIENT_SECRET,
    REDIS_URL: process.env.REDIS_URL, // ADD THIS
    RESEND_API_KEY: process.env.RESEND_API_KEY,
    STRIPE_SECRET_KEY: process.env.STRIPE_SECRET_KEY,
    // ... other vars
  },
});
```

---

## Testing Strategy

### Unit Tests

**Target Coverage**: 80%+ for worker code

**Files to Test**:
- `worker/utils/humanizer.ts` - Setting ID transformation
- `worker/jobs/policyParser.ts` - Flattening logic
- `worker/utils/retry.ts` - Backoff algorithm

**Example**:
```typescript
describe('humanizeSettingId', () => {
  it('removes vendor prefix', () => {
    expect(humanizeSettingId('device_vendor_msft_policy_config_wifi'))
      .toBe('Wifi');
  });
});
```
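
One implementation that would satisfy the test above is sketched here. It is an assumption about how the humanizer might work, not the ported n8n logic: it strips the `device_vendor_msft_policy_config_` prefix when present, then title-cases the remaining underscore-separated parts.

```typescript
// Assumed vendor prefix, taken from the test example above.
const VENDOR_PREFIX = "device_vendor_msft_policy_config_";

function humanizeSettingId(settingId: string): string {
  // Match the prefix case-insensitively, but keep the rest of the ID as given.
  const withoutPrefix = settingId.toLowerCase().startsWith(VENDOR_PREFIX)
    ? settingId.slice(VENDOR_PREFIX.length)
    : settingId;

  return withoutPrefix
    .split("_")
    .filter((part) => part.length > 0)
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join(" ");
}
```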

---

### Integration Tests

**Target**: Full worker job processing

**Scenario**:
1. Mock Microsoft Graph API responses
2. Add job to queue
3. Verify worker processes job
4. Check database for inserted settings

**Example**:
```typescript
describe('syncPolicies', () => {
  it('fetches and stores policies', async () => {
    await syncPolicies('test-tenant-123');
    const settings = await db.query.policySettings.findMany({
      where: eq(policySettings.tenantId, 'test-tenant-123'),
    });
    expect(settings.length).toBeGreaterThan(0);
  });
});
```

---

### End-to-End Test

**Scenario**:
1. Start Redis + Worker
2. Login to UI
3. Navigate to `/search`
4. Click "Sync Now"
5. Verify:
   - Job created in Redis
   - Worker picks up job
   - Database updated
   - UI shows success message

---

## Rollback Plan

**If migration fails in production**:

1. **Immediate**: Revert to previous Docker image (with n8n integration)
2. **Restore env vars**: Re-add `POLICY_API_SECRET` and `N8N_SYNC_WEBHOOK_URL`
3. **Verify**: n8n webhook accessible, sync works
4. **Post-mortem**: Document failure reason, plan fixes

**Data Safety**: No data loss risk (upsert logic preserves existing data)

---

## Performance Targets

Based on Success Criteria (SC-001 to SC-008):

| Metric | Target | Measurement |
|--------|--------|-------------|
| Job Creation | <200ms | Server Action response time |
| Sync Duration (50 policies) | <30s | Worker job duration |
| Setting Extraction | >95% | Manual validation with sample data |
| Worker Stability | 1+ hour, 10+ jobs | Memory profiling |
| Pagination | 100% | Test with 100+ policies tenant |

---

## Dependencies

### npm Packages

```json
{
  "dependencies": {
    "bullmq": "^5.0.0",
    "ioredis": "^5.3.0",
    "@azure/identity": "^4.0.0"
  },
  "devDependencies": {
    "tsx": "^4.0.0"
  }
}
```

### Infrastructure

- **Redis**: 7.x (via Docker or external service)
- **Node.js**: 20+ (for worker process)

---

## Monitoring & Observability

### Worker Logs

**Format**: Structured JSON logs

**Key Events**:
- Job started: `{ event: "job_start", jobId, tenantId, timestamp }`
- Job completed: `{ event: "job_complete", jobId, duration, settingsCount }`
- Job failed: `{ event: "job_failed", jobId, error, stack }`

**Storage**: Write to file or stdout (captured by Docker/PM2)
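
The three key events could be typed and serialized along these lines. This is a sketch with hypothetical helper names (`formatLogLine`, `logEvent`); it also assumes a `timestamp` is added to every event, whereas the list above only names it for `job_start`.

```typescript
// One variant per key event listed above.
type WorkerLogEvent =
  | { event: "job_start"; jobId: string; tenantId: string }
  | { event: "job_complete"; jobId: string; duration: number; settingsCount: number }
  | { event: "job_failed"; jobId: string; error: string; stack?: string };

// Serializes an event as a single-line JSON string with an ISO timestamp.
function formatLogLine(entry: WorkerLogEvent, now: Date = new Date()): string {
  return JSON.stringify({ ...entry, timestamp: now.toISOString() });
}

// Writes to stdout, where Docker or PM2 can capture it.
function logEvent(entry: WorkerLogEvent): void {
  console.log(formatLogLine(entry));
}
```

Typing the events as a union means a misspelled event name or a missing field is caught at compile time rather than in production logs.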

---

### Health Check Endpoint

**Path**: `/api/worker-health`

**Response**:
```json
{
  "status": "healthy",
  "queue": {
    "waiting": 2,
    "active": 1,
    "completed": 45,
    "failed": 3
  }
}
```

**Use Case**: Monitoring dashboard, uptime checks

---

## Documentation Updates

**Files to Update**:
1. `README.md` - Add worker deployment instructions
2. `DEPLOYMENT.md` - Document Redis setup, worker config
3. `specs/002-manual-policy-sync/` - Mark as superseded by 005

**New Documentation**:
1. `docs/worker-deployment.md` - Step-by-step worker setup
2. `docs/troubleshooting.md` - Common worker issues & fixes

---

## Open Questions & Risks

### Q1: Redis Hosting Strategy

**Question**: Self-hosted Redis or managed service (e.g., Upstash, Redis Cloud)?

**Options**:
- Docker Compose (simple, dev-friendly)
- Upstash (serverless, paid but simple)
- Self-hosted on VPS (more control, more ops)

**Recommendation**: Start with Docker Compose, migrate to managed service if scaling needed

---

### Q2: Worker Deployment Method

**Question**: How to deploy worker in production?

**Options**:
- PM2 (Node process manager)
- Systemd (Linux service)
- Docker container (consistent with app)

**Recommendation**: Docker container (matches Next.js deployment strategy)

---

### Q3: Job Failure Notifications

**Question**: How to notify admins when sync jobs fail?

**Options**:
- Email via Resend (already integrated)
- In-app notification system (Phase 2)
- External monitoring (e.g., Sentry)

**Recommendation**: Start with logs only, add notifications in Phase 2

---

## Success Metrics

| Metric | Target | Status |
|--------|--------|--------|
| n8n dependency removed | Yes | 🔜 |
| All tests passing | 100% | 🔜 |
| Production sync successful | Yes | 🔜 |
| Worker uptime | >99% | 🔜 |
| Zero data loss | Yes | 🔜 |

---

## Timeline Estimate

| Phase | Duration | Dependencies |
|-------|----------|--------------|
| 0. Pre-Implementation | 1h | None |
| 1. Queue Infrastructure | 2h | Phase 0 |
| 2. Graph Integration | 4h | Phase 1 |
| 3. Data Transformation | 6h | Phase 2 |
| 4. Database Persistence | 3h | Phase 3 |
| 5. Frontend Integration | 2h | Phase 4 |
| 6. Legacy Cleanup | 2h | Phase 5 |
| 7. Testing & Validation | 4h | Phases 1-6 |
| 8. Deployment | 3h | Phase 7 |
| **Total** | **~27h** | **~3-4 days** |

---

## Next Steps

1. ✅ Generate `tasks.md` with detailed task breakdown
2. 🔜 Start Phase 0: Install Redis, update env vars
3. 🔜 Implement Phase 1: Queue infrastructure
4. 🔜 Continue through Phase 8: Deployment

---

**Plan Status**: ✅ Ready for Task Generation
**Approved by**: Technical Lead (pending)
**Last Updated**: 2025-12-09