Implementation Plan: Backend Architecture Pivot

Feature Branch: 005-backend-arch-pivot
Created: 2025-12-09
Spec: spec.md
Status: Ready for Implementation


Executive Summary

Goal: Migrate from the n8n low-code backend to a code-first TypeScript backend that uses a BullMQ job queue for policy synchronization.

Impact: Removes external n8n dependency, improves maintainability, enables AI-assisted refactoring, and provides foundation for future scheduled sync features.

Complexity: HIGH - Requires new infrastructure (Redis, BullMQ), worker process deployment, and careful data transformation logic porting.


Technical Context

Current Architecture (n8n-based)

User clicks "Sync Now"
    ↓
Server Action: triggerPolicySync()
    ↓
HTTP POST → n8n Webhook (N8N_SYNC_WEBHOOK_URL)
    ↓
n8n Workflow:
  1. Microsoft Graph Authentication
  2. Fetch Policies (4 endpoints with pagination)
  3. JavaScript Code Node: Deep Flattening Logic
  4. HTTP POST → TenantPilot Ingestion API
    ↓
API Route: /api/policy-settings (validates POLICY_API_SECRET)
    ↓
Drizzle ORM: Insert/Update policy_settings table

Problems:

  • External dependency (n8n instance required)
  • Complex transformation logic hidden in n8n Code Node
  • Hard to test, version control, and refactor
  • No AI assistance for n8n code
  • Additional API security layer needed (POLICY_API_SECRET)

Target Architecture (BullMQ-based)

User clicks "Sync Now"
    ↓
Server Action: triggerPolicySync()
    ↓
BullMQ: Add job to Redis queue "intune-sync-queue"
    ↓
Worker Process (TypeScript):
  1. Microsoft Graph Authentication (@azure/identity)
  2. Fetch Policies (4 endpoints with pagination)
  3. TypeScript: Deep Flattening Logic
  4. Drizzle ORM: Direct Insert/Update
    ↓
Database: policy_settings table

Benefits:

  • No external workflow service (Redis is the only new infrastructure dependency)
  • All logic in TypeScript (version-controlled, testable)
  • AI-assisted refactoring possible
  • Simpler security model (no API bridge)
  • Foundation for scheduled syncs

Constitution Check (mandatory)

Compliance Verification

| Principle | Status | Notes |
| --- | --- | --- |
| I. Server-First Architecture | COMPLIANT | Worker uses Server Actions pattern (background job processing), no client fetches |
| II. TypeScript Strict Mode | COMPLIANT | All worker code in TypeScript strict mode, fully typed Graph API responses |
| III. Drizzle ORM Integration | COMPLIANT | Worker uses Drizzle for all DB operations, no raw SQL |
| IV. Shadcn UI Components | COMPLIANT | No UI changes (frontend only triggers job, uses existing components) |
| V. Azure AD Multi-Tenancy | COMPLIANT | Uses existing Azure AD Client Credentials for Graph API access |

Risk Assessment

HIGH RISK: Worker deployment as separate process (requires Docker Compose update, PM2/Systemd config)

MEDIUM RISK: Graph API rate limiting handling (needs robust retry logic)

LOW RISK: BullMQ integration (well-documented library, standard Redis setup)
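
The retry logic named under the medium risk could look roughly like the sketch below. It is a minimal illustration of exponential backoff, not the planned worker/utils/retry.ts: the names `RetryOptions`, `backoffDelay`, and `withRetry`, as well as the default delays, are assumptions. The `sleep` parameter is injectable so the behavior can be tested without real waiting.

```typescript
// Hypothetical helper: all names and values here are assumptions for illustration.
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first one
  baseDelayMs: number; // delay before the second attempt
  maxDelayMs: number;  // upper bound for any single delay
}

// Delay before retry number `retry` (0-based): base * 2^retry, capped at maxDelayMs.
function backoffDelay(retry: number, opts: RetryOptions): number {
  return Math.min(opts.baseDelayMs * 2 ** retry, opts.maxDelayMs);
}

// Runs `fn` until it resolves or maxAttempts is exhausted, sleeping between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: RetryOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Do not sleep after the final failed attempt.
      if (attempt < opts.maxAttempts - 1) {
        await sleep(backoffDelay(attempt, opts));
      }
    }
  }
  throw lastError;
}
```

A production version would probably also distinguish retryable errors (for example HTTP 429 or 503) from permanent ones and honor a `Retry-After` header when Microsoft Graph sends one. This sketch retries on every error.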

Justification

Architecture pivot necessary to:

  1. Remove external n8n dependency (reduces operational complexity)
  2. Enable AI-assisted development (TypeScript vs. n8n visual flows)
  3. Improve testability (unit/integration tests for worker logic)
  4. Prepare for Phase 2 features (scheduled syncs, multi-tenant parallel processing)

Approved: Constitution compliance verified, complexity justified by maintainability gains.


File Tree & Changes

tenantpilot/
├── .env                             # [MODIFIED] Add REDIS_URL, remove POLICY_API_SECRET + N8N_SYNC_WEBHOOK_URL
├── (Redis provided by deployment)   # No `docker-compose.yml` required; ensure `REDIS_URL` is set by Dokploy
├── package.json                     # [MODIFIED] Add bullmq, ioredis, @azure/identity, tsx dependencies
│
├── lib/
│   ├── env.mjs                      # [MODIFIED] Add REDIS_URL validation, remove POLICY_API_SECRET + N8N_SYNC_WEBHOOK_URL
│   ├── queue/
│   │   ├── redis.ts                 # [NEW] Redis connection for BullMQ
│   │   └── syncQueue.ts             # [NEW] BullMQ Queue definition for "intune-sync-queue"
│   └── actions/
│       └── policySettings.ts        # [MODIFIED] Replace n8n webhook call with BullMQ job creation
│
├── worker/
│   ├── index.ts                     # [NEW] BullMQ Worker entry point
│   ├── jobs/
│   │   ├── syncPolicies.ts          # [NEW] Main sync orchestration logic
│   │   ├── graphAuth.ts             # [NEW] Azure AD token acquisition
│   │   ├── graphFetch.ts            # [NEW] Microsoft Graph API calls with pagination
│   │   ├── policyParser.ts          # [NEW] Deep flattening & transformation logic
│   │   └── dbUpsert.ts              # [NEW] Drizzle ORM upsert operations
│   └── utils/
│       ├── humanizer.ts             # [NEW] Setting ID humanization
│       └── retry.ts                 # [NEW] Exponential backoff retry logic
│
├── app/api/
│   ├── policy-settings/
│   │   └── route.ts                 # [DELETED] n8n ingestion API no longer needed
│   └── admin/
│       └── tenants/
│           └── route.ts             # [DELETED] n8n polling API no longer needed
│
└── specs/005-backend-arch-pivot/
    ├── spec.md                      # ✅ Complete
    ├── plan.md                      # 📝 This file
    ├── technical-notes.md           # ✅ Complete (implementation reference)
    └── tasks.md                     # 🔜 Generated next

Phase Breakdown

Phase 1: Setup & Infrastructure (T001-T008)

Goal: Prepare environment, install dependencies, and wire the app to the provisioned Redis instance

Tasks:

  • T001: Confirm REDIS_URL is provided by Dokploy and obtain connection details
  • T002-T004: Add REDIS_URL to local .env (for development) and to lib/env.mjs runtime validation
  • T005: Install npm packages: bullmq, ioredis, @azure/identity, tsx
  • T006-T007: Create Redis connection and BullMQ Queue
  • T008: Test infrastructure (connect to provided Redis from local/dev environment)

Deliverables:

  • Connection details for Redis from Dokploy documented
  • Environment variables validated (local + deploy)
  • Dependencies in package.json
  • Queue operational using the provided Redis

Exit Criteria: npm run dev starts without env validation errors and the queue accepts jobs against the provided Redis


Phase 2: Worker Process Skeleton (T009-T014)

Goal: Set up the worker entry point and basic job processing infrastructure

Tasks:

  • T009: Create worker/index.ts - BullMQ Worker entry point
  • T010-T012: Add npm script, event handlers, structured logging
  • T013: Create sync orchestration skeleton
  • T014: Test worker startup

Deliverables:

  • Worker process can be started via npm run worker:start
  • Jobs flow from queue → worker
  • Event logging operational

Exit Criteria: Worker logs "Processing job X" when job is added to queue


Phase 3: Microsoft Graph Integration (T015-T023)

Goal: Implement Azure AD authentication and Microsoft Graph API data fetching with pagination

Tasks:

  • T015-T017: Create worker/jobs/graphAuth.ts - Azure AD token acquisition
  • T018-T021: Create worker/jobs/graphFetch.ts - Fetch from 4 endpoints with pagination
  • T022: Create worker/utils/retry.ts - Exponential backoff
  • T023: Test with real tenant data

Deliverables:

  • getGraphAccessToken() returns valid token
  • fetchAllPolicies() returns all policies from 4 endpoints
  • Pagination handled correctly (follows @odata.nextLink)
  • Rate limiting handled with retry

Exit Criteria: Worker successfully fetches >50 policies for test tenant
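
The pagination deliverable above can be illustrated with a small loop that follows `@odata.nextLink` until it is absent. This is a sketch under assumptions: `fetchAllPages` and the injected `getPage` function are hypothetical names, and `getPage` stands in for whatever authenticated GET the worker ends up using.

```typescript
// One page of a Graph collection response (only the fields used here).
interface GraphPage<T> {
  value: T[];
  "@odata.nextLink"?: string;
}

// `getPage` is a hypothetical injected function that performs the authenticated request.
async function fetchAllPages<T>(
  firstUrl: string,
  getPage: (url: string) => Promise<GraphPage<T>>,
): Promise<T[]> {
  const items: T[] = [];
  let url: string | undefined = firstUrl;
  while (url) {
    const page: GraphPage<T> = await getPage(url);
    items.push(...page.value);
    // The last page has no nextLink, which ends the loop.
    url = page["@odata.nextLink"];
  }
  return items;
}
```

Injecting `getPage` keeps the loop independent of the HTTP client, so it can be unit tested with canned pages and wrapped with retry logic separately.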


Phase 4: Data Transformation (T024-T035)

Goal: Port n8n flattening logic to TypeScript

Tasks:

  1. Create worker/jobs/policyParser.ts - Policy type detection & routing
  2. Implement Settings Catalog parser (settings[] → flat key-value)
  3. Implement OMA-URI parser (omaSettings[] → flat key-value)
  4. Create worker/utils/humanizer.ts - Setting ID humanization
  5. Handle empty policies (default placeholder setting)
  6. Test: Parse sample policies, verify output structure

Deliverables:

  • parsePolicySettings() converts Graph response → FlattenedSetting[]
  • Humanizer converts technical IDs → readable names
  • Empty policies get "(No settings configured)" entry

Exit Criteria: 95%+ of sample settings are correctly extracted and formatted
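
To make the flattening target concrete, here is a minimal sketch of how Settings Catalog and OMA-URI entries might be reduced to flat key-value pairs, including the placeholder for empty policies. The input shapes are deliberately simplified assumptions (real Graph payloads are more deeply nested), and the `FlattenedSetting` field names are hypothetical. It does not reproduce the n8n logic being ported.

```typescript
// Assumed, simplified shapes for illustration only.
interface FlattenedSetting {
  settingName: string;
  settingValue: string;
}

interface CatalogSetting {
  settingInstance: {
    settingDefinitionId: string;
    simpleSettingValue?: { value: string | number | boolean };
    choiceSettingValue?: { value: string };
  };
}

interface OmaSetting {
  displayName: string;
  omaUri: string;
  value: string | number | boolean;
}

interface RawPolicy {
  settings?: CatalogSetting[];
  omaSettings?: OmaSetting[];
}

const EMPTY_PLACEHOLDER_NAME = "(No settings configured)";

function parsePolicySettings(policy: RawPolicy): FlattenedSetting[] {
  const out: FlattenedSetting[] = [];

  // Settings Catalog: settings[] -> one entry per setting instance.
  for (const s of policy.settings ?? []) {
    const inst = s.settingInstance;
    const raw = inst.simpleSettingValue?.value ?? inst.choiceSettingValue?.value ?? "";
    out.push({ settingName: inst.settingDefinitionId, settingValue: String(raw) });
  }

  // OMA-URI: omaSettings[] -> one entry per custom setting.
  for (const o of policy.omaSettings ?? []) {
    out.push({ settingName: o.displayName || o.omaUri, settingValue: String(o.value) });
  }

  // Empty policies still produce one row so they remain visible downstream.
  return out.length > 0 ? out : [{ settingName: EMPTY_PLACEHOLDER_NAME, settingValue: "" }];
}
```

Where the placeholder text lives (name or value) is an assumption here. The ported logic should keep whatever the existing n8n node emits.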


Phase 5: Database Persistence (T036-T043)

Goal: Implement Drizzle ORM upsert logic

Tasks:

  1. Create worker/jobs/dbUpsert.ts - Batch upsert with conflict resolution
  2. Use existing policy_settings table schema
  3. Leverage policy_settings_upsert_unique constraint (tenantId + graphPolicyId + settingName)
  4. Update lastSyncedAt on every sync
  5. Test: Run full sync, verify data in DB

Deliverables:

  • upsertPolicySettings() inserts new & updates existing settings
  • No duplicate settings created
  • lastSyncedAt updated correctly

Exit Criteria: Full sync for test tenant completes successfully, data visible in DB
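
One detail worth handling before a batch upsert: PostgreSQL rejects an `INSERT ... ON CONFLICT DO UPDATE` statement if the same conflict key appears twice in a single batch. The sketch below deduplicates rows on the three constraint columns and splits them into batches. `SettingRow`, `upsertKey`, `dedupeRows`, and `chunk` are hypothetical names, and the row shape is an assumption rather than the actual policy_settings schema.

```typescript
// Assumed row shape; the real table has more columns.
interface SettingRow {
  tenantId: string;
  graphPolicyId: string;
  settingName: string;
  settingValue: string;
}

// Key mirrors the columns of the unique constraint named in the plan.
function upsertKey(row: SettingRow): string {
  return JSON.stringify([row.tenantId, row.graphPolicyId, row.settingName]);
}

// Keeps one row per key; a later duplicate overwrites the earlier value.
function dedupeRows(rows: SettingRow[]): SettingRow[] {
  const byKey = new Map<string, SettingRow>();
  for (const row of rows) {
    byKey.set(upsertKey(row), row);
  }
  return Array.from(byKey.values());
}

// Splits items into batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch returned by `chunk(dedupeRows(rows), n)` could then be passed to a single Drizzle insert with a conflict clause. The batch size is a tuning choice that the plan does not specify.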


Phase 6: Frontend Integration (T044-T051)

Goal: Replace n8n webhook with BullMQ job creation

Tasks:

  1. Modify triggerPolicySync() in lib/actions/policySettings.ts
  2. Remove n8n webhook call (fetch(env.N8N_SYNC_WEBHOOK_URL))
  3. Replace with BullMQ job creation (syncQueue.add(...))
  4. Return job ID to frontend
  5. Test: Click "Sync Now", verify job created & processed

Deliverables:

  • "Sync Now" button triggers BullMQ job
  • User sees immediate feedback (no blocking)
  • Worker processes job in background

Exit Criteria: End-to-end sync works from UI → Queue → Worker → DB
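
The replacement of the webhook call with job creation might look like the following sketch. To keep it self-contained, the queue is passed in through a small `QueueLike` interface that matches the general shape of BullMQ's `Queue#add(name, data, opts)`. The interface, the `SyncJobData` payload, and the `SyncResult` type are assumptions; the real Server Action would import the queue from lib/queue/syncQueue.ts instead of taking it as a parameter. The job name "sync-tenant" is taken from the target flow diagram.

```typescript
// Minimal slice of a queue, shaped like BullMQ's Queue#add (assumption for testability).
interface QueueLike<D> {
  add(name: string, data: D, opts?: { jobId?: string }): Promise<{ id?: string }>;
}

// Hypothetical job payload.
interface SyncJobData {
  tenantId: string;
  requestedAt: string;
}

type SyncResult =
  | { success: true; jobId: string }
  | { success: false; error: string };

async function triggerPolicySync(
  queue: QueueLike<SyncJobData>,
  tenantId: string,
): Promise<SyncResult> {
  try {
    const job = await queue.add("sync-tenant", {
      tenantId,
      requestedAt: new Date().toISOString(),
    });
    // Return immediately; the worker processes the job in the background.
    return { success: true, jobId: job.id ?? "unknown" };
  } catch (err) {
    // Typically means Redis is unreachable; surface it to the UI instead of throwing.
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Returning a result object rather than throwing lets the UI show a toast for both outcomes without blocking on the sync itself.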


Phase 7: Legacy Cleanup (T052-T056)

Goal: Remove all n8n-related code and configuration

Tasks:

  1. Delete app/api/policy-settings/route.ts (n8n ingestion API)
  2. Delete app/api/admin/tenants/route.ts (n8n polling API)
  3. Remove POLICY_API_SECRET from .env and lib/env.mjs
  4. Remove N8N_SYNC_WEBHOOK_URL from .env and lib/env.mjs
  5. Grep search for remaining references (should be 0)
  6. Update documentation (remove n8n setup instructions)

Deliverables:

  • No n8n-related files in codebase
  • No n8n-related env vars
  • Clean grep search results

Exit Criteria: grep -r --exclude-dir=specs --exclude-dir=node_modules "POLICY_API_SECRET\|N8N_SYNC_WEBHOOK_URL" . returns 0 results


Phase 8: Testing & Validation (T057-T061)

Goal: Comprehensive testing of new architecture

Tasks:

  1. Unit tests for flattening logic
  2. Integration tests for worker jobs
  3. End-to-end test: UI → Queue → Worker → DB
  4. Load test: 100+ policies sync
  5. Error handling test: Graph API failures, Redis unavailable
  6. Memory leak test: Worker runs 1+ hour with 10+ jobs

Deliverables:

  • Test suite with >80% coverage for worker code
  • All edge cases verified
  • Performance benchmarks met (SC-001 to SC-008)

Exit Criteria: All tests pass, no regressions in existing features


Phase 9: Deployment (T062-T066)

Goal: Deploy worker process to production

Tasks:

  1. Ensure REDIS_URL is set in the production environment (provided by Dokploy; no Docker Compose Redis required)
  2. Configure worker as background service (PM2, Systemd, or Docker)
  3. Monitor worker logs for first production sync
  4. Verify sync completes successfully
  5. Document worker deployment process

Deliverables:

  • Worker running as persistent service
  • Redis accessible from worker
  • Production sync successful

Exit Criteria: Production sync works end-to-end, no errors in logs


Key Technical Decisions

1. BullMQ vs. Other Queue Libraries

Decision: Use BullMQ

Rationale:

  • Modern, actively maintained (vs. Kue, Bull)
  • TypeScript-first design
  • Built-in retry, rate limiting, priority queues
  • Excellent documentation
  • Redis-based (simpler than RabbitMQ/Kafka)

Alternatives Considered:

  • Bee-Queue: Lighter but fewer features
  • Agenda: MongoDB-based (adds extra dependency)
  • AWS SQS: Vendor lock-in, requires AWS setup

2. Worker Process Architecture

Decision: Single worker process, sequential job processing (concurrency: 1)

Rationale:

  • Simpler implementation (no race conditions)
  • Microsoft Graph enforces rate limits per tenant, so sequential processing avoids self-inflicted throttling
  • Database upsert logic easier without concurrency
  • Can scale later if needed (multiple workers)

Alternatives Considered:

  • Parallel Processing: Higher complexity, potential conflicts
  • Lambda/Serverless: Cold starts, harder debugging

3. Token Acquisition Strategy

Decision: Use @azure/identity ClientSecretCredential

Rationale:

  • Official Microsoft library
  • Handles token refresh automatically
  • TypeScript support
  • Simpler than manual OAuth flow

Alternatives Considered:

  • Manual fetch(): More code, error-prone
  • MSAL Node: Overkill for server-side client credentials

4. Flattening Algorithm

Decision: Port n8n logic 1:1 initially, refactor later

Rationale:

  • Minimize risk (proven logic)
  • Faster migration (no re-design needed)
  • Can optimize in Phase 2 after validation

Alternatives Considered:

  • Re-design from scratch: Higher risk, longer timeline

5. Database Schema Changes

Decision: No schema changes needed

Rationale:

  • Existing policy_settings table has required fields
  • UNIQUE constraint already supports upsert logic
  • lastSyncedAt field exists for tracking

Alternatives Considered:

  • Add job tracking table: Overkill for MVP (BullMQ handles this)

Data Flow Diagrams

Current Flow (n8n)

sequenceDiagram
    participant User
    participant UI as Next.js UI
    participant SA as Server Action
    participant n8n as n8n Webhook
    participant API as Ingestion API
    participant DB as PostgreSQL

    User->>UI: Click "Sync Now"
    UI->>SA: triggerPolicySync(tenantId)
    SA->>n8n: POST /webhook
    n8n->>n8n: Fetch Graph API
    n8n->>n8n: Transform Data
    n8n->>API: POST /api/policy-settings
    API->>API: Validate API Secret
    API->>DB: Insert/Update
    DB-->>API: Success
    API-->>n8n: 200 OK
    n8n-->>SA: 200 OK
    SA-->>UI: Success
    UI-->>User: Toast "Sync started"

Target Flow (BullMQ)

sequenceDiagram
    participant User
    participant UI as Next.js UI
    participant SA as Server Action
    participant Queue as Redis Queue
    participant Worker as Worker Process
    participant Graph as MS Graph API
    participant DB as PostgreSQL

    User->>UI: Click "Sync Now"
    UI->>SA: triggerPolicySync(tenantId)
    SA->>Queue: Add job "sync-tenant"
    Queue-->>SA: Job ID
    SA-->>UI: Success (immediate)
    UI-->>User: Toast "Sync started"
    
    Note over Worker: Background Processing
    Worker->>Queue: Pick job
    Worker->>Graph: Fetch policies
    Graph-->>Worker: Policy data
    Worker->>Worker: Transform data
    Worker->>DB: Upsert settings
    DB-->>Worker: Success
    Worker->>Queue: Mark job complete

Environment Variables

Changes Required

Add:

REDIS_URL=redis://localhost:6379

Remove:

# Delete these lines:
POLICY_API_SECRET=...
N8N_SYNC_WEBHOOK_URL=...

Updated lib/env.mjs

export const env = createEnv({
  server: {
    DATABASE_URL: z.string().url(),
    NEXTAUTH_SECRET: z.string().min(1),
    NEXTAUTH_URL: z.string().url(),
    AZURE_AD_CLIENT_ID: z.string().min(1),
    AZURE_AD_CLIENT_SECRET: z.string().min(1),
    REDIS_URL: z.string().url(), // ADD THIS
    RESEND_API_KEY: z.string().optional(),
    STRIPE_SECRET_KEY: z.string().optional(),
    // ... other Stripe vars
    // REMOVE: POLICY_API_SECRET
    // REMOVE: N8N_SYNC_WEBHOOK_URL
  },
  client: {},
  runtimeEnv: {
    DATABASE_URL: process.env.DATABASE_URL,
    NEXTAUTH_SECRET: process.env.NEXTAUTH_SECRET,
    NEXTAUTH_URL: process.env.NEXTAUTH_URL,
    AZURE_AD_CLIENT_ID: process.env.AZURE_AD_CLIENT_ID,
    AZURE_AD_CLIENT_SECRET: process.env.AZURE_AD_CLIENT_SECRET,
    REDIS_URL: process.env.REDIS_URL, // ADD THIS
    RESEND_API_KEY: process.env.RESEND_API_KEY,
    STRIPE_SECRET_KEY: process.env.STRIPE_SECRET_KEY,
    // ... other vars
  },
});

Testing Strategy

Unit Tests

Target Coverage: 80%+ for worker code

Files to Test:

  • worker/utils/humanizer.ts - Setting ID transformation
  • worker/jobs/policyParser.ts - Flattening logic
  • worker/utils/retry.ts - Backoff algorithm

Example:

describe('humanizeSettingId', () => {
  it('removes vendor prefix', () => {
    expect(humanizeSettingId('device_vendor_msft_policy_config_wifi'))
      .toBe('Wifi');
  });
});
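
For reference, one possible implementation that satisfies the test above is sketched here. The prefix list and the title-casing rule are assumptions made for illustration; the actual rules should be taken from the existing n8n logic when it is ported.

```typescript
// Prefixes to strip, longest first. The list is an assumption, not from the plan.
const VENDOR_PREFIXES: string[] = [
  "device_vendor_msft_policy_config_",
  "user_vendor_msft_policy_config_",
  "device_vendor_msft_",
  "user_vendor_msft_",
];

function humanizeSettingId(id: string): string {
  const lower = id.toLowerCase();
  const prefix = VENDOR_PREFIXES.find((p) => lower.startsWith(p));
  const rest = prefix ? id.slice(prefix.length) : id;
  return rest
    .split("_")
    .filter((word) => word.length > 0)
    // Capitalize the first letter of each remaining segment.
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(" ");
}
```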

Integration Tests

Target: Full worker job processing

Scenario:

  1. Mock Microsoft Graph API responses
  2. Add job to queue
  3. Verify worker processes job
  4. Check database for inserted settings

Example:

describe('syncPolicies', () => {
  it('fetches and stores policies', async () => {
    await syncPolicies('test-tenant-123');
    const settings = await db.query.policySettings.findMany({
      where: eq(policySettings.tenantId, 'test-tenant-123'),
    });
    expect(settings.length).toBeGreaterThan(0);
  });
});

End-to-End Test

Scenario:

  1. Start Redis + Worker
  2. Login to UI
  3. Navigate to /search
  4. Click "Sync Now"
  5. Verify:
    • Job created in Redis
    • Worker picks up job
    • Database updated
    • UI shows success message

Rollback Plan

If migration fails in production:

  1. Immediate: Revert to previous Docker image (with n8n integration)
  2. Restore env vars: Re-add POLICY_API_SECRET and N8N_SYNC_WEBHOOK_URL
  3. Verify: n8n webhook accessible, sync works
  4. Post-mortem: Document failure reason, plan fixes

Data Safety: No data loss risk (upsert logic preserves existing data)


Performance Targets

Based on Success Criteria (SC-001 to SC-008):

| Metric | Target | Measurement |
| --- | --- | --- |
| Job Creation | <200ms | Server Action response time |
| Sync Duration (50 policies) | <30s | Worker job duration |
| Setting Extraction | >95% | Manual validation with sample data |
| Worker Stability | 1+ hour, 10+ jobs | Memory profiling |
| Pagination | 100% | Test with 100+ policies tenant |

Dependencies

npm Packages

{
  "dependencies": {
    "bullmq": "^5.0.0",
    "ioredis": "^5.3.0",
    "@azure/identity": "^4.0.0"
  },
  "devDependencies": {
    "tsx": "^4.0.0"
  }
}

Infrastructure

  • Redis: 7.x (via Docker or external service)
  • Node.js: 20+ (for worker process)

Monitoring & Observability

Worker Logs

Format: Structured JSON logs

Key Events:

  • Job started: { event: "job_start", jobId, tenantId, timestamp }
  • Job completed: { event: "job_complete", jobId, duration, settingsCount }
  • Job failed: { event: "job_failed", jobId, error, stack }

Storage: Write to file or stdout (captured by Docker/PM2)
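
A small typed logger is one way to keep these events consistent. The sketch below writes one JSON object per line to stdout. `WorkerEvent`, `formatLogLine`, and `logEvent` are hypothetical names, and adding a timestamp to every event (not only job_start) is an assumption.

```typescript
// Event shapes follow the key events listed above; field types are assumptions.
type WorkerEvent =
  | { event: "job_start"; jobId: string; tenantId: string }
  | { event: "job_complete"; jobId: string; duration: number; settingsCount: number }
  | { event: "job_failed"; jobId: string; error: string; stack?: string };

// Pure formatter: returns a single-line JSON string with an ISO timestamp.
function formatLogLine(e: WorkerEvent, now: Date = new Date()): string {
  return JSON.stringify({ timestamp: now.toISOString(), ...e });
}

// Writes to stdout so Docker or PM2 can capture the output.
function logEvent(e: WorkerEvent): void {
  console.log(formatLogLine(e));
}
```

Usage from a job handler could be as simple as `logEvent({ event: "job_start", jobId, tenantId })`.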


Health Check Endpoint

Path: /api/worker-health

Response:

{
  "status": "healthy",
  "queue": {
    "waiting": 2,
    "active": 1,
    "completed": 45,
    "failed": 3
  }
}

Use Case: Monitoring dashboard, uptime checks
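
The response above could be assembled from the queue's job counts. The sketch below assumes an object exposing `getJobCounts(...types)`, which BullMQ's `Queue` provides; the `CountsSource` interface, the `buildHealthPayload` name, and the "unhealthy" fallback when Redis cannot be reached are assumptions.

```typescript
interface JobCounts {
  waiting: number;
  active: number;
  completed: number;
  failed: number;
}

interface HealthPayload {
  status: "healthy" | "unhealthy";
  queue: JobCounts | null;
}

// Minimal slice of a queue, shaped like BullMQ's Queue#getJobCounts.
interface CountsSource {
  getJobCounts(...types: string[]): Promise<Record<string, number>>;
}

async function buildHealthPayload(queue: CountsSource): Promise<HealthPayload> {
  try {
    const c = await queue.getJobCounts("waiting", "active", "completed", "failed");
    return {
      status: "healthy",
      queue: {
        waiting: c.waiting ?? 0,
        active: c.active ?? 0,
        completed: c.completed ?? 0,
        failed: c.failed ?? 0,
      },
    };
  } catch {
    // Assumption: failing to read counts (for example Redis down) reports unhealthy.
    return { status: "unhealthy", queue: null };
  }
}
```

Note that this reports queue reachability from the web app. It does not prove that a worker process is alive, which would need a separate signal such as a heartbeat.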


Documentation Updates

Files to Update:

  1. README.md - Add worker deployment instructions
  2. DEPLOYMENT.md - Document Redis setup, worker config
  3. specs/002-manual-policy-sync/ - Mark as superseded by 005

New Documentation:

  1. docs/worker-deployment.md - Step-by-step worker setup
  2. docs/troubleshooting.md - Common worker issues & fixes

Open Questions & Risks

Q1: Redis Hosting Strategy

Question: Self-hosted Redis or managed service (e.g., Upstash, Redis Cloud)?

Options:

  • Docker Compose (simple, dev-friendly)
  • Upstash (serverless, paid but simple)
  • Self-hosted on VPS (more control, more ops)

Recommendation: Start with Docker Compose, migrate to managed service if scaling needed


Q2: Worker Deployment Method

Question: How to deploy worker in production?

Options:

  • PM2 (Node process manager)
  • Systemd (Linux service)
  • Docker container (consistent with app)

Recommendation: Docker container (matches Next.js deployment strategy)


Q3: Job Failure Notifications

Question: How to notify admins when sync jobs fail?

Options:

  • Email via Resend (already integrated)
  • In-app notification system (Phase 2)
  • External monitoring (e.g., Sentry)

Recommendation: Start with logs only, add notifications in Phase 2


Success Metrics

| Metric | Target | Status |
| --- | --- | --- |
| n8n dependency removed | Yes | 🔜 |
| All tests passing | 100% | 🔜 |
| Production sync successful | Yes | 🔜 |
| Worker uptime | >99% | 🔜 |
| Zero data loss | Yes | 🔜 |

Timeline Estimate

| Phase | Duration | Dependencies |
| --- | --- | --- |
| 0. Pre-Implementation | 1h | None |
| 1. Queue Infrastructure | 2h | Phase 0 |
| 2. Graph Integration | 4h | Phase 1 |
| 3. Data Transformation | 6h | Phase 2 |
| 4. Database Persistence | 3h | Phase 3 |
| 5. Frontend Integration | 2h | Phase 4 |
| 6. Legacy Cleanup | 2h | Phase 5 |
| 7. Testing & Validation | 4h | Phases 1-6 |
| 8. Deployment | 3h | Phase 7 |
| Total | ~27h | ~3-4 days |

Next Steps

  1. Generate tasks.md with detailed task breakdown
  2. 🔜 Start Phase 0: Install Redis, update env vars
  3. 🔜 Implement Phase 1: Queue infrastructure
  4. 🔜 Continue through Phase 8: Deployment

Plan Status: Ready for Task Generation
Approved by: Technical Lead (pending)
Last Updated: 2025-12-09