Implementation Plan: Backend Architecture Pivot
Feature Branch: 005-backend-arch-pivot
Created: 2025-12-09
Spec: spec.md
Status: Ready for Implementation
Executive Summary
Goal: Migrate from n8n Low-Code backend to TypeScript Code-First backend with BullMQ job queue for Policy synchronization.
Impact: Removes external n8n dependency, improves maintainability, enables AI-assisted refactoring, and provides foundation for future scheduled sync features.
Complexity: HIGH - Requires new infrastructure (Redis, BullMQ), worker process deployment, and careful data transformation logic porting.
Technical Context
Current Architecture (n8n-based)
User clicks "Sync Now"
↓
Server Action: triggerPolicySync()
↓
HTTP POST → n8n Webhook (N8N_SYNC_WEBHOOK_URL)
↓
n8n Workflow:
1. Microsoft Graph Authentication
2. Fetch Policies (4 endpoints with pagination)
3. JavaScript Code Node: Deep Flattening Logic
4. HTTP POST → TenantPilot Ingestion API
↓
API Route: /api/policy-settings (validates POLICY_API_SECRET)
↓
Drizzle ORM: Insert/Update policy_settings table
Problems:
- External dependency (n8n instance required)
- Complex transformation logic hidden in n8n Code Node
- Hard to test, version control, and refactor
- No AI assistance for n8n code
- Additional API security layer needed (POLICY_API_SECRET)
Target Architecture (BullMQ-based)
User clicks "Sync Now"
↓
Server Action: triggerPolicySync()
↓
BullMQ: Add job to Redis queue "intune-sync-queue"
↓
Worker Process (TypeScript):
1. Microsoft Graph Authentication (@azure/identity)
2. Fetch Policies (4 endpoints with pagination)
3. TypeScript: Deep Flattening Logic
4. Drizzle ORM: Direct Insert/Update
↓
Database: policy_settings table
Benefits:
- No external workflow service (Redis is the only new infrastructure dependency)
- All logic in TypeScript (version-controlled, testable)
- AI-assisted refactoring possible
- Simpler security model (no API bridge)
- Foundation for scheduled syncs
Constitution Check (mandatory)
Compliance Verification
| Principle | Status | Notes |
|---|---|---|
| I. Server-First Architecture | ✅ COMPLIANT | Worker uses Server Actions pattern (background job processing), no client fetches |
| II. TypeScript Strict Mode | ✅ COMPLIANT | All worker code in TypeScript strict mode, fully typed Graph API responses |
| III. Drizzle ORM Integration | ✅ COMPLIANT | Worker uses Drizzle for all DB operations, no raw SQL |
| IV. Shadcn UI Components | ✅ COMPLIANT | No UI changes (frontend only triggers job, uses existing components) |
| V. Azure AD Multi-Tenancy | ✅ COMPLIANT | Uses existing Azure AD Client Credentials for Graph API access |
Risk Assessment
HIGH RISK: Worker deployment as separate process (requires Docker Compose update, PM2/Systemd config)
MEDIUM RISK: Graph API rate limiting handling (needs robust retry logic)
LOW RISK: BullMQ integration (well-documented library, standard Redis setup)
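The retry logic called out as a medium risk could look roughly like the sketch below. It is a minimal exponential backoff helper of the kind planned for `worker/utils/retry.ts`. The names `withRetry`, `backoffDelay`, and `RetryOptions` and all default values are assumptions for illustration. Microsoft Graph signals throttling with HTTP 429 and a `Retry-After` header; honoring that header is not shown here.

```typescript
// Hypothetical helper for worker/utils/retry.ts; names and values are illustrative.
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  baseDelayMs: number; // delay before the second attempt
  maxDelayMs: number; // upper bound for any single delay
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real timers
}

// Delay grows as base * 2^(attempt - 1), capped at maxDelayMs.
function backoffDelay(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
}

async function withRetry<T>(
  fn: (attempt: number) => Promise<T>,
  opts: RetryOptions,
  isRetryable: (err: unknown) => boolean = () => true,
): Promise<T> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      // Stop on the final attempt or when the error is not worth retrying.
      if (attempt === opts.maxAttempts || !isRetryable(err)) break;
      await sleep(backoffDelay(attempt, opts.baseDelayMs, opts.maxDelayMs));
    }
  }
  throw lastError;
}
```

Passing `isRetryable` lets the caller retry only on throttling or transient network errors and fail fast on authentication errors.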
Justification
Architecture pivot necessary to:
- Remove external n8n dependency (reduces operational complexity)
- Enable AI-assisted development (TypeScript vs. n8n visual flows)
- Improve testability (unit/integration tests for worker logic)
- Prepare for Phase 2 features (scheduled syncs, multi-tenant parallel processing)
Approved: Constitution compliance verified, complexity justified by maintainability gains.
File Tree & Changes
tenantpilot/
├── .env # [MODIFIED] Add REDIS_URL, remove POLICY_API_SECRET + N8N_SYNC_WEBHOOK_URL
├── (Redis provided by deployment) # No `docker-compose.yml` required; ensure `REDIS_URL` is set by Dokploy
├── package.json # [MODIFIED] Add bullmq, ioredis, @azure/identity, tsx dependencies
│
├── lib/
│ ├── env.mjs # [MODIFIED] Add REDIS_URL validation, remove POLICY_API_SECRET + N8N_SYNC_WEBHOOK_URL
│ ├── queue/
│ │ ├── redis.ts # [NEW] Redis connection for BullMQ
│ │ └── syncQueue.ts # [NEW] BullMQ Queue definition for "intune-sync-queue"
│ └── actions/
│ └── policySettings.ts # [MODIFIED] Replace n8n webhook call with BullMQ job creation
│
├── worker/
│ ├── index.ts # [NEW] BullMQ Worker entry point
│ ├── jobs/
│ │ ├── syncPolicies.ts # [NEW] Main sync orchestration logic
│ │ ├── graphAuth.ts # [NEW] Azure AD token acquisition
│ │ ├── graphFetch.ts # [NEW] Microsoft Graph API calls with pagination
│ │ ├── policyParser.ts # [NEW] Deep flattening & transformation logic
│ │ └── dbUpsert.ts # [NEW] Drizzle ORM upsert operations
│ └── utils/
│ ├── humanizer.ts # [NEW] Setting ID humanization
│ └── retry.ts # [NEW] Exponential backoff retry logic
│
├── app/api/
│ ├── policy-settings/
│ │ └── route.ts # [DELETED] n8n ingestion API no longer needed
│ └── admin/
│ └── tenants/
│ └── route.ts # [DELETED] n8n polling API no longer needed
│
└── specs/005-backend-arch-pivot/
├── spec.md # ✅ Complete
├── plan.md # 📝 This file
├── technical-notes.md # ✅ Complete (implementation reference)
└── tasks.md # 🔜 Generated next
Phase Breakdown
Phase 1: Setup & Infrastructure (T001-T008)
Goal: Prepare environment, install dependencies, and wire the app to the provisioned Redis instance
Tasks:
- T001: Confirm `REDIS_URL` is provided by Dokploy and obtain connection details
- T002-T004: Add `REDIS_URL` to local `.env` (for development) and to `lib/env.mjs` runtime validation
- T005: Install npm packages: `bullmq`, `ioredis`, `@azure/identity`, `tsx`
- T006-T007: Create Redis connection and BullMQ Queue
- T008: Test infrastructure (connect to provided Redis from local/dev environment)
Deliverables:
- Connection details for Redis from Dokploy documented
- Environment variables validated (local + deploy)
- Dependencies in `package.json`
- Queue operational using the provided Redis
Exit Criteria: npm run dev starts without env validation errors and the queue accepts jobs against the provided Redis
Phase 2: Worker Process Skeleton (T009-T014)
Goal: Set up the worker entry point and basic job processing infrastructure
Tasks:
- T009: Create `worker/index.ts` - BullMQ Worker entry point
- T010-T012: Add npm script, event handlers, structured logging
- T013: Create sync orchestration skeleton
- T014: Test worker startup
Deliverables:
- Worker process can be started via `npm run worker:start`
- Jobs flow from queue → worker
- Event logging operational
Exit Criteria: Worker logs "Processing job X" when job is added to queue
Phase 3: Microsoft Graph Integration (T015-T023)
Goal: Implement Azure AD authentication and Microsoft Graph API data fetching with pagination
Tasks:
- T015-T017: Create `worker/jobs/graphAuth.ts` - Azure AD token acquisition
- T018-T021: Create `worker/jobs/graphFetch.ts` - Fetch from 4 endpoints with pagination
- T022: Create `worker/utils/retry.ts` - Exponential backoff
- T023: Test with real tenant data
Deliverables:
- `getGraphAccessToken()` returns valid token
- `fetchAllPolicies()` returns all policies from 4 endpoints
- Pagination handled correctly (follows `@odata.nextLink`)
- Rate limiting handled with retry
Exit Criteria: Worker successfully fetches >50 policies for test tenant
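As a sketch of the pagination requirement in this phase, the loop below follows `@odata.nextLink` until the collection is exhausted. The page fetcher is injected so the logic can be tested without network access. `fetchAllPages`, `PageFetcher`, and the `maxPages` guard are hypothetical names, not part of the existing codebase. In the worker, the fetcher would presumably wrap an authenticated HTTP call.

```typescript
// Minimal shape of a Graph collection response; only the fields used here.
interface GraphPage<T> {
  value: T[];
  "@odata.nextLink"?: string;
}

// Hypothetical page fetcher: in the worker this would call Graph with a bearer token.
type PageFetcher<T> = (url: string) => Promise<GraphPage<T>>;

async function fetchAllPages<T>(
  firstUrl: string,
  fetchPage: PageFetcher<T>,
  maxPages = 1000, // safety guard against a nextLink loop; the value is an assumption
): Promise<T[]> {
  const items: T[] = [];
  let url: string | undefined = firstUrl;
  let pages = 0;
  while (url) {
    pages++;
    if (pages > maxPages) throw new Error(`Pagination exceeded ${maxPages} pages`);
    const page: GraphPage<T> = await fetchPage(url);
    items.push(...page.value);
    // Graph omits @odata.nextLink on the last page, which ends the loop.
    url = page["@odata.nextLink"];
  }
  return items;
}
```

Running this once per endpoint and concatenating the results would give the behavior expected from `fetchAllPolicies()`.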
Phase 4: Data Transformation (T024-T035)
Goal: Port n8n flattening logic to TypeScript
Tasks:
- Create `worker/jobs/policyParser.ts` - Policy type detection & routing
- Implement Settings Catalog parser (`settings[]` → flat key-value)
- Implement OMA-URI parser (`omaSettings[]` → flat key-value)
- Create `worker/utils/humanizer.ts` - Setting ID humanization
- Handle empty policies (default placeholder setting)
- Test: Parse sample policies, verify output structure
Deliverables:
- `parsePolicySettings()` converts Graph response → `FlattenedSetting[]`
- Humanizer converts technical IDs → readable names
- Empty policies get "(No settings configured)" entry
Exit Criteria: 95%+ of sample settings are correctly extracted and formatted
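To make the flattening step concrete, here is a simplified sketch of what `parsePolicySettings()` might do. The input shapes are loosely modeled on Graph responses and are heavily reduced. The `FlattenedSetting` field names and the placement of the placeholder text are assumptions. The actual n8n logic to be ported is not reproduced here and likely handles more value types.

```typescript
// Output row; field names are assumptions, not the real policy_settings schema.
interface FlattenedSetting {
  settingName: string;
  settingValue: string;
}

// Reduced input shapes (assumed); real Graph payloads carry more variants.
interface SettingInstance {
  settingDefinitionId: string;
  simpleSettingValue?: { value: string | number | boolean };
  choiceSettingValue?: { value: string; children?: SettingInstance[] };
}
interface OmaSetting {
  displayName?: string;
  omaUri: string;
  value?: unknown;
}
interface RawPolicy {
  settings?: { settingInstance: SettingInstance }[];
  omaSettings?: OmaSetting[];
}

// Recursively walk a Settings Catalog instance, emitting one row per value.
function flattenInstance(instance: SettingInstance, out: FlattenedSetting[]): void {
  if (instance.simpleSettingValue !== undefined) {
    out.push({ settingName: instance.settingDefinitionId, settingValue: String(instance.simpleSettingValue.value) });
  } else if (instance.choiceSettingValue !== undefined) {
    out.push({ settingName: instance.settingDefinitionId, settingValue: instance.choiceSettingValue.value });
    for (const child of instance.choiceSettingValue.children ?? []) flattenInstance(child, out);
  }
}

function parsePolicySettings(policy: RawPolicy): FlattenedSetting[] {
  const out: FlattenedSetting[] = [];
  for (const entry of policy.settings ?? []) flattenInstance(entry.settingInstance, out);
  for (const oma of policy.omaSettings ?? []) {
    out.push({ settingName: oma.displayName ?? oma.omaUri, settingValue: String(oma.value ?? "") });
  }
  // Empty policies still get one row so they remain visible downstream.
  return out.length > 0 ? out : [{ settingName: "(No settings configured)", settingValue: "" }];
}
```

Keeping the parser a pure function of the Graph payload makes the planned unit tests straightforward, since no queue or database is involved.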
Phase 5: Database Persistence (T036-T043)
Goal: Implement Drizzle ORM upsert logic
Tasks:
- Create `worker/jobs/dbUpsert.ts` - Batch upsert with conflict resolution
- Use existing `policy_settings` table schema
- Leverage `policy_settings_upsert_unique` constraint (tenantId + graphPolicyId + settingName)
- Update `lastSyncedAt` on every sync
- Test: Run full sync, verify data in DB
Deliverables:
- `upsertPolicySettings()` inserts new & updates existing settings
- No duplicate settings created
- `lastSyncedAt` updated correctly
Exit Criteria: Full sync for test tenant completes successfully, data visible in DB
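One detail worth sketching for the batch upsert: PostgreSQL rejects an `INSERT ... ON CONFLICT DO UPDATE` statement that would affect the same row twice. Rows sharing the (tenantId, graphPolicyId, settingName) key therefore need to be collapsed before each batch. The helpers below illustrate that preparation step only; the Drizzle call itself is omitted. The row shape, function names, and batch size are assumptions.

```typescript
// Row shape is an assumption; only the three key columns come from the plan.
interface SettingRow {
  tenantId: string;
  graphPolicyId: string;
  settingName: string;
  settingValue: string;
}

// Composite key mirroring the columns of the unique constraint.
function upsertKey(row: SettingRow): string {
  return JSON.stringify([row.tenantId, row.graphPolicyId, row.settingName]);
}

// Keep the last value seen for each key; order follows first appearance.
function dedupeForUpsert(rows: SettingRow[]): SettingRow[] {
  const byKey = new Map<string, SettingRow>();
  for (const row of rows) byKey.set(upsertKey(row), row);
  return Array.from(byKey.values());
}

// Split rows into fixed-size batches so each INSERT stays reasonably small.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) batches.push(items.slice(i, i + size));
  return batches;
}
```

A caller could then run `chunk(dedupeForUpsert(rows), 500)` and issue one upsert per batch; the batch size of 500 is only an example.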
Phase 6: Frontend Integration (T044-T051)
Goal: Replace n8n webhook with BullMQ job creation
Tasks:
- Modify `lib/actions/policySettings.ts` → `triggerPolicySync()`
- Remove n8n webhook call (`fetch(env.N8N_SYNC_WEBHOOK_URL)`)
- Replace with BullMQ job creation (`syncQueue.add(...)`)
- Return job ID to frontend
- Test: Click "Sync Now", verify job created & processed
Deliverables:
- "Sync Now" button triggers BullMQ job
- User sees immediate feedback (no blocking)
- Worker processes job in background
Exit Criteria: End-to-end sync works from UI → Queue → Worker → DB
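The Server Action change might be sketched as follows. To stay self-contained and testable without Redis, the queue is passed in through a small interface that mirrors the `add(name, data, opts)` method of a BullMQ `Queue`. `SyncQueueLike`, `TriggerResult`, `triggerPolicySyncWith`, and the `attempts: 3` option are assumptions. The job name `sync-tenant` matches the target flow diagram.

```typescript
// Minimal slice of the BullMQ Queue API used here (assumed to match Queue.add).
interface SyncQueueLike {
  add(
    name: string,
    data: { tenantId: string },
    opts?: { attempts?: number },
  ): Promise<{ id?: string }>;
}

// Result shape returned to the frontend; hypothetical.
type TriggerResult =
  | { success: true; jobId: string }
  | { success: false; error: string };

async function triggerPolicySyncWith(queue: SyncQueueLike, tenantId: string): Promise<TriggerResult> {
  if (!tenantId) return { success: false, error: "Missing tenantId" };
  try {
    // Enqueue only; the worker does the actual sync in the background.
    const job = await queue.add("sync-tenant", { tenantId }, { attempts: 3 });
    return { success: true, jobId: job.id ?? "unknown" };
  } catch (err) {
    // Typically means Redis is unreachable; surface a readable message to the UI.
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Because the function returns as soon as the job is enqueued, the UI can show the "Sync started" toast immediately.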
Phase 7: Legacy Cleanup (T052-T056)
Goal: Remove all n8n-related code and configuration
Tasks:
- Delete `app/api/policy-settings/route.ts` (n8n ingestion API)
- Delete `app/api/admin/tenants/route.ts` (n8n polling API)
- Remove `POLICY_API_SECRET` from `.env` and `lib/env.mjs`
- Remove `N8N_SYNC_WEBHOOK_URL` from `.env` and `lib/env.mjs`
- Grep search for remaining references (should be 0)
- Update documentation (remove n8n setup instructions)
Deliverables:
- No n8n-related files in codebase
- No n8n-related env vars
- Clean grep search results
Exit Criteria: grep -r --exclude-dir=specs "POLICY_API_SECRET\|N8N_SYNC_WEBHOOK_URL" . returns 0 results
Phase 8: Testing & Validation (T057-T061)
Goal: Comprehensive testing of new architecture
Tasks:
- Unit tests for flattening logic
- Integration tests for worker jobs
- End-to-end test: UI → Queue → Worker → DB
- Load test: 100+ policies sync
- Error handling test: Graph API failures, Redis unavailable
- Memory leak test: Worker runs 1+ hour with 10+ jobs
Deliverables:
- Test suite with >80% coverage for worker code
- All edge cases verified
- Performance benchmarks met (SC-001 to SC-008)
Exit Criteria: All tests pass, no regressions in existing features
Phase 9: Deployment (T062-T066)
Goal: Deploy worker process to production
Tasks:
- Ensure `REDIS_URL` is set in the production environment (provided by Dokploy; no Docker Compose Redis required)
- Configure worker as background service (PM2, Systemd, or Docker)
- Monitor worker logs for first production sync
- Verify sync completes successfully
- Document worker deployment process
Deliverables:
- Worker running as persistent service
- Redis accessible from worker
- Production sync successful
Exit Criteria: Production sync works end-to-end, no errors in logs
Key Technical Decisions
1. BullMQ vs. Other Queue Libraries
Decision: Use BullMQ
Rationale:
- Modern, actively maintained (vs. Kue, Bull)
- TypeScript-first design
- Built-in retry, rate limiting, priority queues
- Excellent documentation
- Redis-based (simpler than RabbitMQ/Kafka)
Alternatives Considered:
- Bee-Queue: Lighter but fewer features
- Agenda: MongoDB-based (adds extra dependency)
- AWS SQS: Vendor lock-in, requires AWS setup
2. Worker Process Architecture
Decision: Single worker process, sequential job processing (concurrency: 1)
Rationale:
- Simpler implementation (no race conditions)
- Microsoft Graph rate limits apply per tenant, so parallel jobs for one tenant would compete for the same quota
- Database upsert logic easier without concurrency
- Can scale later if needed (multiple workers)
Alternatives Considered:
- Parallel Processing: Higher complexity, potential conflicts
- Lambda/Serverless: Cold starts, harder debugging
3. Token Acquisition Strategy
Decision: Use @azure/identity ClientSecretCredential
Rationale:
- Official Microsoft library
- Handles token refresh automatically
- TypeScript support
- Simpler than manual OAuth flow
Alternatives Considered:
- Manual fetch(): More code, error-prone
- MSAL Node: Overkill for server-side client credentials
4. Flattening Algorithm
Decision: Port n8n logic 1:1 initially, refactor later
Rationale:
- Minimize risk (proven logic)
- Faster migration (no re-design needed)
- Can optimize in Phase 2 after validation
Alternatives Considered:
- Re-design from scratch: Higher risk, longer timeline
5. Database Schema Changes
Decision: No schema changes needed
Rationale:
- Existing `policy_settings` table has required fields
- UNIQUE constraint already supports upsert logic
- `lastSyncedAt` field exists for tracking
Alternatives Considered:
- Add job tracking table: Overkill for MVP (BullMQ handles this)
Data Flow Diagrams
Current Flow (n8n)
sequenceDiagram
participant User
participant UI as Next.js UI
participant SA as Server Action
participant n8n as n8n Webhook
participant API as Ingestion API
participant DB as PostgreSQL
User->>UI: Click "Sync Now"
UI->>SA: triggerPolicySync(tenantId)
SA->>n8n: POST /webhook
n8n->>n8n: Fetch Graph API
n8n->>n8n: Transform Data
n8n->>API: POST /api/policy-settings
API->>API: Validate API Secret
API->>DB: Insert/Update
DB-->>API: Success
API-->>n8n: 200 OK
n8n-->>SA: 200 OK
SA-->>UI: Success
UI-->>User: Toast "Sync started"
Target Flow (BullMQ)
sequenceDiagram
participant User
participant UI as Next.js UI
participant SA as Server Action
participant Queue as Redis Queue
participant Worker as Worker Process
participant Graph as MS Graph API
participant DB as PostgreSQL
User->>UI: Click "Sync Now"
UI->>SA: triggerPolicySync(tenantId)
SA->>Queue: Add job "sync-tenant"
Queue-->>SA: Job ID
SA-->>UI: Success (immediate)
UI-->>User: Toast "Sync started"
Note over Worker: Background Processing
Worker->>Queue: Pick job
Worker->>Graph: Fetch policies
Graph-->>Worker: Policy data
Worker->>Worker: Transform data
Worker->>DB: Upsert settings
DB-->>Worker: Success
Worker->>Queue: Mark job complete
Environment Variables
Changes Required
Add:
REDIS_URL=redis://localhost:6379
Remove:
# Delete these lines:
POLICY_API_SECRET=...
N8N_SYNC_WEBHOOK_URL=...
Updated lib/env.mjs
export const env = createEnv({
server: {
DATABASE_URL: z.string().url(),
NEXTAUTH_SECRET: z.string().min(1),
NEXTAUTH_URL: z.string().url(),
AZURE_AD_CLIENT_ID: z.string().min(1),
AZURE_AD_CLIENT_SECRET: z.string().min(1),
REDIS_URL: z.string().url(), // ADD THIS
RESEND_API_KEY: z.string().optional(),
STRIPE_SECRET_KEY: z.string().optional(),
// ... other Stripe vars
// REMOVE: POLICY_API_SECRET
// REMOVE: N8N_SYNC_WEBHOOK_URL
},
client: {},
runtimeEnv: {
DATABASE_URL: process.env.DATABASE_URL,
NEXTAUTH_SECRET: process.env.NEXTAUTH_SECRET,
NEXTAUTH_URL: process.env.NEXTAUTH_URL,
AZURE_AD_CLIENT_ID: process.env.AZURE_AD_CLIENT_ID,
AZURE_AD_CLIENT_SECRET: process.env.AZURE_AD_CLIENT_SECRET,
REDIS_URL: process.env.REDIS_URL, // ADD THIS
RESEND_API_KEY: process.env.RESEND_API_KEY,
STRIPE_SECRET_KEY: process.env.STRIPE_SECRET_KEY,
// ... other vars
},
});
Testing Strategy
Unit Tests
Target Coverage: 80%+ for worker code
Files to Test:
- `worker/utils/humanizer.ts` - Setting ID transformation
- `worker/jobs/policyParser.ts` - Flattening logic
- `worker/utils/retry.ts` - Backoff algorithm
Example:
describe('humanizeSettingId', () => {
it('removes vendor prefix', () => {
expect(humanizeSettingId('device_vendor_msft_policy_config_wifi'))
.toBe('Wifi');
});
});
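For reference, a minimal `humanizeSettingId` that would satisfy the test above could look like this. Only the `device_vendor_msft_policy_config_` prefix appears in the plan. The `user_vendor_msft_policy_config_` prefix and the title-casing rule are assumptions, and the ported n8n logic may differ.

```typescript
// Prefixes to strip. Only the first one is taken from the plan; the second is assumed.
const VENDOR_PREFIXES = [
  "device_vendor_msft_policy_config_",
  "user_vendor_msft_policy_config_",
];

function humanizeSettingId(settingId: string): string {
  let rest = settingId.toLowerCase();
  for (const prefix of VENDOR_PREFIXES) {
    if (rest.startsWith(prefix)) {
      rest = rest.slice(prefix.length);
      break;
    }
  }
  // Turn snake_case segments into space-separated, capitalized words.
  return rest
    .split("_")
    .filter((part) => part.length > 0)
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join(" ");
}
```

With this sketch, an ID such as `device_vendor_msft_policy_config_wifi_allow_auto_connect` becomes `Wifi Allow Auto Connect`.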
Integration Tests
Target: Full worker job processing
Scenario:
- Mock Microsoft Graph API responses
- Add job to queue
- Verify worker processes job
- Check database for inserted settings
Example:
describe('syncPolicies', () => {
it('fetches and stores policies', async () => {
await syncPolicies('test-tenant-123');
const settings = await db.query.policySettings.findMany({
where: eq(policySettings.tenantId, 'test-tenant-123'),
});
expect(settings.length).toBeGreaterThan(0);
});
});
End-to-End Test
Scenario:
- Start Redis + Worker
- Login to UI
- Navigate to `/search`
- Click "Sync Now"
- Verify:
  - Job created in Redis
  - Worker picks up job
  - Database updated
  - UI shows success message
Rollback Plan
If migration fails in production:
- Immediate: Revert to previous Docker image (with n8n integration)
- Restore env vars: Re-add `POLICY_API_SECRET` and `N8N_SYNC_WEBHOOK_URL`
- Verify: n8n webhook accessible, sync works
- Post-mortem: Document failure reason, plan fixes
Data Safety: No data loss risk (upsert logic preserves existing data)
Performance Targets
Based on Success Criteria (SC-001 to SC-008):
| Metric | Target | Measurement |
|---|---|---|
| Job Creation | <200ms | Server Action response time |
| Sync Duration (50 policies) | <30s | Worker job duration |
| Setting Extraction | >95% | Manual validation with sample data |
| Worker Stability | 1+ hour, 10+ jobs | Memory profiling |
| Pagination | 100% | Test with 100+ policies tenant |
Dependencies
npm Packages
{
"dependencies": {
"bullmq": "^5.0.0",
"ioredis": "^5.3.0",
"@azure/identity": "^4.0.0"
},
"devDependencies": {
"tsx": "^4.0.0"
}
}
Infrastructure
- Redis: 7.x (via Docker or external service)
- Node.js: 20+ (for worker process)
Monitoring & Observability
Worker Logs
Format: Structured JSON logs
Key Events:
- Job started: `{ event: "job_start", jobId, tenantId, timestamp }`
- Job completed: `{ event: "job_complete", jobId, duration, settingsCount }`
- Job failed: `{ event: "job_failed", jobId, error, stack }`
Storage: Write to file or stdout (captured by Docker/PM2)
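A small typed logger along these lines would keep the three events consistent. It writes one JSON object per line to stdout. The names `WorkerEvent`, `formatLogLine`, and `logEvent` are hypothetical, and adding `timestamp` to every event (not only `job_start`) is an assumption.

```typescript
// Discriminated union of the key events listed above.
type WorkerEvent =
  | { event: "job_start"; jobId: string; tenantId: string }
  | { event: "job_complete"; jobId: string; duration: number; settingsCount: number }
  | { event: "job_failed"; jobId: string; error: string; stack?: string };

// Serialize one event as a single JSON line with an ISO 8601 timestamp.
function formatLogLine(entry: WorkerEvent, now: Date = new Date()): string {
  return JSON.stringify({ ...entry, timestamp: now.toISOString() });
}

// Write to stdout so Docker or PM2 can capture the output.
function logEvent(entry: WorkerEvent): void {
  console.log(formatLogLine(entry));
}
```

The union type makes the compiler reject a `job_complete` entry that is missing `duration` or `settingsCount`.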
Health Check Endpoint
Path: /api/worker-health
Response:
{
"status": "healthy",
"queue": {
"waiting": 2,
"active": 1,
"completed": 45,
"failed": 3
}
}
Use Case: Monitoring dashboard, uptime checks
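One way to produce this response is sketched below. BullMQ queues expose `getJobCounts(...types)`, which resolves to a map of counts per job state. The interface here is a reduced stand-in for that method so the sketch runs without Redis. `getWorkerHealth`, the `unhealthy` status, and the `error` field are assumptions beyond what the plan specifies.

```typescript
interface JobCounts {
  waiting: number;
  active: number;
  completed: number;
  failed: number;
}

// Reduced stand-in for the BullMQ Queue method getJobCounts(...types).
interface QueueCountsLike {
  getJobCounts(...types: string[]): Promise<Record<string, number>>;
}

interface WorkerHealth {
  status: "healthy" | "unhealthy";
  queue?: JobCounts;
  error?: string;
}

async function getWorkerHealth(queue: QueueCountsLike): Promise<WorkerHealth> {
  try {
    const counts = await queue.getJobCounts("waiting", "active", "completed", "failed");
    return {
      status: "healthy",
      queue: {
        waiting: counts.waiting ?? 0,
        active: counts.active ?? 0,
        completed: counts.completed ?? 0,
        failed: counts.failed ?? 0,
      },
    };
  } catch (err) {
    // Reaching this branch usually means Redis could not be queried.
    return { status: "unhealthy", error: err instanceof Error ? err.message : String(err) };
  }
}
```

Note that these counts describe the queue, not the worker process itself; a stalled worker would show up only indirectly as a growing `waiting` count.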
Documentation Updates
Files to Update:
- `README.md` - Add worker deployment instructions
- `DEPLOYMENT.md` - Document Redis setup, worker config
- `specs/002-manual-policy-sync/` - Mark as superseded by 005
New Documentation:
- `docs/worker-deployment.md` - Step-by-step worker setup
- `docs/troubleshooting.md` - Common worker issues & fixes
Open Questions & Risks
Q1: Redis Hosting Strategy
Question: Self-hosted Redis or managed service (e.g., Upstash, Redis Cloud)?
Options:
- Docker Compose (simple, dev-friendly)
- Upstash (serverless, paid but simple)
- Self-hosted on VPS (more control, more ops)
Recommendation: Start with Docker Compose, migrate to managed service if scaling needed
Q2: Worker Deployment Method
Question: How to deploy worker in production?
Options:
- PM2 (Node process manager)
- Systemd (Linux service)
- Docker container (consistent with app)
Recommendation: Docker container (matches Next.js deployment strategy)
Q3: Job Failure Notifications
Question: How to notify admins when sync jobs fail?
Options:
- Email via Resend (already integrated)
- In-app notification system (Phase 2)
- External monitoring (e.g., Sentry)
Recommendation: Start with logs only, add notifications in Phase 2
Success Metrics
| Metric | Target | Status |
|---|---|---|
| n8n dependency removed | Yes | 🔜 |
| All tests passing | 100% | 🔜 |
| Production sync successful | Yes | 🔜 |
| Worker uptime | >99% | 🔜 |
| Zero data loss | Yes | 🔜 |
Timeline Estimate
| Phase | Duration | Dependencies |
|---|---|---|
| 0. Pre-Implementation | 1h | None |
| 1. Queue Infrastructure | 2h | Phase 0 |
| 2. Graph Integration | 4h | Phase 1 |
| 3. Data Transformation | 6h | Phase 2 |
| 4. Database Persistence | 3h | Phase 3 |
| 5. Frontend Integration | 2h | Phase 4 |
| 6. Legacy Cleanup | 2h | Phase 5 |
| 7. Testing & Validation | 4h | Phases 1-6 |
| 8. Deployment | 3h | Phase 7 |
| Total | ~27h | ~3-4 days |
Next Steps
- ✅ Generate `tasks.md` with detailed task breakdown
- 🔜 Start Phase 0: Install Redis, update env vars
- 🔜 Implement Phase 1: Queue infrastructure
- 🔜 Continue through Phase 8: Deployment
Plan Status: ✅ Ready for Task Generation
Approved by: Technical Lead (pending)
Last Updated: 2025-12-09