AI Does Everything Synchronously — Users Wait
AI generates request handlers that: send emails (2-5 seconds), process images (3-10 seconds), call external APIs (1-5 seconds), generate PDFs (5-15 seconds), and run data transformations (variable) — all synchronously while the user waits. The response arrives 10-30 seconds later, if it arrives at all (most clients time out at 30 seconds). Every one of these operations belongs in a background job queue.
The rule of thumb: if an operation takes more than 200 milliseconds and is not critical for the immediate response, queue it. The user does not need to wait for the welcome email to send, the avatar to resize, or the analytics to process. They need the response: account created, file uploaded, order placed. The rest happens in the background.
These rules cover: when to use background jobs, job design patterns, retry strategies, dead letter queues, and worker scaling. They apply to any queue system: BullMQ (Node.js + Redis), Celery (Python + Redis/RabbitMQ), Sidekiq (Ruby + Redis), or cloud queues (SQS, Cloud Tasks).
Rule 1: When to Use Background Jobs
The rule: 'Queue any operation that: takes >200ms (email, image processing, PDF generation), calls an external service (payment processing, third-party APIs), can fail independently of the user action (analytics, notifications, sync), or processes large data (CSV import, report generation, bulk operations). The request handler: validates input, persists the core data, enqueues the side effects, and returns immediately.'
For the pattern: 'POST /api/orders: validate order data → save order to DB (status: pending) → enqueue: send confirmation email, process payment, update inventory → return { orderId, status: pending }. Each side effect is a separate job — if email fails, payment still processes. If payment fails, the order status updates to failed. The user gets the orderId immediately.'
AI generates: const order = await db.orders.create(data); await sendEmail(user.email, orderConfirmation); await processPayment(order); await updateInventory(order.items); return res.json(order). Four synchronous calls — if any fails, the entire request fails. If all succeed, the response takes 10+ seconds. Queue and decouple.
- >200ms operation → queue it — email, image, PDF, external API, bulk data
- Request handler: validate → persist core data → enqueue side effects → respond
- Each side effect is a separate job — independent failure and retry
- User gets immediate response — side effects process in background
- If unsure: will the user notice a 1-second delay? No → queue it
If an operation takes >200ms and the user does not need the result immediately: queue it. Email, image resize, PDF generation, external API calls — all belong in a background job, not in the request handler.
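The request-handler pattern above can be sketched end to end. This is an illustrative stand-in (fakeDb, jobQueue, and enqueue are not real library APIs); in production the enqueue call would go to BullMQ, Celery, or SQS:

```javascript
const fakeDb = { orders: [] }; // in-memory stand-in for a real database
const jobQueue = [];           // in-memory stand-in for a real queue client

function enqueue(type, payload) {
  // A real client would serialize this payload to Redis or SQS
  jobQueue.push({ type, payload });
}

function createOrder(data) {
  // 1. Validate input (minimal check for the sketch)
  if (!Array.isArray(data.items) || data.items.length === 0) {
    throw new Error('Order must contain at least one item');
  }
  // 2. Persist the core data synchronously
  const order = {
    id: `ord-${fakeDb.orders.length + 1}`,
    status: 'pending',
    userId: data.userId,
    items: data.items,
  };
  fakeDb.orders.push(order);
  // 3. Enqueue each side effect as a separate job (payloads carry IDs only)
  enqueue('send-confirmation-email', { orderId: order.id, userId: order.userId });
  enqueue('process-payment', { orderId: order.id });
  enqueue('update-inventory', { orderId: order.id });
  // 4. Respond immediately; side effects run in the background
  return { orderId: order.id, status: order.status };
}
```

If the email job fails, the payment job still runs: each side effect retries and fails independently.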
Rule 2: Idempotent, Serializable Job Design
The rule: 'Jobs must be idempotent: processing a job twice produces the same result as processing it once. Jobs must be serializable: the job payload is JSON (no functions, no database connections, no file handles). Include everything the worker needs in the payload: IDs for database lookups, not entire objects. Workers are stateless — they receive a payload, do work, and complete.'
For payload design: '{ type: "send-welcome-email", userId: "abc-123", orderId: "ord-456" } — not { type: "send-email", user: { name: "Alice", email: "alice@example.com", ... } }. The worker looks up the user by ID — getting the current data, not stale data from when the job was created. If the user changed their email between queueing and processing, the worker sends to the updated address.'
For idempotency: 'Before sending the email: check if already sent (has_welcome_email flag or sent_emails table). Before processing payment: check if already processed (payment status). Before updating inventory: check if already deducted (inventory transaction log). The check-then-act pattern makes every job safe to retry — duplicates are no-ops.'
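A minimal check-then-act sketch, assuming in-memory stand-ins for the users table and a sent_emails log. The payload carries only IDs, and a duplicate run becomes a no-op:

```javascript
// In-memory stand-ins for real tables; the IDs are illustrative
const users = new Map([['abc-123', { email: 'alice@example.com' }]]);
const sentEmails = new Set(); // stand-in for a sent_emails table

function sendWelcomeEmail(job) {
  const key = `welcome:${job.userId}`;
  // Check: has this job's work already been done?
  if (sentEmails.has(key)) return 'skipped'; // duplicate run is a no-op
  // Look up by ID: current data, not a stale snapshot from enqueue time
  const user = users.get(job.userId);
  if (!user) throw new Error(`Unknown user ${job.userId}`);
  // Act: the real email delivery call would go here, using user.email
  sentEmails.add(key); // record completion before acknowledging the job
  return 'sent';
}
```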
Rule 3: Retry with Exponential Backoff
The rule: 'Retry failed jobs with exponential backoff: attempt 1 immediately, attempt 2 after 10 seconds, attempt 3 after 1 minute, attempt 4 after 10 minutes, attempt 5 after 1 hour. Maximum 5 attempts (the initial run plus 4 retries). After the final attempt: move to a dead letter queue (DLQ) for manual inspection. Never retry immediately in a loop — it overwhelms the failing service and does not give transient errors time to resolve.'
For BullMQ: 'new Queue("email", { defaultJobOptions: { attempts: 5, backoff: { type: "exponential", delay: 10000 } } }). BullMQ handles the retry schedule automatically — note that its built-in exponential backoff doubles the delay each attempt (10s, 20s, 40s, ...), so a steeper ladder like 10s → 1m → 10m needs a custom backoff strategy on the worker. For Celery: @app.task(bind=True, max_retries=5, default_retry_delay=10). For SQS: configure the visibility timeout and a redrive policy.'
For non-retryable errors: 'Some errors should not be retried: validation errors (the data is wrong — retrying is pointless), 404s (the resource does not exist), and authorization failures (the credentials are wrong). Check the error type and fail these permanently instead of retrying — in BullMQ, throw UnrecoverableError; in Celery, do not call self.retry(). Only retry transient errors: timeouts, rate limits, server errors.'
- Exponential backoff: 10s → 1m → 10m → 1h — give transient errors time to resolve
- Max 5 attempts — then dead letter queue for manual inspection
- Never immediate retry loop — overwhelms the failing service
- Retry transient: timeout, rate limit, 5xx — skip permanent: validation, 404, auth
- BullMQ: attempts + backoff config — Celery: max_retries + retry_delay
Retrying immediately in a loop overwhelms the failing service. Exponential backoff: 10s → 1m → 10m → 1h gives transient errors time to resolve. If the third-party API is rate-limiting you, hammering it faster makes it worse.
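The schedule and the transient/permanent split can be sketched as two plain functions. The names and the status-code list are illustrative assumptions; an explicit delay table is used because the 10s → 1m → 10m → 1h ladder is not a simple doubling:

```javascript
// Illustrative retry helper: explicit delay ladder plus error classification
const RETRY_DELAYS_MS = [10_000, 60_000, 600_000, 3_600_000]; // 10s, 1m, 10m, 1h
const MAX_ATTEMPTS = 5; // initial run plus up to 4 delayed retries

function nextRetryDelay(attemptsMade) {
  // attemptsMade: how many times the job has already failed
  if (attemptsMade >= MAX_ATTEMPTS - 1) return null; // null means: move to DLQ
  return RETRY_DELAYS_MS[attemptsMade];
}

function isRetryable(err) {
  // Retry transient errors only; permanent ones go straight to the DLQ
  const transientStatuses = [408, 429, 500, 502, 503, 504];
  if (err.status) return transientStatuses.includes(err.status);
  return err.code === 'ETIMEDOUT' || err.code === 'ECONNRESET';
}
```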
Rule 4: Dead Letter Queues and Failure Handling
The rule: 'After max retries, move failed jobs to a dead letter queue (DLQ) — never silently discard. The DLQ stores: the original job payload, the error message, the number of attempts, and the failure timestamp. Monitor the DLQ: alert when it has items. Someone should investigate: is the failure transient (requeue after fixing the issue) or permanent (the job needs manual resolution)?'
For the DLQ pattern: 'BullMQ: configure removeOnFail: false + listen to "failed" event after max retries → move to DLQ queue. SQS: redrive policy automatically moves to DLQ after maxReceiveCount. Celery: use on_failure callback to log to a failed_jobs table. Every queue system supports DLQ — use it.'
AI silently discards failed jobs — no DLQ, no logging, no alerting. A failed job might be: a payment that was charged but not recorded, a user who signed up but never got the welcome email, or an order that was placed but inventory was never deducted. DLQ ensures failed jobs are visible and actionable.
Never silently discard a failed job. The DLQ makes failures visible — someone investigates, fixes the root cause, and requeues. No silent data loss.
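A sketch of the move-to-DLQ handler, using an in-memory array as a stand-in. With BullMQ, this logic would live in a listener on the worker's 'failed' event; with SQS, the redrive policy does the move automatically:

```javascript
const deadLetterQueue = []; // in-memory stand-in for a dedicated DLQ queue

function onJobFailed(job, err) {
  if (job.attemptsMade < job.maxAttempts) return; // retries remain, leave it queued
  // Store everything an operator needs to investigate and requeue
  deadLetterQueue.push({
    payload: job.payload,
    error: err.message,
    attempts: job.attemptsMade,
    failedAt: new Date().toISOString(),
  });
  // An alert (PagerDuty, Slack, ...) would fire here: DLQ items need a human
}
```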
Rule 5: Worker Scaling and Monitoring
The rule: 'Run workers as separate processes — not inside the web server. Scale workers independently: if the queue is backing up, add more workers. If the queue is empty, scale down. Use concurrency settings: BullMQ worker concurrency (how many jobs one worker processes simultaneously), Celery --concurrency (with worker_prefetch_multiplier controlling how many messages each worker reserves ahead), or Sidekiq concurrency. Start with concurrency matching available CPU cores.'
For monitoring: 'Monitor: queue depth (jobs waiting), processing rate (jobs/second), average processing time, failure rate, and DLQ size. Alert on: queue depth growing (workers not keeping up), failure rate spike (something broke), and DLQ items (needs investigation). Use queue dashboards: BullMQ Board, Flower (Celery), Sidekiq Web UI.'
For priority queues: 'Use separate queues for different priorities: high (payment processing, critical notifications), default (emails, analytics), and low (report generation, data sync). Process high-priority first. Use rate limiting per queue to prevent one queue from starving others. Never mix critical and non-critical jobs in one queue — a backed-up email queue should not delay payment processing.'
- Workers as separate processes — not inside the web server
- Scale independently: more workers when queue backs up — fewer when empty
- Concurrency: match CPU cores — tune based on I/O vs CPU workload
- Priority queues: high (payments), default (email), low (reports) — separate
- Monitor: queue depth, processing rate, failure rate, DLQ size — dashboard + alerts
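The priority rule can be illustrated with a tiny single-threaded dispatcher. Real systems run separate worker pools per queue; this sketch only shows the ordering guarantee that high-priority jobs never wait behind low-priority ones:

```javascript
const queues = { high: [], default: [], low: [] }; // separate queue per priority

function enqueue(priority, job) {
  queues[priority].push(job);
}

function nextJob() {
  // Always drain higher-priority queues first
  for (const priority of ['high', 'default', 'low']) {
    if (queues[priority].length > 0) return queues[priority].shift();
  }
  return null; // all queues empty
}
```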
Complete Background Jobs Rules Template
Consolidated rules for background job processing.
- Queue operations >200ms: email, image, PDF, external API, bulk — never synchronous
- Request handler: validate → persist → enqueue → respond immediately
- Idempotent jobs: check-then-act — safe to retry — payload contains IDs not objects
- Exponential backoff: 10s → 1m → 10m → 1h — max 5 attempts — then DLQ
- Dead letter queue: never discard failed jobs — store, alert, investigate, requeue
- Workers as separate processes — scale independently — concurrency per CPU
- Priority queues: high/default/low — critical jobs never blocked by non-critical
- Monitor: depth, rate, failures, DLQ — BullMQ Board / Flower / Sidekiq Web