Best Practices

AI Rules for Event-Driven Architecture

AI chains synchronous API calls for workflows that should be asynchronous. Rules for event schemas, message queues, saga patterns, dead letter queues, and eventual consistency handling.

8 min read·February 20, 2025

Four synchronous calls in one handler — if email is slow the user waits, if analytics fails the order fails

Event schemas, message queues, saga orchestration, dead letter queues, eventual consistency

AI Chains Everything Synchronously

AI generates workflows as: synchronous chains (create order, charge payment, send email, update inventory — all in one request handler), tightly coupled (the order service directly calls the payment service, which calls the email service), failure-cascading (if email sending fails, the entire order fails), blocking (the user waits while the server sends emails and generates PDFs), and unrecoverable (a failed step has no retry mechanism and no record of what succeeded). One slow or failing downstream service blocks the entire user experience.

Modern event-driven architecture is: asynchronous (publish events, consumers process independently), decoupled (services communicate via events, not direct calls), failure-isolated (email failure does not block order completion), non-blocking (user gets immediate response, background processing continues), and recoverable (failed events go to dead letter queues for retry or investigation). AI generates none of these.

These rules cover: typed event schemas, message queue patterns, saga orchestration for multi-step workflows, dead letter queues for failure handling, idempotent consumers, and eventual consistency strategies.

Rule 1: Typed Event Schemas

The rule: 'Define every event with a typed schema: type OrderCreatedEvent = { type: "order.created"; id: string; timestamp: string; data: { orderId: string; userId: string; items: OrderItem[]; total: number; }; metadata: { correlationId: string; source: string; version: number; }; }. The schema is: versioned (version field for backward compatibility), correlated (correlationId traces events through the system), and validated (Zod schema validates events at publish and consume boundaries).'

For event naming: 'Use past-tense domain events: order.created, payment.charged, email.sent, inventory.reserved. Past tense signals that the event is a fact — something that already happened. Commands (create.order, charge.payment) are requests that may fail. Events are immutable facts. Namespace by domain: order.*, payment.*, user.* — clear ownership, easy filtering, and topic-based routing.'

AI generates: queue.push({ type: 'order', data: order }) — no schema, no version, no correlation ID, no timestamp. Consumers parse whatever arrives and hope the shape matches. With a typed schema, producers and consumers agree on the contract. Schema validation at the boundary catches mismatches before they cause runtime errors. The schema is the API contract for asynchronous communication.

  • Past-tense naming: order.created, payment.charged — events are facts, not commands
  • Versioned: version field for backward-compatible evolution
  • Correlated: correlationId traces a workflow across services
  • Validated: Zod schema at publish and consume boundaries
  • Namespaced: order.*, payment.* — domain ownership, topic routing
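The schema above can be sketched in plain TypeScript. This is a minimal sketch with a hand-rolled type guard; in practice the validation would be a Zod schema, as the rule states. The field values and `OrderItem` shape are illustrative.

```typescript
// Typed schema for order.created. The hand-rolled guard below stands in for a
// Zod schema; it validates events at the publish/consume boundary.
type OrderItem = { sku: string; quantity: number; price: number };

type OrderCreatedEvent = {
  type: "order.created"; // past-tense domain event, namespaced by domain
  id: string;            // unique event ID (enables consumer deduplication)
  timestamp: string;     // ISO-8601
  data: { orderId: string; userId: string; items: OrderItem[]; total: number };
  metadata: { correlationId: string; source: string; version: number };
};

// Validate at the boundary; reject malformed events before they cause
// runtime errors deep inside a consumer.
function isOrderCreatedEvent(e: unknown): e is OrderCreatedEvent {
  const ev = e as OrderCreatedEvent;
  return (
    ev?.type === "order.created" &&
    typeof ev.id === "string" &&
    typeof ev.timestamp === "string" &&
    typeof ev.data?.orderId === "string" &&
    Array.isArray(ev.data?.items) &&
    typeof ev.metadata?.correlationId === "string" &&
    typeof ev.metadata?.version === "number"
  );
}

const event: OrderCreatedEvent = {
  type: "order.created",
  id: "evt_123",
  timestamp: new Date().toISOString(),
  data: {
    orderId: "ord_1",
    userId: "usr_9",
    items: [{ sku: "A", quantity: 1, price: 20 }],
    total: 20,
  },
  metadata: { correlationId: "corr_abc", source: "order-service", version: 1 },
};

console.log(isOrderCreatedEvent(event));             // true — well-formed
console.log(isOrderCreatedEvent({ type: "order" })); // false — rejected at the boundary
```

The untyped `{ type: 'order' }` payload AI tends to produce fails the guard immediately, instead of failing later inside a consumer.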

Rule 2: Message Queue Patterns

The rule: 'Use message queues for asynchronous processing: SQS (AWS), BullMQ (Redis-backed, Node.js), Inngest (serverless event-driven), or RabbitMQ (self-hosted). The pattern: producer publishes an event to a queue, consumer processes the event independently. The producer does not wait for the consumer — the user gets an immediate response. Multiple consumers can process different aspects of the same event (fan-out).'

For fan-out: 'One event, multiple consumers: order.created triggers: (1) payment service reserves funds, (2) inventory service reserves items, (3) notification service sends confirmation email, (4) analytics service records the order. Each consumer subscribes to order.created independently. Adding a new consumer (loyalty points service) requires zero changes to the producer. The producer publishes the event; it does not know or care who consumes it.'

AI generates: async function createOrder() { await chargePayment(); await reserveInventory(); await sendEmail(); await updateAnalytics(); return order; } — four synchronous calls in one handler. If analytics is slow (2 seconds), the user waits 2 seconds. If email fails, the order fails. With events: publish order.created, return immediately. Four consumers process in parallel, independently, with their own retry logic.
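The contrast can be sketched with a minimal in-memory bus. This is illustrative only — in production the bus would be SQS, BullMQ, or Inngest — but it shows the two properties the paragraph describes: the producer returns immediately, and one failing consumer does not affect the others.

```typescript
// Minimal in-memory publish/fan-out sketch. Consumer names are illustrative.
type Handler = (event: unknown) => Promise<void>;

class EventBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(type: string, handler: Handler) {
    const list = this.handlers.get(type) ?? [];
    list.push(handler);
    this.handlers.set(type, list);
  }

  // publish() does not await consumers: the producer returns immediately,
  // and each consumer's failure is isolated to that consumer.
  publish(type: string, event: unknown) {
    for (const handler of this.handlers.get(type) ?? []) {
      handler(event).catch((err) => console.error(`consumer failed: ${err}`));
    }
  }
}

const bus = new EventBus();
const processed: string[] = [];

// Four independent consumers of the same event (fan-out).
bus.subscribe("order.created", async () => { processed.push("payment"); });
bus.subscribe("order.created", async () => { processed.push("inventory"); });
bus.subscribe("order.created", async () => { throw new Error("email down"); }); // fails in isolation
bus.subscribe("order.created", async () => { processed.push("analytics"); });

bus.publish("order.created", { orderId: "ord_1" });
// The producer has already returned; the email failure blocked nothing.
```

A fifth consumer (loyalty points) would be one more `subscribe` call — zero changes to the producer.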

💡 Add Consumers, Change Nothing

Fan-out: order.created triggers payment, inventory, email, and analytics consumers independently. Adding a loyalty points consumer requires zero changes to the order service. The producer publishes the event; it does not know or care who consumes it.

Rule 3: Saga Pattern for Distributed Transactions

The rule: 'For multi-step workflows that span services, use the saga pattern instead of distributed transactions. A saga is a sequence of local transactions with compensating actions: (1) reserve inventory (compensate: release inventory), (2) charge payment (compensate: refund payment), (3) ship order (compensate: cancel shipment). If step 2 fails, execute compensating action for step 1 (release inventory). Each step is a separate event — the saga orchestrator tracks progress.'

For orchestration vs choreography: 'Orchestration: a central saga orchestrator directs the workflow — sends commands, receives results, decides next steps. Good for complex workflows with conditional logic. Choreography: each service listens for events and knows what to do next — no central coordinator. Good for simple linear workflows. Use orchestration when: steps have conditional branches, error handling is complex, or you need a clear picture of workflow state.'

AI generates: a single database transaction wrapping calls to three services — BEGIN, call payment API, call inventory API, call email API, COMMIT. If the payment API call succeeds but the inventory API fails, the payment is committed but inventory is not reserved. Distributed transactions across services are unreliable. Sagas with compensating actions: each step is independently committed, failures trigger compensation. Eventually consistent, but reliable.

⚠️ Distributed Transactions Are Unreliable

BEGIN, call payment API, call inventory API, COMMIT — if payment succeeds but inventory fails, payment is committed without reserved inventory. Sagas with compensating actions: each step independently committed, failures trigger rollback of previous steps. Eventually consistent but reliable.
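A saga orchestrator can be sketched as a list of steps, each paired with its compensating action. For brevity the steps here are synchronous — real steps would be async service calls behind events — and the step names are illustrative.

```typescript
// Saga sketch: execute steps in order; on failure, run the compensating
// actions of all completed steps in reverse order.
type SagaStep = {
  name: string;
  execute: () => void;    // in a real system: an async call to another service
  compensate: () => void; // undoes execute() if a later step fails
};

function runSaga(steps: SagaStep[]): boolean {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.execute();
      completed.push(step);
    } catch {
      // Compensate in reverse: last completed step is rolled back first.
      for (const done of completed.reverse()) done.compensate();
      return false;
    }
  }
  return true;
}

const log: string[] = [];
const orderSaga: SagaStep[] = [
  {
    name: "reserve-inventory",
    execute: () => { log.push("inventory reserved"); },
    compensate: () => { log.push("inventory released"); },
  },
  {
    name: "charge-payment",
    execute: () => { throw new Error("card declined"); }, // step 2 fails
    compensate: () => { log.push("payment refunded"); },
  },
];

const ok = runSaga(orderSaga);
console.log(ok, log); // false [ 'inventory reserved', 'inventory released' ]
```

Unlike the BEGIN/COMMIT version, every committed step has a recorded undo path: when payment fails, the inventory reservation is explicitly released rather than silently orphaned.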

Rule 4: Dead Letter Queues and Retry Strategies

The rule: 'Configure dead letter queues (DLQ) for events that fail after maximum retries. Retry strategy: (1) immediate retry for transient errors (network blip), (2) exponential backoff for service errors (1s, 2s, 4s, 8s, 16s), (3) maximum retry count (5 attempts typically), (4) DLQ for permanently failed events. The DLQ is: monitored (alert on DLQ depth), inspectable (read failed events to understand why), and replayable (fix the bug, then replay events from the DLQ).'

For error classification: 'Transient errors (network timeout, 503) — retry immediately, likely to succeed. Service errors (500, connection refused) — retry with backoff, give the service time to recover. Permanent errors (400, invalid data, business rule violation) — send to DLQ immediately, retrying will not help. The consumer should classify errors and route accordingly. Retrying a 400 Bad Request wastes resources; sending a 503 to the DLQ loses a recoverable event.'

AI generates: no retry logic, no DLQ. A failed event disappears. The order was created, the payment was charged, but the confirmation email was never sent — and nobody knows. With a DLQ: the failed email event sits in the DLQ, an alert fires, the team investigates, fixes the issue, and replays the event. The customer gets their confirmation. Nothing is lost.

  • DLQ for events that fail after max retries — never lose events silently
  • Exponential backoff: 1s, 2s, 4s, 8s, 16s — give failing services time to recover
  • Classify errors: transient (retry now), service (retry with backoff), permanent (DLQ)
  • Monitor DLQ depth — alert on growth, investigate patterns
  • Replayable DLQ: fix the bug, replay failed events — nothing is lost

ℹ️ Nothing Is Lost

Without a DLQ: failed email event disappears silently. Customer never gets their order confirmation, nobody knows. With a DLQ: the failed event sits there, an alert fires, the team fixes the issue, replays the event. Customer gets their confirmation. Zero events lost.

Rule 5: Eventual Consistency Handling

The rule: 'In event-driven systems, data is eventually consistent — not immediately consistent. After order.created is published, the inventory service may take 500ms to process it. During that window, the inventory count is stale. Design for this: (1) UI shows optimistic updates ("Order placed" immediately, not after all services confirm), (2) read models are eventually consistent (accept staleness for reads, enforce consistency for writes), (3) conflict detection (version numbers or timestamps detect stale updates).'

For user experience: 'The user does not need immediate consistency — they need immediate feedback. "Your order has been placed" is shown immediately (order.created published). The confirmation email arrives in 5 seconds (email consumer processed). The inventory dashboard updates in 2 seconds (inventory consumer processed). The user perceives a fast, responsive system. Behind the scenes, eventual consistency is converging to a consistent state.'

AI generates: synchronous consistency — the user waits for every service to confirm before seeing a response. 5 services, each 200ms = 1 second of waiting. With eventual consistency: immediate response (50ms), background convergence (5 services process in parallel, all done in 500ms). The user gets a 20x faster response; the system reaches consistency 500ms later. For most use cases, this is the right trade-off.

Complete Event-Driven Architecture Rules Template

Consolidated rules for event-driven architecture.

  • Typed event schemas: past-tense naming, versioned, correlated, Zod-validated
  • Message queues for async: SQS, BullMQ, Inngest — producer does not wait for consumer
  • Fan-out: one event, multiple independent consumers — add consumers without changing producer
  • Saga pattern for multi-step workflows — compensating actions instead of distributed transactions
  • Dead letter queues: never lose events — monitor, inspect, replay after fixing
  • Retry with error classification: transient (immediate), service (backoff), permanent (DLQ)
  • Eventual consistency: optimistic UI, fast response, background convergence
  • Idempotent consumers: safe to process the same event twice — deduplication by event ID
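The last bullet — idempotent consumers deduplicating by event ID — can be sketched as follows. The in-memory Set is illustrative; a real consumer would use a persistent store (e.g. a table keyed by event ID), since most queues deliver at-least-once and will redeliver.

```typescript
// Idempotent consumer sketch: skip events already processed, keyed by event
// ID, so redelivery never double-runs the side effect.
const seen = new Set<string>();
let emailsSent = 0;

function handleOrderCreated(event: { id: string; orderId: string }) {
  if (seen.has(event.id)) return; // already processed — safe to skip
  seen.add(event.id);
  emailsSent += 1; // side effect runs exactly once per event ID
}

const evt = { id: "evt_1", orderId: "ord_1" };
handleOrderCreated(evt);
handleOrderCreated(evt); // redelivery: no second email
console.log(emailsSent); // 1
```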