Best Practices

AI Rules for Circuit Breaker Patterns

AI retries failed calls infinitely with no backoff, amplifying the problem. Rules for circuit breaker state machines, failure thresholds, fallback responses, health checks, and bulkhead isolation.

7 min read·March 24, 2025

Infinite retries to a down service — 2000 failed calls per second preventing recovery

Three-state machine, failure thresholds, graceful fallbacks, bulkhead isolation, opossum library

AI Retries Until Everything Is Down

AI generates external service calls with: infinite retries (the downstream service is down, but the caller keeps hammering it — preventing recovery), no backoff (retry immediately, thousands of times per second), no timeout (wait 30 seconds for a response that will never come — holding connections and threads), no fallback (if the service is down, show a 500 error to every user), and no isolation (one slow dependency makes the entire application slow). The pattern amplifies failures instead of containing them.

The circuit breaker pattern solves this with: a state machine (Closed → Open → Half-Open) that detects failure patterns and short-circuits calls to failing services, configurable thresholds (trip after 5 failures in 30 seconds), fast failure in Open state (return fallback immediately, do not call the failing service), health probes in Half-Open state (allow one test request to check if the service recovered), and fallback responses (cached data, default values, or degraded functionality instead of errors). AI generates none of these.

These rules cover: the three-state circuit breaker machine, failure threshold configuration, fallback response strategies, half-open health probes, bulkhead isolation, and practical opossum implementation.

Rule 1: Three-State Circuit Breaker Machine

The rule: 'Implement the circuit breaker as a state machine with three states. Closed (normal): requests flow through to the downstream service. Failures are counted. When failures exceed the threshold, transition to Open. Open (tripped): requests fail immediately with a fallback response — the downstream service is not called. After a cooldown period, transition to Half-Open. Half-Open (testing): allow one request through to the downstream service. If it succeeds: transition to Closed (service recovered). If it fails: transition back to Open (still down).'

For the state transitions: 'Closed → Open: failure count exceeds threshold (5 failures in 30 seconds). Open → Half-Open: cooldown timer expires (60 seconds). Half-Open → Closed: test request succeeds. Half-Open → Open: test request fails. Each transition is logged for observability: { event: "circuit_open", service: "payment-api", failures: 5, timestamp }. Alert on Open transitions — a circuit opening means a dependency is failing.'

AI generates: try { await callService(); } catch { await callService(); } — retry on failure with no state tracking. The service is down: every request tries, fails, retries, and fails again. At 1000 requests per second, that is 2000 failed calls per second to an already-down service. A circuit breaker in the Open state: 1000 requests return a fallback in 1ms each, zero calls to the down service. The service gets zero load and time to recover.

  • Closed: normal flow, count failures, trip to Open on threshold
  • Open: fail fast with fallback, zero calls to downstream, cooldown timer
  • Half-Open: one test request, succeed → Closed, fail → Open again
  • Log all transitions: circuit_open, circuit_half_open, circuit_closed
  • Alert on Open: a tripped circuit = a failing dependency that needs attention
💡 Zero Calls to a Down Service

Without circuit breaker: 1000 requests = 2000 failed calls to an already-down service (preventing its recovery). Circuit Open: 1000 requests return fallback in 1ms each, zero calls to the failing service. The service gets zero load and time to recover.
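The three-state machine above can be sketched as a small class. This is a hypothetical minimal implementation (not opossum); the `SimpleCircuitBreaker` name and the injectable `now` clock are illustrative, and it ignores concurrency concerns a production breaker must handle:

```javascript
// Minimal three-state circuit breaker sketch.
// CLOSED -> OPEN when failures reach the threshold,
// OPEN -> HALF_OPEN after the cooldown expires,
// HALF_OPEN -> CLOSED on a successful probe, -> OPEN on a failed one.
class SimpleCircuitBreaker {
  constructor(action, { failureThreshold = 5, cooldownMs = 60_000, fallback, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.fallbackFn = fallback ?? (() => { throw new Error('circuit open'); });
    this.now = now;               // injectable clock, makes the sketch testable
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  async fire(...args) {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt >= this.cooldownMs) {
        this.state = 'HALF_OPEN';            // cooldown expired: this call is the probe
      } else {
        return this.fallbackFn(...args);     // fail fast, zero calls downstream
      }
    }
    try {
      const result = await this.action(...args);
      this.state = 'CLOSED';                 // probe or normal call succeeded
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';                 // trip and start the cooldown timer
        this.openedAt = this.now();
      }
      return this.fallbackFn(...args);
    }
  }
}
```

A real implementation would also emit transition events and use a rolling failure window rather than a plain counter; the sketch shows only the state transitions.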

Rule 2: Failure Threshold Configuration

The rule: 'Configure thresholds per dependency based on its characteristics. Fast services (payment API, auth service): trip after 5 failures in 30 seconds, 30-second cooldown. Slow services (report generation, ML inference): trip after 3 timeouts in 60 seconds, 120-second cooldown. Critical services (database): higher threshold (10 failures) with shorter cooldown (15 seconds) — the database recovers quickly but flapping is expensive. Non-critical services (analytics, logging): lower threshold (3 failures) with longer cooldown (300 seconds) — tolerate longer outages.'

For what counts as failure: 'Count as failures: HTTP 5xx responses, connection timeouts, connection refused, and response times exceeding the timeout threshold. Do not count: HTTP 4xx responses (client errors, not service failures), HTTP 429 (rate limited — handle with backoff, not circuit breaking), and expected errors (resource not found, validation failures). The circuit breaker monitors service health, not application logic errors.'

AI generates: one retry count for all services, with no distinction between service types. The analytics service (non-critical) and the payment service (critical) share the same threshold, so routine analytics slowness trips the circuit as readily as a genuine payment failure. With per-dependency configuration: analytics tolerates longer outages (5-minute cooldown), payments recover faster (30-second cooldown). Each dependency gets a circuit calibrated to its behavior.
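The per-dependency thresholds and the failure classifier from this rule might be expressed like this. The service names, numbers, and the `countsAsCircuitFailure` helper are illustrative, echoing the rule's examples:

```javascript
// Per-dependency circuit configuration (illustrative values from the rule).
const CIRCUIT_CONFIG = {
  'payment-api': { failureThreshold: 5,  windowMs: 30_000, cooldownMs: 30_000 },
  'report-gen':  { failureThreshold: 3,  windowMs: 60_000, cooldownMs: 120_000 },
  'database':    { failureThreshold: 10, windowMs: 30_000, cooldownMs: 15_000 },
  'analytics':   { failureThreshold: 3,  windowMs: 60_000, cooldownMs: 300_000 },
};

// Only infrastructure failures count against the circuit.
function countsAsCircuitFailure({ status = null, timedOut = false, connectionRefused = false }) {
  if (timedOut || connectionRefused) return true;  // network-level failure
  if (status === 429) return false;                // rate limited: back off, don't trip
  if (status !== null && status >= 500) return true; // server error
  return false;                                    // 4xx and success: not a service failure
}
```

The classifier is called from the circuit's failure hook, so a burst of 404s or validation errors never opens the circuit.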

Rule 3: Graceful Fallback Responses

The rule: 'Every circuit breaker has a fallback that returns a degraded but functional response. Recommendation service down: show popular items instead of personalized recommendations. Search service down: show browse-by-category instead of search results. Payment service down: queue the order for processing when the service recovers (accept the order, process payment later). The fallback should be indistinguishable from a slightly degraded experience — the user should not see an error page.'

For fallback strategies: '(1) Cached response: return the last successful response from cache (stale data is better than no data). (2) Default values: return sensible defaults (empty array for recommendations, zero for analytics counters). (3) Degraded functionality: disable the affected feature, keep everything else working (hide the search bar, show navigation instead). (4) Queue for later: accept the request, process when the service recovers (orders, emails, notifications). Choose by impact: cached for read operations, queue for write operations, degrade for optional features.'

AI generates: catch (error) { res.status(500).json({ error: 'Service unavailable' }); } — every downstream failure becomes a 500 for the user. The recommendation service is down: the entire product page shows a 500 error (even though the product data, pricing, and images are all available). With a fallback: the page loads with "Popular items" instead of "Recommended for you." 99% of the page works. The 1% that depends on the failing service degrades gracefully.

  • Cached response: last successful result, stale but functional
  • Default values: empty array, zero counter, generic content
  • Degraded functionality: hide affected feature, keep everything else working
  • Queue for later: accept writes, process when service recovers
  • User sees degraded experience, not error page — the circuit breaker is invisible
⚠️ 99% of the Page Still Works

Recommendation service down: AI shows a 500 error for the entire product page. With fallback: the page loads with 'Popular items' instead of 'Recommended for you.' Product data, pricing, images — all working. The 1% that failed degrades gracefully.
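The "cached response, then default" strategy can be sketched as a small helper. The `withCachedFallback` name and the in-memory `Map` cache are hypothetical; a real system would likely use a shared cache with TTLs:

```javascript
// Last successful response per dependency (stale data beats no data).
const lastGood = new Map();

// Wrap a call: live result if possible, cached result if not, default as last resort.
async function withCachedFallback(name, call, defaultValue) {
  try {
    const result = await call();
    lastGood.set(name, result);                              // refresh cache on success
    return { data: result, source: 'live' };
  } catch {
    if (lastGood.has(name)) {
      return { data: lastGood.get(name), source: 'cache' };  // stale but functional
    }
    return { data: defaultValue, source: 'default' };        // sensible default
  }
}
```

The `source` field mirrors the observability advice in Rule 5: callers render the data either way, and dashboards can count how often fallbacks serve traffic.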

Rule 4: Bulkhead Isolation Between Dependencies

The rule: 'Isolate each dependency with its own circuit breaker and connection pool (bulkhead pattern). Without isolation: one slow dependency exhausts all available connections/threads. The payment service is slow: all 100 connection pool connections are waiting on payment responses. No connections available for the fast user service or product service — everything is slow. Bulkhead: payment service gets 30 connections, user service gets 30, product service gets 30. Payment is slow: only payment calls are affected.'

For the naming convention: 'Name circuit breakers by dependency: const paymentCircuit = new CircuitBreaker(callPaymentAPI, { name: "payment-api" }); const userCircuit = new CircuitBreaker(callUserAPI, { name: "user-api" }). Each circuit tracks its own failure count, has its own state, and trips independently. Payment circuit opens: payment calls return fallback. User circuit stays closed: user calls work normally. The failure of one dependency does not cascade to others.'

AI generates: one circuit breaker wrapping all external calls, or no isolation at all. One slow service: all circuits trip (false positive — the user service is healthy but the shared circuit is open). Or: all connections exhausted by the slow service (everything is slow). Per-dependency bulkheads: failures are isolated. The slow payment service affects only payment operations. Everything else operates normally.

ℹ️ Slow Payment, Fast Everything Else

Without bulkheads: slow payment service exhausts all 100 connections. User service and product service have no connections available — everything is slow. Per-dependency bulkheads: payment gets 30 connections. User and product each get 30. Only payment calls are affected.
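A bulkhead can be sketched as a per-dependency concurrency cap. This is a minimal illustration (not a real connection pool library); the `Bulkhead` class and the 30/30/30 split are hypothetical, following the rule's example:

```javascript
// Minimal bulkhead: cap concurrent in-flight calls per dependency.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.inFlight = 0;
  }

  async run(task) {
    if (this.inFlight >= this.maxConcurrent) {
      throw new Error('bulkhead full');  // reject fast instead of queueing forever
    }
    this.inFlight += 1;
    try {
      return await task();
    } finally {
      this.inFlight -= 1;                // always release the slot
    }
  }
}

// Each dependency gets its own compartment: slow payments cannot starve the others.
const bulkheads = {
  payment: new Bulkhead(30),
  user: new Bulkhead(30),
  product: new Bulkhead(30),
};
```

Rejecting at the bulkhead feeds naturally into the circuit's fallback path: a full payment compartment degrades payment calls while user and product calls keep their own slots.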

Rule 5: Practical opossum Implementation

The rule: 'Use opossum for circuit breakers in Node.js: const circuit = new CircuitBreaker(asyncFunction, { timeout: 3000, errorThresholdPercentage: 50, resetTimeout: 30000, volumeThreshold: 5 }). Options: timeout (max wait for response), errorThresholdPercentage (50% failures trip the circuit), resetTimeout (30 seconds before Half-Open), volumeThreshold (minimum 5 requests before evaluating). Events: circuit.on("open", () => alert("Payment API circuit opened")); circuit.on("halfOpen", () => log("Testing payment API")).'

For fallback registration: 'circuit.fallback(() => ({ recommendations: popularItems, source: "fallback" })). The fallback function runs when: the circuit is Open (fast failure), the call times out (timeout exceeded), or the call throws an error (while the circuit is still Closed but the individual call failed). The fallback return value should match the expected response shape — the caller does not know whether it received a real response or a fallback. Add a source: "fallback" field for observability.'

AI generates: manual try/catch with retry counters, setTimeout for cooldown, and global variables tracking failure state — 50 lines of fragile state management. opossum: 5 lines of configuration, a battle-tested state machine, event emitters for observability, built-in stats (exportable to Prometheus via the companion opossum-prometheus package), and fallback registration. The library handles the state machine complexity; your code provides the configuration and the fallback.
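Putting the rule's pieces together, the wiring might look like the sketch below. It assumes opossum is installed (npm install opossum); callPaymentAPI and the fallback payload are placeholders for your own client and response shape:

```javascript
const CircuitBreaker = require('opossum');

// Placeholder for the real async client call.
async function callPaymentAPI(order) { /* ... */ }

const paymentCircuit = new CircuitBreaker(callPaymentAPI, {
  name: 'payment-api',
  timeout: 3000,                 // max wait for a response
  errorThresholdPercentage: 50,  // 50% failures trip the circuit
  resetTimeout: 30000,           // 30s cooldown before Half-Open
  volumeThreshold: 5,            // minimum requests before evaluating
});

// Fallback matches the expected response shape, tagged for observability.
paymentCircuit.fallback(() => ({ queued: true, source: 'fallback' }));

// Log every state transition; alert on 'open'.
paymentCircuit.on('open',     () => console.warn('circuit_open payment-api'));
paymentCircuit.on('halfOpen', () => console.info('circuit_half_open payment-api'));
paymentCircuit.on('close',    () => console.info('circuit_closed payment-api'));

// Callers use fire() instead of calling the API directly:
// const result = await paymentCircuit.fire(order);
```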

Complete Circuit Breaker Rules Template

Consolidated rules for circuit breaker patterns.

  • Three states: Closed (normal), Open (fail fast + fallback), Half-Open (test one request)
  • Per-dependency thresholds: fast services (5 failures/30s), slow services (3 timeouts/60s)
  • Count 5xx and timeouts as failures: not 4xx, not 429, not business logic errors
  • Fallback for every circuit: cached, defaults, degraded, or queued — never 500 error page
  • Bulkhead isolation: separate circuit + connection pool per dependency
  • opossum library: 5 lines of config, state machine, events, stats exportable to Prometheus
  • Alert on circuit open: a tripped circuit means a dependency needs attention
  • Log transitions: circuit_open, circuit_half_open, circuit_closed with dependency name