AI Retries Like a Broken Record
AI generates retry logic with: immediate retries (no delay between attempts — hammering the failing service), infinite retries (never gives up, blocks the caller forever), no jitter (1000 clients all retry at exactly the same time — thundering herd), no error classification (retries a 400 Bad Request that will never succeed), and no idempotency check (retrying a non-idempotent POST creates duplicate resources). Every retry pattern AI generates either amplifies the failure or creates new problems.
Modern retry patterns are: backoff-delayed (exponential delay: 1s, 2s, 4s, 8s — giving the service time to recover), jitter-distributed (random variance prevents thundering herd), max-limited (3-5 attempts, then fail with a clear error), error-classified (retry 503 Service Unavailable, do not retry 400 Bad Request), and idempotency-verified (only retry operations that are safe to repeat). AI generates none of these.
These rules cover: exponential backoff calculation, jitter strategies, max retry limits, error classification for retryability, idempotency requirements, and retry budgets for system-wide control.
Rule 1: Exponential Backoff with Jitter
The rule: 'Calculate retry delay with exponential backoff plus random jitter: delay = min(baseDelay * 2^attempt + random(0, baseDelay), maxDelay), with attempt counted from 0. Base delay: 1 second. First retry: 1-2s. Second: 2-3s. Third: 4-5s. Fourth: 8-9s. Max delay cap: 30 seconds (prevent absurdly long waits). The exponential growth gives the failing service progressively more recovery time. The jitter prevents all clients from retrying simultaneously.'
For jitter strategies: 'Full jitter: delay = random(0, baseDelay * 2^attempt). Widest distribution, best thundering herd prevention. Equal jitter: delay = baseDelay * 2^attempt / 2 + random(0, baseDelay * 2^attempt / 2). Half deterministic, half random — good balance. Decorrelated jitter: delay = min(maxDelay, random(baseDelay, previousDelay * 3)). Each delay based on the previous — naturally decorrelates retries across clients. Full jitter is the recommended default (AWS recommends it).'
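The three strategies above can be sketched as small delay calculators. This is a sketch, not a library API: it assumes delays in milliseconds, attempts counted from 0, and that the decorrelated variant is seeded with previousDelay = baseDelay.

```javascript
// Jitter strategy sketches. BASE_MS and MAX_MS mirror the 1s base / 30s cap above.
const BASE_MS = 1000;
const MAX_MS = 30000;

// Full jitter: anywhere from 0 up to the full (capped) exponential delay.
function fullJitter(attempt) {
  return Math.random() * Math.min(BASE_MS * 2 ** attempt, MAX_MS);
}

// Equal jitter: half the exponential delay is guaranteed, half is random.
function equalJitter(attempt) {
  const exp = Math.min(BASE_MS * 2 ** attempt, MAX_MS);
  return exp / 2 + Math.random() * (exp / 2);
}

// Decorrelated jitter: each delay is drawn from [baseDelay, previousDelay * 3].
function decorrelatedJitter(previousDelay) {
  return Math.min(MAX_MS, BASE_MS + Math.random() * (previousDelay * 3 - BASE_MS));
}
```

Note how full jitter has the widest spread (it can retry immediately), which is exactly why it smooths out synchronized client populations best.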
AI generates: await new Promise(r => setTimeout(r, 1000)) — fixed 1-second delay for every retry. 1000 clients fail at the same time: all retry after exactly 1 second. The service gets a spike of 1000 requests — fails again. All retry after 1 second again. Thundering herd in a loop. With jitter: 1000 clients retry across a 1-2 second window (randomly distributed). The service receives a smooth trickle instead of a spike. Same retry count, dramatically different load pattern.
- Exponential: baseDelay * 2^attempt — 1s, 2s, 4s, 8s, 16s progression
- Jitter: add random(0, baseDelay) to prevent synchronized retries
- Max delay cap: 30 seconds — prevent absurdly long waits
- Full jitter (AWS recommended): random(0, baseDelay * 2^attempt)
- 1000 clients with jitter: smooth trickle. Without jitter: thundering herd spike
1000 clients with fixed 1-second delay: all retry at the same instant = spike. 1000 clients with full jitter: retries distributed randomly across a 1-2 second window = smooth trickle. Same retry count, dramatically different server load pattern.
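Putting the rule together, a minimal retry loop with exponential backoff, full jitter, a delay cap, and a bounded attempt count might look like the sketch below. The constants and the callFn parameter are illustrative assumptions.

```javascript
const BASE_DELAY_MS = 1000; // 1 second base
const MAX_DELAY_MS = 30000; // 30 second cap
const MAX_ATTEMPTS = 5;

// Full-jitter delay before retry number `attempt` (0-based).
function retryDelay(attempt) {
  return Math.random() * Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(callFn) {
  let lastError;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      return await callFn();
    } catch (err) {
      lastError = err;
      if (attempt < MAX_ATTEMPTS - 1) await sleep(retryDelay(attempt));
    }
  }
  throw lastError; // attempts exhausted: surface the failure instead of looping forever
}
```

The loop skips the final sleep on purpose: once the last attempt has failed, there is nothing left to wait for.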
Rule 2: Max Retry Limits
The rule: 'Set a maximum retry count: 3-5 attempts for user-facing operations (the user should not wait more than 30 seconds), 5-10 attempts for background jobs (longer tolerance, but still bounded), and 0 retries for non-idempotent operations unless idempotency is guaranteed. After max retries: return a clear error to the caller, log the failure with context (which service, which operation, how many attempts, total elapsed time), and optionally queue for later retry (dead letter queue).'
For a total time budget: 'Instead of max attempts, consider a total time budget: retry for up to 30 seconds, regardless of attempt count. With exponential backoff delays of 1s, 2s, 4s, 8s, and 16s, a 30-second budget allows about 5 attempts: the first four delays consume 15 seconds, and the fifth 16-second delay would overshoot the deadline. This naturally caps both the attempt count and the total wait time. Implementation: const deadline = Date.now() + 30000; let attempt = 0; while (Date.now() < deadline) { try { return await call(); } catch { await backoff(attempt++); } } throw new Error("retries exhausted");.'
AI generates: while (true) { try { return await call(); } catch { await sleep(1000); } } — infinite retries with fixed delay. When the downstream service is permanently down (not a transient failure), the caller retries forever, holding a connection, a thread, and the user's attention indefinitely. With max retries: after 5 attempts, return an error. The user sees a failure message and can take action, and the system resources are freed.
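The time-budget variant can be sketched as follows; call is the operation to retry, and backoffDelay follows the 1s/2s/4s/8s/16s ladder. Both names are illustrative, not a real library's API.

```javascript
function backoffDelay(attempt) {
  return Math.min(1000 * 2 ** attempt, 16000); // 1s, 2s, 4s, 8s, 16s
}

const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithinBudget(call, budgetMs = 30000) {
  const deadline = Date.now() + budgetMs;
  let attempt = 0;
  let lastError;
  while (true) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      const delay = backoffDelay(attempt++);
      // Stop if the next wait would overshoot the deadline.
      if (Date.now() + delay >= deadline) break;
      await pause(delay);
    }
  }
  throw lastError; // budget exhausted: fail with the last observed error
}
```

Checking the deadline before sleeping (rather than after) avoids burning a 16-second wait only to discover the budget is already gone.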
Rule 3: Error Classification for Retryability
The rule: 'Classify every error as retryable or permanent before deciding to retry. Retryable: 503 Service Unavailable (server is overloaded, may recover), 429 Too Many Requests (rate limited, retry after Retry-After header), connection timeout (network blip), connection reset (transient network issue), ECONNREFUSED (service restarting). Permanent: 400 Bad Request (invalid input, retrying with same input produces same error), 401 Unauthorized (invalid credentials), 403 Forbidden (no permission), 404 Not Found (resource does not exist), 409 Conflict (state conflict, retrying worsens it).'
For the Retry-After header: 'When the server returns 429 with Retry-After: 5, wait exactly 5 seconds before retrying (not your exponential backoff — the server told you when to retry). Retry-After can be: seconds (Retry-After: 5) or an HTTP date (Retry-After: Sun, 29 Mar 2026 15:00:00 GMT). Always check for and respect Retry-After before applying your own backoff. The server knows its capacity better than your retry algorithm.'
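Handling both Retry-After forms takes only a few lines. A sketch, with the function name as an assumption: it returns the server-mandated wait in milliseconds, or null when the header is absent or unparseable (fall back to your own backoff in that case).

```javascript
function retryAfterMs(headerValue, nowMs = Date.now()) {
  if (headerValue == null) return null;
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000); // "Retry-After: 5"
  const dateMs = Date.parse(headerValue); // "Retry-After: <HTTP-date>"
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}
```

The Math.max(0, ...) guards against a server date that is already in the past: retry immediately rather than waiting a negative duration.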
AI generates: catch (error) { return retry(); } — retry on every error regardless of type. A 400 Bad Request with invalid JSON: retried 5 times, fails 5 times (the input is wrong, not the service). A 404 Not Found: retried 5 times, resource still does not exist. Wasted 5 attempts and 15 seconds on errors that can never succeed. Error classification: retry the 503, do not retry the 400. Save retries for errors that can actually recover.
- Retryable: 503, 429 (with Retry-After), timeouts, connection reset, ECONNREFUSED
- Permanent: 400, 401, 403, 404, 409, 422 — retrying will not fix these
- Respect Retry-After header: server-specified wait time overrides your backoff
- Classify before retrying: saves attempts, time, and resources on permanent errors
- Unknown errors: treat as retryable for the first attempt, permanent after that
400 Bad Request: the input is wrong, not the service. Retrying 5 times wastes 5 attempts and 15 seconds on an error that can never succeed. Error classification: retry the 503 (may recover), fail immediately on the 400 (input needs fixing, not retry).
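A minimal classifier following the rules above might look like this. It is a sketch: the error shape, with optional status and code fields, is an assumption about your HTTP client, not a standard.

```javascript
const RETRYABLE_STATUS = new Set([429, 503]);
const PERMANENT_STATUS = new Set([400, 401, 403, 404, 409, 422]);
const RETRYABLE_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "ECONNREFUSED"]);

function isRetryable(error, attempt = 0) {
  if (RETRYABLE_STATUS.has(error.status)) return true;
  if (PERMANENT_STATUS.has(error.status)) return false;
  if (RETRYABLE_CODES.has(error.code)) return true;
  // Unknown errors: retryable on the first attempt, permanent after that.
  return attempt === 0;
}
```

The retry loop then becomes: catch, call isRetryable, and either back off or rethrow immediately.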
Rule 4: Idempotent Operations Only
The rule: 'Only retry operations that are idempotent — producing the same result when called multiple times. Safe to retry: GET requests (read-only), PUT requests (replace resource, same result on repeat), DELETE requests (already deleted = success), and POST with Idempotency-Key header (server deduplicates). Not safe to retry without idempotency: POST /api/orders (creates a new order each time), POST /api/payments (charges the card each time), POST /api/emails (sends the email each time). For non-idempotent operations: add an Idempotency-Key, or do not retry.'
For implementing idempotency keys: 'Generate a unique key per operation: const idempotencyKey = crypto.randomUUID(). Send with the request: headers: { "Idempotency-Key": idempotencyKey }. The server: checks if this key was seen before (Redis or database lookup). If seen: return the stored result (no re-execution). If new: execute the operation, store the result with the key (TTL 24 hours). The client retries with the same key: the server returns the original result without re-executing. Safe to retry any operation with an idempotency key.'
AI generates: retry logic wrapped around a POST /api/payments with no idempotency key. Network timeout after the server processed the payment but before the client received the response. Client retries: second payment created. Customer charged twice. With idempotency key: the retry sends the same key, server returns the original successful result, customer charged once. One header prevents the most dangerous retry failure mode.
Rule 5: Retry Budgets for System-Wide Control
The rule: 'Set a retry budget: limit the total retry rate across all clients to a percentage of the normal request rate. Example: retry budget = 20% of normal traffic. Normal rate: 1000 requests/second. Max retry rate: 200/second. If retries exceed 200/second: stop retrying, fail immediately. This prevents: retry storms (a failing service triggers retries from all clients, multiplying the load), cascading failures (retries from service A overwhelm service B, causing service B to fail, triggering retries from service C), and resource exhaustion (retries consume the same connections and threads as normal requests).'
For implementation: 'Track retry rate with a sliding window counter: const retryCounter = new SlidingWindowCounter(60); // 60-second window. Before retrying: if (retryCounter.count() / normalRequestCounter.count() > 0.2) return failImmediately(); else { retryCounter.increment(); return retry(); }. The retry budget is a system-level safety valve — individual circuits may allow retries, but the budget caps the total system-wide retry volume. Google SRE recommends a 10% retry budget for production systems.'
AI generates: every client retries independently with no coordination. Service B fails: 100 clients each retry 5 times = 500 retry requests per second on top of 100 normal requests. Service B receives 600 requests/second (6x normal load) while already struggling. With a 20% retry budget: max 20 retries/second system-wide. Service B receives 120 requests/second (1.2x normal). The service has a chance to recover instead of being buried under retry load.
100 clients each retrying 5 times: 600 requests/second to a service already struggling at 100/second. 20% retry budget: max 20 retries, service sees 120/second (1.2x). The service has a chance to recover instead of drowning under retry amplification.
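The budget check can be sketched like this. SlidingWindowCounter and the 20% figure follow the text above; the timestamp-array implementation is an assumption, adequate for a sketch but too slow for real high-volume counters (bucketed counts would be used instead).

```javascript
class SlidingWindowCounter {
  constructor(windowSeconds) {
    this.windowMs = windowSeconds * 1000;
    this.events = []; // raw timestamps; production code would bucket these
  }
  increment() {
    this.events.push(Date.now());
  }
  count() {
    const cutoff = Date.now() - this.windowMs;
    this.events = this.events.filter((t) => t > cutoff);
    return this.events.length;
  }
}

const requestCounter = new SlidingWindowCounter(60);
const retryCounter = new SlidingWindowCounter(60);

// Gate each retry: allowed only while retries stay under the budget fraction.
function mayRetry(budget = 0.2) {
  const normal = requestCounter.count();
  if (normal === 0) return false; // no traffic baseline: fail fast
  return retryCounter.count() / normal < budget;
}
```

Every normal request increments requestCounter; every retry checks mayRetry() first and, if allowed, increments retryCounter. Individual retry loops stay unchanged: the budget is a global cap layered on top.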
Complete Retry Patterns Rules Template
Consolidated rules for retry patterns.
- Exponential backoff: baseDelay * 2^attempt with max cap (30s) — not fixed delay
- Full jitter: random(0, baseDelay * 2^attempt) — prevents thundering herd
- Max retries: 3-5 for user-facing, 5-10 for background, or total time budget (30s)
- Error classification: retry 503/429/timeouts, do not retry 400/401/403/404/409/422
- Respect Retry-After header: server wait time overrides your backoff calculation
- Idempotent only: GET/PUT/DELETE safe, POST needs Idempotency-Key for safe retry
- Retry budget: max 10-20% of normal traffic as retries — system-wide safety valve
- Log every retry: attempt number, delay, error type, total elapsed time