Best Practices

AI Rules for Rate Limiting and Throttling

AI generates APIs without rate limiting — every endpoint is an unlimited resource. Rules for token bucket, sliding window, tiered limits, and proper 429 responses.

7 min read·October 8, 2024

An API without rate limiting is an open invitation for abuse and DDoS

Token bucket, sliding window, per-user limits, 429 responses, and Redis-backed limiters

Why Every API Needs Rate Limiting Rules

AI generates APIs with unlimited access — every endpoint serves as many requests as arrive, as fast as they arrive. This is an open invitation for: DDoS attacks (flood the API, overwhelm the server), credential stuffing (try millions of passwords against the login endpoint), scraping (download your entire database through the list endpoint), and cost amplification (trigger expensive operations — AI inference, email sending — thousands of times per second).

Rate limiting is not optional infrastructure — it is a security control as fundamental as authentication. An authenticated API without rate limiting is still abusable: a valid user (or a compromised API key) can exhaust server resources, rack up cloud costs, and degrade service for everyone else.

These rules cover: algorithm choice (token bucket vs sliding window), limit configuration (per-user, per-IP, per-endpoint), response format (429 with Retry-After), and implementation (Redis-backed for distributed systems, in-memory for single instances).

Rule 1: Token Bucket or Sliding Window — Not Fixed Window

The rule: 'Use token bucket or sliding window log for rate limiting — never fixed window. Token bucket: allows bursts up to the bucket size, refills at a steady rate. Sliding window: counts requests in a rolling time window, no burst allowance. Fixed window (reset every minute) has the double-rate boundary problem — a user can send 2x the limit at the window boundary.'

For token bucket: 'Configure: bucket size (max burst — e.g., 100 requests), refill rate (steady state — e.g., 10 requests/second). A user can burst 100 requests immediately, then is limited to 10/second. This is ideal for: API endpoints that tolerate bursts, real-time applications, and endpoints where occasional high throughput is expected.'
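The token bucket described above can be sketched in a few lines. This is a minimal single-instance sketch: the class and method names (`TokenBucket`, `tryConsume`) and the injectable clock are illustrative, not from any specific library.

```typescript
// Token bucket sketch: bursts up to `capacity`, refills at `refillPerSec`.
// The `now` parameter lets tests control time deterministically.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,      // max burst, e.g. 100 requests
    private refillPerSec: number,  // steady state, e.g. 10 requests/second
    private now: () => number = () => Date.now(),
  ) {
    this.tokens = capacity;        // start full: full burst available
    this.lastRefill = this.now();
  }

  tryConsume(cost = 1): boolean {
    // Refill based on elapsed time, capped at the bucket capacity.
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = t;

    if (this.tokens < cost) return false; // rate limited
    this.tokens -= cost;
    return true;
  }
}
```

With capacity 100 and refill 10/s, a client can burst 100 requests immediately, then sustain 10 per second, exactly the behavior the rule describes.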

For sliding window: 'Count requests in the last N seconds. If count > limit, reject. No burst — the limit is strict over any rolling window. This is ideal for: authentication endpoints (no burst = no brute force), expensive operations (AI inference, payment processing), and endpoints where consistent throughput matters more than burst tolerance.'
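The sliding window log can be sketched the same way. Again a single-instance sketch with illustrative names; it also shows how Retry-After falls out of the algorithm (time until the oldest request leaves the window).

```typescript
// Sliding window log sketch: strict limit over any rolling window, no burst.
class SlidingWindowLog {
  private timestamps: number[] = [];

  constructor(
    private limit: number,     // e.g. 5 attempts
    private windowMs: number,  // e.g. 60_000 for one minute
    private now: () => number = () => Date.now(),
  ) {}

  tryRequest(): boolean {
    const t = this.now();
    // Drop timestamps that have fallen out of the rolling window.
    const cutoff = t - this.windowMs;
    this.timestamps = this.timestamps.filter((ts) => ts > cutoff);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(t);
    return true;
  }

  // Seconds until the oldest request falls out of the window (for Retry-After).
  retryAfterSec(): number {
    if (this.timestamps.length < this.limit) return 0;
    return Math.ceil((this.timestamps[0] + this.windowMs - this.now()) / 1000);
  }
}
```

Note the memory trade-off: the log stores one timestamp per request, which is why auth endpoints (low limits) suit it better than high-traffic ones.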

  • Token bucket: burst-tolerant — bucket size for bursts, refill rate for steady state
  • Sliding window: strict — no burst, rolling count over time window
  • Never fixed window: the boundary problem allows 2x the limit in a short burst
  • Token bucket for general APIs — sliding window for auth and expensive ops
  • Both store state (counts/tokens) — need Redis for distributed systems
⚠️ Never Fixed Window

Fixed window rate limiting (reset every 60s) allows 2x the limit at the boundary: 100 requests at 0:59, 100 more at 1:01 = 200 in 2 seconds. Token bucket and sliding window prevent this entirely.

Rule 2: Per-User, Per-IP, and Per-Endpoint Limits

The rule: 'Apply multiple rate limit layers: per-IP for unauthenticated requests (protects against DDoS from a single source), per-user for authenticated requests (protects against account abuse), and per-endpoint for expensive operations (stricter limits on /api/ai/generate than /api/users). Each layer has its own limit — the strictest applies.'

For limit values: 'General API: 100 requests/minute per user. Authentication: 5 attempts/minute per IP (prevents brute force). Expensive operations: 10 requests/minute per user. Public read endpoints: 1000 requests/minute per IP. Webhooks: 100 requests/minute per source. Adjust based on your actual traffic patterns — start strict, loosen based on data.'

AI generates one global rate limit (or none). Layered limits protect different attack surfaces: IP limits protect against anonymous attacks, user limits protect against compromised accounts, endpoint limits protect against expensive operation abuse. All three together provide defense in depth.
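Layered limits can be sketched as a set of keyed counters where every applicable layer must pass. The in-memory `Map` stands in for Redis, and the key format and function names are illustrative; the limit values mirror the rule above.

```typescript
// Each layer gets its own counter key; a request passes only if every
// applicable layer allows it (the strictest layer applies).
type Layer = { key: string; limit: number }; // limit per minute

const counters = new Map<string, number>(); // in-memory stand-in for Redis

function checkLayer(layer: Layer): boolean {
  const count = (counters.get(layer.key) ?? 0) + 1;
  counters.set(layer.key, count);
  return count <= layer.limit;
}

function allowRequest(ip: string, userId: string | null, endpoint: string): boolean {
  const layers: Layer[] = [
    { key: `ip:${ip}`, limit: 1000 },                           // anonymous / DDoS
    { key: `endpoint:${endpoint}:${userId ?? ip}`, limit: 10 }, // expensive ops
  ];
  if (userId) layers.push({ key: `user:${userId}`, limit: 100 }); // account abuse
  return layers.every(checkLayer); // strictest layer wins
}
```

In production each counter would live in Redis with a TTL for the window, but the layering logic is the same.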

Rule 3: 429 with Retry-After and Rate Limit Headers

The rule: 'Return HTTP 429 Too Many Requests when rate limited. Include headers: Retry-After (seconds until the client can retry), X-RateLimit-Limit (total allowed), X-RateLimit-Remaining (remaining in window), X-RateLimit-Reset (Unix timestamp when window resets). The response body should be JSON: { "error": { "message": "Rate limit exceeded", "code": "RATE_LIMITED", "retryAfter": 30 } }.'

For Retry-After: 'Calculate based on the algorithm: token bucket → time until next token. Sliding window → time until oldest request falls out of window. Clients that respect Retry-After automatically back off — reducing pressure on your API. Well-behaved clients never see 429 more than once per burst.'

AI returns 403 or 500 for rate limiting — neither tells the client what happened or when to retry. 429 is the correct status code. Retry-After is the correct header. Together, they let clients self-regulate: slow the request rate and retry, instead of failing the whole operation.
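The headers and body above can be sketched as two small helpers, framework-agnostic so they drop into any middleware. The `LimitResult` shape and the optional `nowMs` parameter are assumptions for illustration.

```typescript
// Build the rate limit headers and 429 body described in the rule.
type LimitResult = { allowed: boolean; limit: number; remaining: number; resetAt: number };

function rateLimitHeaders(r: LimitResult, nowMs: number = Date.now()): Record<string, string> {
  const retryAfterSec = Math.max(0, Math.ceil((r.resetAt - nowMs) / 1000));
  return {
    "X-RateLimit-Limit": String(r.limit),
    "X-RateLimit-Remaining": String(Math.max(0, r.remaining)),
    "X-RateLimit-Reset": String(Math.floor(r.resetAt / 1000)), // Unix timestamp
    // Retry-After only accompanies a 429 response.
    ...(r.allowed ? {} : { "Retry-After": String(retryAfterSec) }),
  };
}

function tooManyRequestsBody(retryAfter: number): string {
  return JSON.stringify({
    error: { message: "Rate limit exceeded", code: "RATE_LIMITED", retryAfter },
  });
}
```

Sending the X-RateLimit-* headers on every response, not just the 429, lets well-behaved clients pace themselves before they ever hit the limit.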

  • 429 Too Many Requests — not 403, not 500, not 200 with error in body
  • Retry-After: seconds until client can retry — clients auto-back-off
  • X-RateLimit-Limit: total allowed in window
  • X-RateLimit-Remaining: requests left in current window
  • X-RateLimit-Reset: Unix timestamp when window resets
💡 429, Not 403

AI returns 403 Forbidden for rate limits — wrong code, wrong semantics. 429 Too Many Requests + Retry-After header lets clients self-regulate: wait N seconds, retry. Well-behaved clients never see 429 more than once per burst.

Rule 4: Redis-Backed Limiters for Distributed Systems

The rule: 'Use Redis for rate limiting in distributed systems (multiple server instances, serverless). In-memory rate limiting fails when: requests hit different server instances (each sees partial traffic), servers restart (counts reset), or the app runs serverless (no persistent memory). Redis provides: shared state across instances, atomic operations (INCR + EXPIRE), and persistence across restarts.'

For implementation: 'Use a rate limiting library: rate-limiter-flexible (Node.js), limits (Python), or redis-rate-limiter (Go). These handle: Redis connection, atomic operations, key management, and cleanup. For simple cases: INCR the key, check against limit, EXPIRE for window reset. Use Lua scripts for atomic multi-step operations (check + increment + set expiry in one round-trip).'
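The simple INCR + EXPIRE pattern can be sketched against a minimal Redis-like interface. The interface here is an assumption so the logic reads without a live Redis; real clients such as ioredis expose the same incr/expire commands.

```typescript
// Fixed-key INCR + EXPIRE pattern from the rule: increment, check, and set
// the TTL on first hit. INCR is atomic in Redis, so concurrent instances
// never double-count. (For fully atomic multi-step checks, use a Lua script.)
interface RedisLike {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<void>;
}

async function incrAndCheck(
  redis: RedisLike,
  key: string,       // e.g. "user:123:/api/ai/generate"
  limit: number,
  windowSec: number,
): Promise<boolean> {
  const count = await redis.incr(key);                  // atomic increment
  if (count === 1) await redis.expire(key, windowSec);  // start the window on first hit
  return count <= limit;
}
```

This is the "simple case" from the rule; note that an INCR-with-TTL window behaves like a fixed window, so use a library or Lua-scripted sliding window for endpoints where the boundary problem matters.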

For serverless: 'Use Vercel KV, Upstash Redis, or Cloudflare KV for serverless rate limiting. These provide: Redis-compatible API, global distribution, and low-latency access from edge locations. @upstash/ratelimit is purpose-built for serverless: const ratelimit = new Ratelimit({ redis, limiter: Ratelimit.slidingWindow(10, "10 s") }).'

ℹ️ Redis for Distributed

In-memory rate limiting fails with multiple server instances — each sees partial traffic. Redis provides shared state: all instances check/increment the same counter atomically. Use for any app with more than one instance.

Rule 5: Tiered Limits and Graceful Degradation

The rule: 'Implement tiered rate limits based on user plan: free users (100 req/min), pro users (1000 req/min), enterprise users (custom — negotiated). Store the limit in the user record or plan configuration. The rate limiter reads the limit per user — not a global constant. This enables: plan differentiation, fair usage, and upsell opportunities.'
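A per-plan limit lookup might look like the sketch below. The `User` shape and the enterprise fallback value are assumptions; the plan limits mirror the rule.

```typescript
// The rate limiter reads the limit from the user's plan, not a global constant.
type Plan = "free" | "pro" | "enterprise";
interface User { id: string; plan: Plan; customLimit?: number }

const PLAN_LIMITS: Record<Plan, number> = {
  free: 100,   // requests per minute
  pro: 1000,
  enterprise: -1, // sentinel: enterprise uses the negotiated customLimit
};

function limitFor(user: User): number {
  if (user.plan === "enterprise") {
    return user.customLimit ?? 10_000; // fallback value is an assumption
  }
  return PLAN_LIMITS[user.plan];
}
```

The limiter then uses `limitFor(user)` wherever it would otherwise read a constant, so upgrading a plan changes the limit without code changes.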

For graceful degradation: 'When approaching the limit, degrade gracefully instead of hard-cutting at the limit. At 80% of limit: add a warning header (X-RateLimit-Warning: approaching). At 100%: return 429. For critical endpoints (health checks, auth token refresh), exempt from rate limiting — these must always work. For background jobs, use separate rate limit pools — batch processing should not compete with user-facing traffic.'
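The 80%/100% thresholds and the exemptions can be sketched as one pure decision function. The exempt path names and the assumed 60-second Retry-After are illustrative.

```typescript
// Warn at 80% of the limit, reject at 100%, and exempt critical paths entirely.
const EXEMPT_PATHS = new Set(["/health", "/auth/refresh"]); // must always work

type Decision = { status: 200 | 429; headers: Record<string, string> };

function decide(path: string, used: number, limit: number): Decision {
  if (EXEMPT_PATHS.has(path)) return { status: 200, headers: {} };
  if (used >= limit) {
    return { status: 429, headers: { "Retry-After": "60" } }; // 60s window assumed
  }
  const headers: Record<string, string> = {};
  if (used >= 0.8 * limit) headers["X-RateLimit-Warning"] = "approaching";
  return { status: 200, headers };
}
```

Keeping this as a pure function also makes the thresholds trivial to unit test, independent of the storage backend.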

For monitoring: 'Track rate limit metrics: total 429 responses per endpoint, per user, per IP. Alert on: sustained 429 rate (indicates a real user hitting limits — might need a limit increase), 429 spike from one IP (attack), and zero 429s (limits might be too generous — are they actually protecting anything?).'

  • Tiered: free (100/min), pro (1000/min), enterprise (custom) — plan-based limits
  • Warning at 80% — 429 at 100% — exempt health checks and auth refresh
  • Separate pools: user-facing vs background jobs — no competition
  • Monitor: 429 per endpoint/user/IP — alert on sustained or spiked rejections
  • Zero 429s = limits too generous — not enough protection

Complete Rate Limiting Rules Template

Consolidated rules for rate limiting.

  • Token bucket for general APIs — sliding window for auth and expensive operations
  • Never fixed window — boundary problem allows 2x rate
  • Layered: per-IP (DDoS), per-user (abuse), per-endpoint (expensive ops)
  • 429 + Retry-After + X-RateLimit-* headers — never 403 or 500 for rate limits
  • Redis for distributed — in-memory only for single-instance apps
  • Upstash/Vercel KV for serverless — @upstash/ratelimit purpose-built
  • Tiered limits by plan — exempt health checks — warn at 80%
  • Monitor 429 metrics — alert on sustained or spiked rejections