Best Practices

AI Rules for Logging and Observability

AI logs with console.log and no context. Rules for structured JSON logging, log levels, request tracing, correlation IDs, and connecting logs to metrics and traces.

8 min read·August 22, 2024

console.log is not observability — structured JSON with context is

JSON logging, log levels, correlation IDs, RED metrics, and symptom-based alerting

How AI Logs (And Why It Breaks Observability)

AI generates console.log for everything: console.log("User created"), console.log("Error:", err), console.log("Request received"). These logs are: unstructured (plain text, not JSON — unsearchable in log platforms), missing context (no request ID, user ID, or timestamp), at the wrong level (errors logged as info, debug noise left enabled in production), and impossible to alert on (you cannot reliably alert on unstructured text).

Observability has three pillars: logs (what happened), metrics (how much/how fast), and traces (the path through the system). AI generates only unstructured logs — ignoring metrics and traces entirely. Without structured logs, you cannot: search for all errors in the last hour, filter by user ID, trace a request across services, or set up meaningful alerts.

These rules cover all three pillars with a focus on structured logging — the foundation that makes metrics and tracing useful.

Rule 1: Structured JSON Logging — Never console.log

The rule: 'Use a structured logger that outputs JSON: pino (Node.js), structlog (Python), zerolog (Go), logback with JSON encoder (Java). Every log entry is a JSON object with: level, message, timestamp, and context fields. Never use console.log, print(), or fmt.Println for application logging — they produce unstructured text that log platforms cannot index.'

For the log format: '{ "level": "info", "message": "User created", "userId": "abc-123", "email": "alice@example.com", "requestId": "req-456", "timestamp": "2024-08-22T10:30:00Z", "service": "user-api", "duration_ms": 45 }. Every field is searchable: find all logs for userId=abc-123, find all errors in the last hour, find all requests slower than 1 second.'

AI generates console.log("User created: " + user.email) — a string that cannot be searched by user ID, filtered by level, or aggregated by service. One structured logger replaces console.log everywhere and makes every log entry searchable, filterable, and alertable.

  • Structured JSON: pino (Node), structlog (Python), zerolog (Go), logback JSON (Java)
  • Every entry: level, message, timestamp, requestId, context fields
  • Never console.log, print(), fmt.Println for application logging
  • JSON is searchable: filter by userId, level, service, time range
  • Unstructured text is unsearchable: grep is not observability
⚠️ console.log = Blind

console.log('User created: ' + email) produces: unstructured text you cannot search by user ID, filter by level, or alert on. { level: 'info', message: 'User created', userId, email, requestId } is searchable on every field.
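To make the contrast concrete, here is a minimal sketch of what a structured logger produces, using only the Node.js standard library. In practice you would use pino itself; the field names below simply mirror the example format above, and `createLogger` is an illustrative stand-in, not a real pino API.

```javascript
// Minimal structured-logger sketch: every call returns one JSON line
// with level, message, timestamp, and merged context fields.
function createLogger(base = {}) {
  const emit = (level) => (message, fields = {}) =>
    JSON.stringify({
      level,
      message,
      timestamp: new Date().toISOString(),
      ...base,     // service-wide context (e.g. service name)
      ...fields,   // per-call context (e.g. userId, requestId)
    });
  return {
    info: emit("info"),
    warn: emit("warn"),
    error: emit("error"),
    // child() copies the parent's context, so every entry carries it
    child: (fields) => createLogger({ ...base, ...fields }),
  };
}

const logger = createLogger({ service: "user-api" });
const line = logger.info("User created", {
  userId: "abc-123",
  requestId: "req-456",
});
console.log(line);
```

Because the output is JSON, a log platform can index every field: filtering by `userId` or `level` is a query, not a grep.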

Rule 2: Proper Log Levels

The rule: 'Use log levels consistently: error (something failed and needs attention — alertable), warn (something unexpected but handled — monitor trends), info (significant business events — user created, order placed, payment processed), debug (development-only detail — SQL queries, cache hits/misses — disabled in production). Never use info for errors or debug for business events.'

For production: 'Set production log level to info — debug logs are disabled. This means: production logs show business events and errors, not internal implementation detail. If you need debug-level detail in production, enable it temporarily for one request using request-level log level override (most structured loggers support this).'

For what NOT to log: 'Never log at info level: function entry/exit, loop iterations, variable values, cache lookups. These are debug-level at most. In production, they add noise without value. Log at info level: business events (user signup, payment, order), configuration changes, startup/shutdown, and health status changes.'

  • error: failed + needs attention — alertable, pages oncall
  • warn: unexpected but handled — monitor trends, investigate if increasing
  • info: business events — user created, order placed, payment processed
  • debug: implementation detail — SQL, cache, internal state — disabled in production
  • Production = info level — debug noise adds cost and hides real events
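Level filtering is usually just a numeric comparison inside the logger. A sketch, again with plain Node rather than a real logging library (real loggers like pino use the same ordered-severity idea):

```javascript
// Numeric severities let the logger drop anything below the
// configured minimum. With "info" as the production level,
// debug entries are never emitted at all.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

function createLogger(minLevel = "info") {
  const threshold = LEVELS[minLevel];
  const emit = (level) => (message, fields = {}) => {
    if (LEVELS[level] < threshold) return null; // filtered out
    return JSON.stringify({
      level,
      message,
      timestamp: new Date().toISOString(),
      ...fields,
    });
  };
  return {
    debug: emit("debug"),
    info: emit("info"),
    warn: emit("warn"),
    error: emit("error"),
  };
}

const log = createLogger("info");
const dropped = log.debug("cache miss for key user:42"); // null: below threshold
const emitted = log.info("Order placed", { orderId: "ord-9" }); // JSON string
```

In production you would read the minimum level from configuration (e.g. a LOG_LEVEL environment variable) so it can be raised or lowered without a deploy.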

Rule 3: Correlation IDs and Request Tracing

The rule: 'Generate a unique request ID at the entry point (API gateway, load balancer, or first middleware): const requestId = crypto.randomUUID(). Include in every log entry for that request. Pass to downstream services in a header: X-Request-Id. Every log entry, in every service, for one user request shares the same requestId — you can trace the entire request path with one filter.'

For implementation: 'Middleware sets the request ID and attaches a child logger to the request: app.use((req, res, next) => { req.id = req.headers["x-request-id"] || crypto.randomUUID(); req.log = logger.child({ requestId: req.id }); next(); }). Attach the child to req rather than reassigning a shared logger — a shared reassignment would leak one request's ID into concurrent requests. All subsequent log calls through req.log include requestId automatically. Return the request ID in the response header for client-side debugging.'

For distributed tracing: 'Use OpenTelemetry for spans across services: each service creates a span, spans link via trace context propagation (W3C Trace Context header). Trace ID = the distributed version of request ID — it follows the request across multiple services. Connect traces to logs: include traceId and spanId in every log entry.'

💡 One ID, Entire Path

A single requestId included in every log entry lets you filter the complete history of one user request — across middleware, services, database calls, and error handlers. One UUID, attached once, searchable everywhere.

Rule 4: Metrics for Quantitative Observability

The rule: 'Collect four types of metrics: counters (total requests, total errors — monotonically increasing), gauges (current connections, queue depth — goes up and down), histograms (request duration distribution — P50, P95, P99), and rates (requests per second, errors per second — derived from counters). Use Prometheus client libraries for metric collection. Expose a /metrics endpoint for scraping.'

For the RED method: 'Monitor every service with three metrics: Rate (requests per second), Errors (error rate as percentage), and Duration (latency histogram — P50, P95, P99). These three metrics answer: is the service receiving traffic? is it failing? is it slow? Set alerts on all three: rate drop > 50%, error rate > 1%, P99 > SLA threshold.'

AI generates no metrics — you are blind to service health until users complain. Three metrics (rate, errors, duration) with three alerts give you complete service visibility. Prometheus + Grafana is the standard open-source stack.

  • Counters: total requests, total errors — monotonically increasing
  • Gauges: current connections, queue depth — current value
  • Histograms: request duration — P50, P95, P99 distribution
  • RED method: Rate, Errors, Duration — three metrics per service
  • Prometheus client + /metrics endpoint — Grafana for dashboards
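The RED arithmetic itself is simple. A sketch of per-service counters and a duration histogram with percentile readout — a Prometheus client library does this for you (with bucketed histograms rather than raw samples); the `ServiceMetrics` class here is illustrative only:

```javascript
// RED metrics sketch: request counter, error counter, and a duration
// "histogram" kept as raw samples for clarity (real histograms bucket).
class ServiceMetrics {
  constructor() {
    this.requests = 0;   // counter: monotonically increasing
    this.errors = 0;     // counter
    this.durations = []; // duration samples in milliseconds
  }
  observe(durationMs, isError) {
    this.requests += 1;
    if (isError) this.errors += 1;
    this.durations.push(durationMs);
  }
  // nearest-rank percentile over the collected samples
  percentile(p) {
    const sorted = [...this.durations].sort((a, b) => a - b);
    const idx = Math.min(
      sorted.length - 1,
      Math.ceil((p / 100) * sorted.length) - 1
    );
    return sorted[idx];
  }
  errorRate() {
    return this.requests === 0 ? 0 : this.errors / this.requests;
  }
}

const m = new ServiceMetrics();
// five requests; the slow 950 ms one also fails
[12, 18, 25, 40, 950].forEach((ms, i) => m.observe(ms, i === 4));
// m.percentile(50) → 25, m.percentile(99) → 950, m.errorRate() → 0.2
```

Note how the P99 (950 ms) exposes the outlier that an average (209 ms) would soften — which is why the rule asks for histograms, not averages.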

Rule 5: Alerting That Actually Works

The rule: 'Alert on symptoms, not causes. Symptom: error rate > 1% for 5 minutes. Cause: database connection pool exhausted. Alert on the symptom — it catches all causes, including ones you did not anticipate. Use multi-window alerts: if error rate > 5% for 1 minute OR > 1% for 15 minutes — catches both spikes and slow burns.'

For alert routing: 'Critical (pages oncall): error rate > 5%, P99 > 2x SLA, zero traffic for 5 minutes. Warning (Slack): error rate > 1%, P95 > SLA, unusual traffic pattern. Info (email/dashboard): daily summary, weekly trends, capacity planning. Never alert on everything — alert fatigue causes real alerts to be ignored.'

For log-based alerts: 'Alert on: new error type (never seen before), error rate spike (relative increase, not absolute threshold), specific error codes (PAYMENT_FAILED count > 10/minute), and absence (no healthcheck logs for 5 minutes = service is down). Use structured logs for reliable alerting — unstructured text produces false positives and missed alerts.'

ℹ️ Symptoms, Not Causes

Alert on 'error rate > 1% for 5 minutes' — it catches every cause, including ones you did not anticipate. Alerting on specific causes ('database timeout') misses the novel failure modes that actually page you at 3 AM.
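The multi-window check from the rule reduces to two rate computations over different lookbacks. A sketch, assuming per-minute buckets of {requests, errors} with the newest bucket last (the function names are illustrative, not from any alerting product):

```javascript
// Multi-window alert: fire on >5% errors over the last 1 minute
// (spike) OR >1% over the last 15 minutes (slow burn).
function rateOver(buckets, minutes) {
  const window = buckets.slice(-minutes);
  const requests = window.reduce((sum, b) => sum + b.requests, 0);
  const errors = window.reduce((sum, b) => sum + b.errors, 0);
  return requests === 0 ? 0 : errors / requests;
}

function shouldAlert(buckets) {
  return rateOver(buckets, 1) > 0.05 || rateOver(buckets, 15) > 0.01;
}

// A sudden spike: 10% errors in the latest minute trips the fast
// window even though the 15-minute rate (~0.67%) is still "healthy".
const spike = [
  ...Array.from({ length: 14 }, () => ({ requests: 1000, errors: 0 })),
  { requests: 1000, errors: 100 },
];
// shouldAlert(spike) → true
```

Because both conditions are symptoms (error rate), this fires regardless of whether the cause is a bad deploy, an exhausted connection pool, or something you never anticipated.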

Complete Logging & Observability Rules Template

Consolidated rules for logging and observability.

  • Structured JSON logger: pino/structlog/zerolog — never console.log/print/Println
  • Every entry: level, message, timestamp, requestId, context fields — all searchable
  • Log levels: error (alert), warn (monitor), info (business events), debug (dev only)
  • Correlation ID: requestId on every log — X-Request-Id header across services
  • OpenTelemetry: traceId + spanId in logs — distributed tracing across services
  • RED metrics: Rate, Errors, Duration per service — Prometheus + Grafana
  • Alert on symptoms: error rate, latency, traffic drop — not specific causes
  • Alert routing: critical → page, warning → Slack, info → dashboard — no alert fatigue