How AI Logs (And Why It Breaks Observability)
AI generates console.log for everything: console.log("User created"), console.log("Error:", err), console.log("Request received"). These logs are:
- unstructured — plain text, not JSON, so log platforms cannot index or search them
- missing context — no request ID, user ID, or timestamp
- at the wrong level — info-level for errors, debug-level detail left on in production
- impossible to alert on — you cannot reliably set up alerts on unstructured text
Observability has three pillars: logs (what happened), metrics (how much/how fast), and traces (the path through the system). AI generates only unstructured logs — ignoring metrics and traces entirely. Without structured logs, you cannot: search for all errors in the last hour, filter by user ID, trace a request across services, or set up meaningful alerts.
These rules cover all three pillars with a focus on structured logging — the foundation that makes metrics and tracing useful.
Rule 1: Structured JSON Logging — Never console.log
The rule: 'Use a structured logger that outputs JSON: pino (Node.js), structlog (Python), zerolog (Go), logback with JSON encoder (Java). Every log entry is a JSON object with: level, message, timestamp, and context fields. Never use console.log, print(), or fmt.Println for application logging — they produce unstructured text that log platforms cannot index.'
For the log format: '{ "level": "info", "message": "User created", "userId": "abc-123", "email": "alice@example.com", "requestId": "req-456", "timestamp": "2026-03-28T10:30:00Z", "service": "user-api", "duration_ms": 45 }. Every field is searchable: find all logs for userId=abc-123, find all errors in the last hour, find all requests slower than 1 second.'
AI generates console.log("User created: " + user.email) — a string that cannot be searched by user ID, filtered by level, or aggregated by service. One structured logger replaces console.log everywhere and makes every log entry searchable, filterable, and alertable.
- Structured JSON: pino (Node), structlog (Python), zerolog (Go), logback JSON (Java)
- Every entry: level, message, timestamp, requestId, context fields
- Never console.log, print(), fmt.Println for application logging
- JSON is searchable: filter by userId, level, service, time range
- Unstructured text is unsearchable: grep is not observability
console.log('User created: ' + email) produces unstructured text that you cannot search by user ID, filter by level, or alert on. { level: 'info', message: 'User created', userId, email, requestId } is searchable on every field.
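As a sketch of what a structured logger does, here is a minimal version in plain Node.js with no dependencies. It imitates the shape of pino's API (leveled methods plus child() for bound context); the emit() helper and field layout are illustrative, not pino's actual internals — in a real service, use pino itself:

```javascript
// Minimal sketch of structured JSON logging: one JSON object per line,
// every field searchable. Real loggers (pino) add levels, serializers, etc.
function createLogger(baseContext = {}) {
  const emit = (level, message, fields = {}) => {
    process.stdout.write(JSON.stringify({
      level,
      message,
      timestamp: new Date().toISOString(),
      ...baseContext, // bound context: service, requestId, ...
      ...fields,      // per-call context: userId, duration_ms, ...
    }) + "\n");
  };
  return {
    info: (message, fields) => emit("info", message, fields),
    warn: (message, fields) => emit("warn", message, fields),
    error: (message, fields) => emit("error", message, fields),
    // child() binds extra context to every subsequent entry from this logger.
    child: (context) => createLogger({ ...baseContext, ...context }),
  };
}

const logger = createLogger({ service: "user-api" });
logger.info("User created", { userId: "abc-123", email: "alice@example.com" });
```

Because each line is valid JSON, a log platform can filter `userId="abc-123"` or `level="error"` directly, with no regex guessing.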
Rule 2: Proper Log Levels
The rule: 'Use log levels consistently: error (something failed and needs attention — alertable), warn (something unexpected but handled — monitor trends), info (significant business events — user created, order placed, payment processed), debug (development-only detail — SQL queries, cache hits/misses — disabled in production). Never use info for errors or debug for business events.'
For production: 'Set production log level to info — debug logs are disabled. This means: production logs show business events and errors, not internal implementation detail. If you need debug-level detail in production, enable it temporarily for one request using request-level log level override (most structured loggers support this).'
For what NOT to log: 'Never log at info level: function entry/exit, loop iterations, variable values, cache lookups. These are debug-level at most. In production, they add noise without value. Log at info level: business events (user signup, payment, order), configuration changes, startup/shutdown, and health status changes.'
- error: failed + needs attention — alertable, pages oncall
- warn: unexpected but handled — monitor trends, investigate if increasing
- info: business events — user created, order placed, payment processed
- debug: implementation detail — SQL, cache, internal state — disabled in production
- Production = info level — debug noise adds cost and hides real events
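The level rules above can be sketched as a threshold filter — this is how pino and structlog implement minimum levels. The function name createLeveledLogger and the numeric level values are illustrative assumptions, not any library's real API:

```javascript
// Sketch of level filtering: with minLevel "info" (production), debug calls
// become no-ops, so implementation detail never reaches production logs.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

function createLeveledLogger(minLevel = "info") {
  const threshold = LEVELS[minLevel];
  const emit = (level, message, fields = {}) => {
    if (LEVELS[level] < threshold) return null; // dropped: below minimum level
    const entry = { level, message, timestamp: new Date().toISOString(), ...fields };
    process.stdout.write(JSON.stringify(entry) + "\n");
    return entry;
  };
  return {
    debug: (m, f) => emit("debug", m, f), // implementation detail: off in prod
    info:  (m, f) => emit("info", m, f),  // business events
    warn:  (m, f) => emit("warn", m, f),  // unexpected but handled
    error: (m, f) => emit("error", m, f), // failed, needs attention
  };
}

const log = createLeveledLogger("info");
log.debug("cache miss", { key: "user:abc-123" }); // silently dropped
log.info("Order placed", { orderId: "ord-789", amount_cents: 4999 });
```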
Rule 3: Correlation IDs and Request Tracing
The rule: 'Generate a unique request ID at the entry point (API gateway, load balancer, or first middleware): const requestId = crypto.randomUUID(). Include in every log entry for that request. Pass to downstream services in a header: X-Request-Id. Every log entry, in every service, for one user request shares the same requestId — you can trace the entire request path with one filter.'
For implementation: 'Middleware sets the request ID and attaches to the logger context: app.use((req, res, next) => { req.id = req.headers["x-request-id"] || crypto.randomUUID(); logger = logger.child({ requestId: req.id }); next(); }). All subsequent log calls in this request include requestId automatically. Return the request ID in the response header for client-side debugging.'
For distributed tracing: 'Use OpenTelemetry for spans across services: each service creates a span, spans link via trace context propagation (W3C Trace Context header). Trace ID = the distributed version of request ID — it follows the request across multiple services. Connect traces to logs: include traceId and spanId in every log entry.'
A single requestId included in every log entry lets you filter the complete history of one user request — across middleware, services, database calls, and error handlers. One UUID, attached once, searchable everywhere.
Rule 4: Metrics for Quantitative Observability
The rule: 'Collect four types of metrics: counters (total requests, total errors — monotonically increasing), gauges (current connections, queue depth — goes up and down), histograms (request duration distribution — P50, P95, P99), and rates (requests per second, errors per second — derived from counters). Use Prometheus client libraries for metric collection. Expose a /metrics endpoint for scraping.'
For the RED method: 'Monitor every service with three metrics: Rate (requests per second), Errors (error rate as percentage), and Duration (latency histogram — P50, P95, P99). These three metrics answer: is the service receiving traffic? is it failing? is it slow? Set alerts on all three: rate drop > 50%, error rate > 1%, P99 > SLA threshold.'
AI generates no metrics — you are blind to service health until users complain. Three metrics (rate, errors, duration) with three alerts give you complete service visibility. Prometheus + Grafana is the standard open-source stack.
- Counters: total requests, total errors — monotonically increasing
- Gauges: current connections, queue depth — current value
- Histograms: request duration — P50, P95, P99 distribution
- RED method: Rate, Errors, Duration — three metrics per service
- Prometheus client + /metrics endpoint — Grafana for dashboards
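A dependency-free sketch of what RED metrics track. In production you would use prom-client's Counter and Histogram (which bucket observations rather than storing raw samples); createRedMetrics and the nearest-rank percentile here are illustrative only:

```javascript
// Sketch of RED metrics: counters for rate/errors, samples for duration.
function createRedMetrics() {
  let requests = 0;     // counter: monotonically increasing
  let errors = 0;       // counter
  const durations = []; // raw samples; real histograms use buckets

  return {
    observe(durationMs, failed = false) {
      requests += 1;
      if (failed) errors += 1;
      durations.push(durationMs);
    },
    // Percentile over recorded samples (nearest-rank method).
    percentile(p) {
      const sorted = [...durations].sort((a, b) => a - b);
      return sorted[Math.ceil((p / 100) * sorted.length) - 1];
    },
    errorRate: () => (requests === 0 ? 0 : errors / requests),
    requestCount: () => requests,
  };
}
```

Rate is derived from the request counter over time; the error-rate and percentile values here are exactly what the RED alerts fire on.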
Rule 5: Alerting That Actually Works
The rule: 'Alert on symptoms, not causes. Symptom: error rate > 1% for 5 minutes. Cause: database connection pool exhausted. Alert on the symptom — it catches all causes, including ones you did not anticipate. Use multi-window alerts: if error rate > 5% for 1 minute OR > 1% for 15 minutes — catches both spikes and slow burns.'
For alert routing: 'Critical (pages oncall): error rate > 5%, P99 > 2x SLA, zero traffic for 5 minutes. Warning (Slack): error rate > 1%, P95 > SLA, unusual traffic pattern. Info (email/dashboard): daily summary, weekly trends, capacity planning. Never alert on everything — alert fatigue causes real alerts to be ignored.'
For log-based alerts: 'Alert on: new error type (never seen before), error rate spike (relative increase, not absolute threshold), specific error codes (PAYMENT_FAILED count > 10/minute), and absence (no healthcheck logs for 5 minutes = service is down). Use structured logs for reliable alerting — unstructured text produces false positives and missed alerts.'
Alert on 'error rate > 1% for 5 minutes' — it catches every cause, including ones you did not anticipate. Alerting on specific causes ('database timeout') misses the novel failure modes that actually page you at 3 AM.
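The multi-window rule above reduces to a small boolean check. In practice this logic lives in Prometheus/Alertmanager alert rules, not application code; shouldAlert and the window shape are a hypothetical sketch of the condition:

```javascript
// Sketch of a multi-window symptom alert: fire on a fast spike
// (>5% errors over 1 minute) OR a slow burn (>1% over 15 minutes).
function shouldAlert(windows) {
  // windows: { oneMin: { requests, errors }, fifteenMin: { requests, errors } }
  const rate = (w) => (w.requests === 0 ? 0 : w.errors / w.requests);
  return rate(windows.oneMin) > 0.05 || rate(windows.fifteenMin) > 0.01;
}
```

The short window catches sudden outages quickly; the long window catches low-grade failures that never spike but quietly burn error budget.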
Complete Logging & Observability Rules Template
Consolidated rules for logging and observability.
- Structured JSON logger: pino/structlog/zerolog — never console.log/print/Println
- Every entry: level, message, timestamp, requestId, context fields — all searchable
- Log levels: error (alert), warn (monitor), info (business events), debug (dev only)
- Correlation ID: requestId on every log — X-Request-Id header across services
- OpenTelemetry: traceId + spanId in logs — distributed tracing across services
- RED metrics: Rate, Errors, Duration per service — Prometheus + Grafana
- Alert on symptoms: error rate, latency, traffic drop — not specific causes
- Alert routing: critical → page, warning → Slack, info → dashboard — no alert fatigue