SRE Principles as AI Rules
Site Reliability Engineering applies software engineering principles to operations. Key principles that translate to AI rules:
- SLOs (Service Level Objectives): define acceptable reliability
- Error budgets: the allowed amount of unreliability (100% - SLO)
- Toil reduction: automate repetitive operational tasks
- Observability: metrics, logs, and traces for understanding system behavior
- Incident response: structured processes for detecting, responding to, and learning from failures
AI rules for SRE encode these principles into code generation. When the AI generates a new service: it includes SLO-aware design (circuit breakers, retries, fallbacks), observability instrumentation (metrics, structured logs, trace propagation), and operational readiness (health checks, graceful shutdown, configuration management). The SRE team's AI rules ensure that every new service is production-ready from the first deployment.
The SRE team produces AI rules that: define reliability patterns (retry with exponential backoff, circuit breakers, bulkheads), define observability standards (what to measure, how to log, how to trace), and define operational standards (deployment strategies, rollback procedures, incident response hooks).
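As a concrete instance of the reliability patterns above, here is a minimal sketch of retry with exponential backoff and full jitter. The function name and defaults are illustrative, not a specific organization's library:

```typescript
// Minimal retry-with-exponential-backoff sketch; names and defaults are illustrative.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        // Full jitter: sleep a random duration in [0, baseDelayMs * 2^attempt).
        const delayMs = Math.random() * baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

A production version would typically also distinguish retryable from non-retryable errors and cap the maximum delay.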
SLO-Driven Development and Error Budgets
SLOs define target reliability. Example: availability SLO = 99.9% (43.8 minutes of downtime per month allowed), latency SLO = p99 < 200ms, error rate SLO = < 0.1% of requests return 5xx. AI rule: 'Every service has SLOs defined before deployment. The AI generates code designed to meet the SLOs: connection pooling and caching for latency targets, retries and fallbacks for availability targets, and input validation for error rate targets.'
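The budget arithmetic can be made concrete. A small sketch, assuming a 30.44-day average month (which is where the commonly quoted 43.8 minutes comes from); the function names are illustrative:

```typescript
// Downtime allowance implied by an availability SLO.
// Assumes an average month of 30.44 days (365.25 / 12).
const MINUTES_PER_MONTH = 30.44 * 24 * 60; // ≈ 43,834 minutes

function errorBudgetMinutes(sloPercent: number): number {
  return ((100 - sloPercent) / 100) * MINUTES_PER_MONTH;
}

function remainingBudgetMinutes(sloPercent: number, downtimeSoFarMin: number): number {
  return errorBudgetMinutes(sloPercent) - downtimeSoFarMin;
}
```

For a 99.9% SLO this yields roughly 43.8 minutes of allowed downtime per month; a 99.99% SLO leaves only about 4.4 minutes.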
Error budget: the difference between 100% and the SLO. 99.9% availability = 0.1% error budget = 43.8 minutes per month. When the error budget is consumed: freeze feature deployments and focus on reliability. AI rule: 'Error budget monitoring: generate dashboards that show remaining error budget in real-time. When error budget is below 50%: the AI should generate more conservative code (additional retries, larger timeouts, more aggressive caching). When error budget is exhausted: flag that feature deployments should be paused.'
SLO-informed architecture: the SLO determines the architecture. A 99.9% availability SLO: can be met with a single region, standard deployment. A 99.99% availability SLO: requires multi-region, active-active, automatic failover. AI rule: 'The SLO informs the architecture. Do not over-engineer (99.99% architecture for a 99.9% SLO wastes resources). Do not under-engineer (single-region for a 99.99% SLO is a design failure). The AI should ask: what is the SLO? and design accordingly.'
Error budget is the most practical SRE concept: if the service has budget remaining (say 30 minutes of the 43.8 minutes allowed this month): ship features, take risks, move fast. If the budget is nearly exhausted: slow down, focus on reliability, fix the issues consuming the budget. This replaces subjective debates ('is it safe to deploy?') with objective data ('do we have error budget?'). The AI should generate error budget dashboards as part of every service's observability setup.
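The "do we have error budget?" decision can be encoded directly. A sketch using the thresholds from the rules above (the 50% cutoff and the exhausted-budget freeze); the policy names are illustrative:

```typescript
type DeployPolicy = "ship" | "conservative" | "freeze";

// Map the remaining error budget (as a fraction of the monthly allowance)
// to a deployment posture. Thresholds follow the rules in this document.
function deployPolicy(budgetRemainingFraction: number): DeployPolicy {
  if (budgetRemainingFraction <= 0) return "freeze";        // budget exhausted: pause features
  if (budgetRemainingFraction < 0.5) return "conservative"; // below 50%: more retries, caching
  return "ship";                                            // budget healthy: move fast
}
```

This is the objective replacement for the "is it safe to deploy?" debate: the dashboard computes the fraction, the policy follows mechanically.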
Observability: Metrics, Logs, Traces
Metrics (RED method): Rate (requests per second), Errors (error rate), Duration (latency distribution). Every endpoint emits RED metrics. AI rule: 'Generate metrics instrumentation for every endpoint: request count (with labels for method, path, status code), error count (with labels for error type), and duration histogram (p50, p90, p99). Use the organization's metrics library (Prometheus client, OpenTelemetry metrics). Emit metrics at the middleware level, not per-endpoint; this ensures no endpoint is missed.'
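A minimal sketch of middleware-level RED instrumentation. A real service would use the organization's Prometheus or OpenTelemetry client; the in-memory recorder, types, and names here are illustrative:

```typescript
// In-memory RED recorder, keyed by "method path status".
class RedMetrics {
  requests = new Map<string, number>();
  errors = new Map<string, number>();
  durationsMs = new Map<string, number[]>();

  record(method: string, path: string, status: number, durationMs: number) {
    const key = `${method} ${path} ${status}`;
    this.requests.set(key, (this.requests.get(key) ?? 0) + 1);
    if (status >= 500) this.errors.set(key, (this.errors.get(key) ?? 0) + 1);
    const buf = this.durationsMs.get(key) ?? [];
    buf.push(durationMs);
    this.durationsMs.set(key, buf);
  }
}

// Middleware-level wrapper: every handler passes through here,
// so no endpoint can forget its instrumentation.
function withRedMetrics(
  metrics: RedMetrics,
  handler: (req: { method: string; path: string }) => { status: number },
) {
  return (req: { method: string; path: string }) => {
    const start = Date.now();
    const res = handler(req);
    metrics.record(req.method, req.path, res.status, Date.now() - start);
    return res;
  };
}
```

The design point is the wrapper: instrumentation attached once at the middleware layer, rather than repeated (and eventually forgotten) in each handler.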
Structured logging: every log entry is structured JSON with standard fields. AI rule: 'Log format: JSON with fields: timestamp, level, service, message, trace_id, span_id, and context-specific fields. No unstructured log.info("something happened"). Every log entry must be parseable by the log aggregation system. Include correlation IDs (trace_id) in every log entry for cross-service request tracing.'
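A sketch of a logger that enforces the structured format. The field names follow the rule above; the function itself is an illustrative assumption, not a specific logging library:

```typescript
// Emit one JSON object per line with the standard fields, plus
// context-specific fields supplied by the caller.
function logEvent(
  level: "info" | "warn" | "error",
  service: string,
  message: string,
  traceId: string,
  fields: Record<string, unknown> = {},
): string {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service,
    message,
    trace_id: traceId,
    ...fields,
  };
  const line = JSON.stringify(entry); // parseable by the log aggregation system
  console.log(line);
  return line;
}
```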
Distributed tracing: trace requests across service boundaries. AI rule: 'OpenTelemetry instrumentation: propagate trace context (traceparent header) across all HTTP and gRPC calls. Generate spans for: incoming requests, outgoing HTTP calls, database queries, cache operations, and queue operations. Include span attributes: http.method, http.url, db.statement (sanitized), and custom business attributes. The AI should use the platform's tracing library, not vendor-specific SDKs.'
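To show what propagation means mechanically, here is a hand-rolled sketch of W3C traceparent parsing and emission. In practice the OpenTelemetry SDK handles this; `randomHex` is a toy ID generator and the whole block is illustrative:

```typescript
interface TraceContext { traceId: string; spanId: string; }

// Toy ID generator; real SDKs use cryptographically sourced random IDs.
function randomHex(bytes: number): string {
  return Array.from({ length: bytes * 2 }, () =>
    Math.floor(Math.random() * 16).toString(16),
  ).join("");
}

// W3C format: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>.
function parseTraceparent(header: string | undefined): TraceContext {
  const m = header?.match(/^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/);
  // Continue the incoming trace, or start a new one at the edge.
  const traceId = m ? m[1] : randomHex(16);
  return { traceId, spanId: randomHex(8) }; // new span ID for this hop
}

function toTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-01`;
}
```

The essential property: the trace ID survives every hop unchanged (so logs and spans across services correlate), while each hop mints its own span ID.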
An unstructured line like console.log('Payment failed for user ' + userId) is human-readable but invisible to log aggregation queries, automated alerting rules, and dashboards. The structured equivalent: {event: 'payment_failed', userId: 'abc', amount: 1299, reason: 'insufficient_funds', traceId: 'xyz'}. Now you can query for all payment failures, alert when the failure rate exceeds a threshold, graph failure reasons, and trace the request across services. The AI must generate structured log entries with consistent field names.
Incident Response and Toil Reduction
Incident response automation: when an alert fires, the system should: create an incident record, page the on-call engineer, assemble an incident channel, provide relevant dashboards and runbooks, and start a timeline. AI rule: 'Generate alerting rules alongside the service. For each SLO: a corresponding alert that fires when the SLO is at risk. Include: alert severity, runbook link, dashboard link, and escalation policy. The AI generates the alert configuration (Prometheus alert rules, PagerDuty integration) as part of the service deployment.'
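A sketch of generating one alert definition per SLO with the required links. The expression, label, and path conventions here are assumptions for illustration, not a specific organization's standard:

```typescript
interface Slo {
  service: string;
  name: string;
  expr: string;              // Prometheus-style expression that fires when the SLO is at risk
  severity: "page" | "ticket";
}

// Build a Prometheus-style alert rule object carrying severity,
// runbook link, and dashboard link, per the rule above.
function alertRule(slo: Slo) {
  return {
    alert: `${slo.service}_${slo.name}_slo_at_risk`,
    expr: slo.expr,
    for: "5m", // require sustained breach before paging
    labels: { severity: slo.severity, service: slo.service },
    annotations: {
      runbook: `docs/runbooks/${slo.service}_${slo.name}.md`,
      dashboard: `dashboards/${slo.service}`,
    },
  };
}
```

Generating the rule object and the runbook path together is what keeps the alert-to-runbook link from drifting.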
Runbook generation: runbooks document how to diagnose and resolve known issues. AI rule: 'For each alert: generate a corresponding runbook. Runbook structure: alert description (what triggered), impact assessment (what is affected), diagnostic steps (what to check first), remediation steps (how to fix), and escalation (when to escalate). The AI generates runbooks as Markdown files in the docs/runbooks/ directory, linked from the alert configuration.'
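Runbook skeletons with this structure are easy to generate from alert metadata. A sketch; the headings follow the rule above, and the placeholder content is illustrative:

```typescript
// Generate a runbook skeleton (Markdown) for an alert, with the five
// sections from the rule. TODO markers flag what a human must fill in.
function runbookMarkdown(alertName: string, description: string): string {
  return [
    `# Runbook: ${alertName}`,
    "",
    "## Alert description",
    description,
    "",
    "## Impact assessment",
    "TODO: what user-facing behavior is affected?",
    "",
    "## Diagnostic steps",
    "1. Check the service dashboard for error rate and latency.",
    "2. Check recent deployments and config changes.",
    "",
    "## Remediation steps",
    "TODO: known fixes for this alert.",
    "",
    "## Escalation",
    "TODO: when and to whom to escalate.",
  ].join("\n");
}
```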
Toil reduction: toil is repetitive, manual, automatable work that scales linearly with service growth. The SRE team's goal: keep toil below 50% of team time. AI rule: 'When the AI generates operational procedures: generate automation, not manual steps. Database migration: automated script, not manual SQL. Certificate rotation: automated job, not manual renewal. Log cleanup: automated retention policy, not manual deletion. Every manual procedure the AI generates should include a TODO for automation.'
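As a small example of "automation, not manual steps", here is the selection logic at the core of an automated retention job (the log cleanup case above); names and the retention threshold are illustrative:

```typescript
// Select artifacts older than the retention window for automated deletion,
// replacing the manual "delete old logs" toil.
function expired(
  artifacts: { name: string; ageDays: number }[],
  retentionDays: number,
): string[] {
  return artifacts.filter((a) => a.ageDays > retentionDays).map((a) => a.name);
}
```

A real job would run this on a schedule against the storage backend; the point is that the policy lives in code, not in an engineer's memory.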
Without runbooks: the senior engineer who has been on-call for 3 years knows how to fix the issue. Everyone else escalates. With runbooks: any on-call engineer can follow the diagnostic and remediation steps. Runbooks capture tribal knowledge and make it accessible. The AI should generate a runbook for every alert it creates. The runbook does not need to be perfect; it needs to give the on-call engineer a starting point instead of a blank page.
SRE AI Rules Summary
Summary of AI rules for SRE teams maintaining production reliability.
- SLOs: defined before deployment. Code designed to meet targets. Error budget monitored in real-time
- Error budget: < 50% → conservative code. Exhausted → pause feature deployments
- Architecture: matches the SLO. 99.9% = single region. 99.99% = multi-region active-active
- Metrics: RED method on every endpoint. Middleware-level instrumentation. Standard labels
- Logging: structured JSON, correlation IDs, standard fields. Parseable by log aggregation
- Tracing: OpenTelemetry, trace context propagation, spans for HTTP/DB/cache/queue
- Incidents: alerting per SLO, auto-created incident channels, runbooks linked from alerts
- Toil: automate everything. < 50% toil target. AI generates automation, not manual procedures