Why Kubernetes Manifest Rules Are Critical
Kubernetes manifests define how your application runs in production — resource allocation, networking, security boundaries, and failure recovery. AI-generated manifests that 'work' in a dev cluster can cause outages, security breaches, and runaway costs in production because they skip every production hardening pattern.
The most common AI failures: no resource limits (one pod consumes all node resources, starving others), running as root (container escape gives node-level access), no health probes (Kubernetes routes traffic to unhealthy pods), no pod disruption budgets (rolling updates cause downtime), and no network policies (every pod can talk to every other pod).
Unlike application code where you can iterate quickly, Kubernetes misconfigurations in production cause immediate operational incidents. These rules are your safety net.
Rule 1: Resource Requests and Limits
The rule: 'Every container must have resource requests and limits defined. Requests guarantee minimum resources for scheduling. Limits cap maximum usage to prevent noisy-neighbor problems. CPU: set requests to typical usage, limits to 2-4x requests. Memory: set requests equal to limits (memory is non-compressible — exceeding limits causes OOMKill).'
For sizing: 'Start with: requests: { cpu: 100m, memory: 256Mi }, limits: { cpu: 500m, memory: 256Mi } for small services — memory requests equal to limits, per the rule above. Monitor actual usage with metrics (Prometheus, Datadog) and adjust. Never set limits without requests — it breaks the scheduler's ability to pack pods efficiently.'
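Put into manifest form, the starting point above looks like this (the deployment name and image are placeholders; tune the numbers from observed metrics):

```yaml
# Hypothetical small service; values follow the sizing guidance above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0  # placeholder image
          resources:
            requests:
              cpu: 100m       # typical usage; guarantees scheduling
              memory: 256Mi   # memory requests = limits (non-compressible)
            limits:
              cpu: 500m       # 2-4x the CPU request
              memory: 256Mi   # exceeding this triggers OOMKill
```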
AI assistants omit resource specs entirely because they're optional in Kubernetes. Without limits, a single pod can consume all CPU and memory on a node, starving every other pod. This is the #1 production incident cause from AI-generated manifests.
Rule 2: Security Contexts
The rule: 'Every pod must have a security context that: runs as non-root (runAsNonRoot: true), drops all capabilities (drop: [ALL]), uses a read-only root filesystem (readOnlyRootFilesystem: true), and prevents privilege escalation (allowPrivilegeEscalation: false). Mount writable volumes only for directories that genuinely need writes (tmp, logs, upload cache).'
For service accounts: 'Use dedicated service accounts per deployment — never the default service account. Set automountServiceAccountToken: false unless the pod needs to call the Kubernetes API. Bind RBAC roles with minimum required permissions.'
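A dedicated service account with RBAC, as described above, might look like this (the names and the ConfigMap-read permission are illustrative, not prescribed by the text):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-api            # hypothetical name; one per deployment
automountServiceAccountToken: false  # enable only if the pod calls the K8s API
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: example-api-reader
rules:
  - apiGroups: [""]
    resources: ["configmaps"]  # grant only what the workload needs
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-api-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: example-api-reader
subjects:
  - kind: ServiceAccount
    name: example-api
```

The pod spec then references it with serviceAccountName: example-api.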
The minimal security context template: securityContext: { runAsNonRoot: true, runAsUser: 1000, fsGroup: 1000, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, capabilities: { drop: [ALL] } }. This should be the default on every container — remove restrictions only with documented justification.
- runAsNonRoot: true — never run as root in production
- readOnlyRootFilesystem: true — writable tmpdir via emptyDir volume
- allowPrivilegeEscalation: false — prevents container escape
- capabilities: { drop: [ALL] } — add back only what's needed
- Dedicated service accounts — never default, automount only when needed
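Expanded into pod-spec form, the template above looks like this (the emptyDir mount for /tmp is one common way to satisfy the read-only root filesystem requirement):

```yaml
spec:
  securityContext:              # pod-level: user and group identity
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
    - name: api
      image: registry.example.com/example-api:1.0.0  # placeholder
      securityContext:          # container-level hardening
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]         # add back individual capabilities if needed
      volumeMounts:
        - name: tmp
          mountPath: /tmp       # writable scratch space on a read-only root
  volumes:
    - name: tmp
      emptyDir: {}
```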
Rule 3: Health Probes
The rule: 'Every deployment must have livenessProbe and readinessProbe defined. livenessProbe: is the process alive? (restart if not). readinessProbe: can it serve traffic? (remove from load balancer if not). Use httpGet probes for web services, exec probes for non-HTTP services. Set initialDelaySeconds based on actual startup time — don't guess.'
For startup: 'Use startupProbe for slow-starting applications (JVM, large model loading). The startup probe runs first — liveness and readiness don't start until startup succeeds. This prevents Kubernetes from killing pods that are still initializing.'
For configuration: 'livenessProbe: httpGet /healthz, period 10s, failure threshold 3 (restart after 30s of failure). readinessProbe: httpGet /ready, period 5s, failure threshold 2 (stop traffic after 10s of failure). The health endpoint should check dependencies — database connectivity, cache availability — not just return 200.'
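The probe configuration above, sketched for a slow-starting web service (paths and timings follow the text; the port and startup threshold are assumptions to tune against measured startup time):

```yaml
containers:
  - name: api
    startupProbe:               # runs first; gates liveness and readiness
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30      # allows up to ~150s to initialize
    livenessProbe:              # is the process alive? restart if not
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3       # restart after ~30s of failure
    readinessProbe:             # can it serve traffic? drain if not
      httpGet:
        path: /ready            # should check dependencies, not just return 200
        port: 8080
      periodSeconds: 5
      failureThreshold: 2       # stop traffic after ~10s of failure
```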
Rule 4: Deployment Strategy and Disruption Budgets
The rule: 'Use RollingUpdate strategy with maxSurge: 1 and maxUnavailable: 0 for zero-downtime deployments. Define PodDisruptionBudget with minAvailable: 1 (or percentage) to prevent voluntary disruptions from taking all pods offline. Set terminationGracePeriodSeconds to match your application's shutdown time — default 30s is often too short for connection draining.'
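The zero-downtime settings above translate to the following fragments (the name and label are placeholders; selector and containers are omitted for brevity):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1               # one extra pod during rollout
      maxUnavailable: 0         # never drop below the desired count
  template:
    spec:
      terminationGracePeriodSeconds: 60  # allow connection draining
      # containers, probes, securityContext as in the rules above
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-api
spec:
  minAvailable: 1               # voluntary disruptions keep >= 1 pod up
  selector:
    matchLabels:
      app: example-api
```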
For replicas: 'Minimum 2 replicas for any production service. Use HorizontalPodAutoscaler (HPA) for dynamic scaling based on CPU or custom metrics. Set HPA minReplicas >= 2, maxReplicas based on your capacity plan. Never run a single replica in production — it's a single point of failure.'
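A matching HorizontalPodAutoscaler might look like this (the 70% CPU target and maxReplicas of 10 are illustrative choices, not from the text):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 2                # never a single replica in production
  maxReplicas: 10               # from your capacity plan
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # illustrative target
```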
For rollback: 'Set revisionHistoryLimit: 5 to keep recent ReplicaSets for fast rollback. Use kubectl rollout undo for instant rollback. Define deployment annotations to track which commit each revision corresponds to.'
Rule 5: Networking and Network Policies
The rule: 'Define NetworkPolicy for every deployment. Default deny all ingress and egress. Explicitly allow only required traffic: frontend → API on port 8080, API → database on port 5432, API → cache on port 6379. This prevents lateral movement if one pod is compromised.'
For services: 'Use ClusterIP (internal) by default — never NodePort or LoadBalancer for internal services. Use Ingress or Gateway API for external traffic with TLS termination. Define resource-specific services — one Service per Deployment, matching selectors.'
For DNS: 'Use fully qualified service names for cross-namespace references: service.namespace.svc.cluster.local. Use short names within the same namespace: service. Configure external DNS entries through Ingress annotations, not manual records.'
- NetworkPolicy: default deny, explicit allow for required traffic paths
- ClusterIP by default — Ingress/Gateway for external traffic with TLS
- One Service per Deployment — matching label selectors
- FQDN for cross-namespace: service.namespace.svc.cluster.local
- Ingress with TLS — cert-manager for automatic certificate management
Without NetworkPolicy, every pod can talk to every other pod. Default deny + explicit allow prevents lateral movement if one pod is compromised — the network equivalent of least privilege.
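A default-deny policy plus one explicit allow, following the frontend → API example above (labels are placeholders; both policies apply per namespace):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}               # selects every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels:
      app: api                  # placeholder label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend     # placeholder label
      ports:
        - protocol: TCP
          port: 8080
```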
Complete Kubernetes Rules Template
Consolidated rules for Kubernetes manifests. These apply to any K8s distribution — EKS, GKE, AKS, or self-managed.
- Resource requests AND limits on every container — memory requests = limits
- Security context: non-root, read-only FS, drop all caps, no privilege escalation
- livenessProbe + readinessProbe + startupProbe (slow apps) on every deployment
- RollingUpdate with maxUnavailable: 0 — PodDisruptionBudget with minAvailable: 1
- Minimum 2 replicas + HPA for production — never single replica
- NetworkPolicy: default deny, explicit allow per traffic path
- Dedicated service accounts — RBAC with least privilege
- kube-linter or Polaris in CI — fail on security and best practice violations
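A minimal CI step for the last rule, shown as a GitHub Actions sketch (the manifests/ path is a placeholder, and kube-linter is assumed to be installed on the runner):

```yaml
# .github/workflows/k8s-lint.yml (sketch)
name: k8s-lint
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint Kubernetes manifests
        run: kube-linter lint manifests/   # non-zero exit fails the build
```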