Why Kubernetes Manifest Rules Are Critical
Kubernetes manifests define how your application runs in production — resource allocation, networking, security boundaries, and failure recovery. AI-generated manifests that 'work' in a dev cluster can cause outages, security breaches, and runaway costs in production because they skip every production hardening pattern.
The most common AI failures: no resource limits (one pod consumes all node resources, starving others), running as root (container escape gives node-level access), no health probes (Kubernetes routes traffic to unhealthy pods), no pod disruption budgets (rolling updates cause downtime), and no network policies (every pod can talk to every other pod).
Unlike application code where you can iterate quickly, Kubernetes misconfigurations in production cause immediate operational incidents. These rules are your safety net.
Rule 1: Resource Requests and Limits
The rule: 'Every container must have resource requests and limits defined. Requests guarantee minimum resources for scheduling. Limits cap maximum usage to prevent noisy-neighbor problems. CPU: set requests to typical usage, limits to 2-4x requests. Memory: set requests equal to limits (memory is non-compressible — exceeding limits causes OOMKill).'
For sizing: 'Start with: requests: { cpu: 100m, memory: 256Mi }, limits: { cpu: 500m, memory: 256Mi } for small services — memory requests equal to limits, per the rule above. Monitor actual usage with metrics (Prometheus, Datadog) and adjust. Never set limits without requests — it breaks the scheduler's ability to pack pods efficiently.'
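Put into manifest form, the starting point above looks like this (the deployment name and image are placeholders; tune the numbers from observed metrics):

```yaml
# Hypothetical small service; values follow the sizing guidance above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0  # placeholder image
          resources:
            requests:
              cpu: 100m       # typical usage; guarantees scheduling
              memory: 256Mi   # memory requests = limits (non-compressible)
            limits:
              cpu: 500m       # 2-4x the CPU request
              memory: 256Mi   # exceeding this triggers OOMKill
```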
AI assistants omit resource specs entirely because they're optional in Kubernetes. Without limits, a single pod can consume all CPU and memory on a node, starving every other pod. This is the #1 production incident cause from AI-generated manifests.
Rule 2: Security Contexts
The rule: 'Every pod must have a security context that: runs as non-root (runAsNonRoot: true), drops all capabilities (drop: [ALL]), uses a read-only root filesystem (readOnlyRootFilesystem: true), and prevents privilege escalation (allowPrivilegeEscalation: false). Mount writable volumes only for directories that genuinely need writes (tmp, logs, upload cache).'
For service accounts: 'Use dedicated service accounts per deployment — never the default service account. Set automountServiceAccountToken: false unless the pod needs to call the Kubernetes API. Bind RBAC roles with minimum required permissions.'
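A dedicated service account with RBAC, as described above, might look like this (the names and the ConfigMap-read permission are illustrative, not prescribed by the text):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-api            # hypothetical name; one per deployment
automountServiceAccountToken: false  # enable only if the pod calls the K8s API
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: example-api-reader
rules:
  - apiGroups: [""]
    resources: ["configmaps"]  # grant only what the workload needs
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-api-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: example-api-reader
subjects:
  - kind: ServiceAccount
    name: example-api
```

The pod spec then references it with serviceAccountName: example-api.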
The minimal security context template: securityContext: { runAsNonRoot: true, runAsUser: 1000, fsGroup: 1000, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, capabilities: { drop: [ALL] } }. This should be the default on every container — remove restrictions only with documented justification.
- runAsNonRoot: true — never run as root in production
- readOnlyRootFilesystem: true — writable tmpdir via emptyDir volume
- allowPrivilegeEscalation: false — prevents container escape
- capabilities: { drop: [ALL] } — add back only what's needed
- Dedicated service accounts — never default, automount only when needed
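Expanded into pod-spec form, the template above looks like this (the emptyDir mount for /tmp is one common way to satisfy the read-only root filesystem requirement):

```yaml
spec:
  securityContext:              # pod-level: user and group identity
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
    - name: api
      image: registry.example.com/example-api:1.0.0  # placeholder
      securityContext:          # container-level hardening
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]         # add back individual capabilities if needed
      volumeMounts:
        - name: tmp
          mountPath: /tmp       # writable scratch space on a read-only root
  volumes:
    - name: tmp
      emptyDir: {}
```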
Rule 3: Health Probes
The rule: 'Every deployment must have livenessProbe and readinessProbe defined. livenessProbe: is the process alive? (restart if not). readinessProbe: can it serve traffic? (remove from load balancer if not). Use httpGet probes for web services, exec probes for non-HTTP services. Set initialDelaySeconds based on actual startup time — don't guess.'
For startup: 'Use startupProbe for slow-starting applications (JVM, large model loading). The startup probe runs first — liveness and readiness don't start until startup succeeds. This prevents Kubernetes from killing pods that are still initializing.'
For configuration: 'livenessProbe: httpGet /healthz, period 10s, failure threshold 3 (restart after 30s of failure). readinessProbe: httpGet /ready, period 5s, failure threshold 2 (stop traffic after 10s of failure). The health endpoint should check dependencies — database connectivity, cache availability — not just return 200.'
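The probe configuration above, sketched for a slow-starting web service (paths and timings follow the text; the port and startup threshold are assumptions to tune against measured startup time):

```yaml
containers:
  - name: api
    startupProbe:               # runs first; gates liveness and readiness
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30      # allows up to ~150s to initialize
    livenessProbe:              # is the process alive? restart if not
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3       # restart after ~30s of failure
    readinessProbe:             # can it serve traffic? drain if not
      httpGet:
        path: /ready            # should check dependencies, not just return 200
        port: 8080
      periodSeconds: 5
      failureThreshold: 2       # stop traffic after ~10s of failure
```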
Rule 4: Deployment Strategy and Disruption Budgets
The rule: 'Use RollingUpdate strategy with maxSurge: 1 and maxUnavailable: 0 for zero-downtime deployments. Define PodDisruptionBudget with minAvailable: 1 (or percentage) to prevent voluntary disruptions from taking all pods offline. Set terminationGracePeriodSeconds to match your application's shutdown time — default 30s is often too short for connection draining.'
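The zero-downtime settings above translate to the following fragments (the name and label are placeholders; selector and containers are omitted for brevity):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1               # one extra pod during rollout
      maxUnavailable: 0         # never drop below the desired count
  template:
    spec:
      terminationGracePeriodSeconds: 60  # allow connection draining
      # containers, probes, securityContext as in the rules above
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-api
spec:
  minAvailable: 1               # voluntary disruptions keep >= 1 pod up
  selector:
    matchLabels:
      app: example-api
```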
For replicas: 'Minimum 2 replicas for any production service. Use HorizontalPodAutoscaler (HPA) for dynamic scaling based on CPU or custom metrics. Set HPA minReplicas >= 2, maxReplicas based on your capacity plan. Never run a single replica in production — it's a single point of failure.'
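A matching HorizontalPodAutoscaler might look like this (the 70% CPU target and maxReplicas of 10 are illustrative choices, not from the text):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 2                # never a single replica in production
  maxReplicas: 10               # from your capacity plan
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # illustrative target
```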
For rollback: 'Set revisionHistoryLimit: 5 to keep recent ReplicaSets for fast rollback. Use kubectl rollout undo for instant rollback. Define deployment annotations to track which commit each revision corresponds to.'
Rule 5: Networking and Network Policies
The rule: 'Define NetworkPolicy for every deployment. Default deny all ingress and egress. Explicitly allow only required traffic: frontend → API on port 8080, API → database on port 5432, API → cache on port 6379. This prevents lateral movement if one pod is compromised.'
For services: 'Use ClusterIP (internal) by default — never NodePort or LoadBalancer for internal services. Use Ingress or Gateway API for external traffic with TLS termination. Define resource-specific services — one Service per Deployment, matching selectors.'
For DNS: 'Use fully qualified service names for cross-namespace references: service.namespace.svc.cluster.local. Use short names within the same namespace: service. Configure external DNS entries through Ingress annotations, not manual records.'
- NetworkPolicy: default deny, explicit allow for required traffic paths
- ClusterIP by default — Ingress/Gateway for external traffic with TLS
- One Service per Deployment — matching label selectors
- FQDN for cross-namespace: service.namespace.svc.cluster.local
- Ingress with TLS — cert-manager for automatic certificate management
Without NetworkPolicy, every pod can talk to every other pod. Default deny + explicit allow prevents lateral movement if one pod is compromised — the network equivalent of least privilege.
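A default-deny policy plus one explicit allow, following the frontend → API example above (labels are placeholders; both policies apply per namespace):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}               # selects every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels:
      app: api                  # placeholder label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend     # placeholder label
      ports:
        - protocol: TCP
          port: 8080
```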
Complete Kubernetes Rules Template
Consolidated rules for Kubernetes manifests. These apply to any K8s distribution — EKS, GKE, AKS, or self-managed.
- Resource requests AND limits on every container — memory requests = limits
- Security context: non-root, read-only FS, drop all caps, no privilege escalation
- livenessProbe + readinessProbe + startupProbe (slow apps) on every deployment
- RollingUpdate with maxUnavailable: 0 — PodDisruptionBudget with minAvailable: 1
- Minimum 2 replicas + HPA for production — never single replica
- NetworkPolicy: default deny, explicit allow per traffic path
- Dedicated service accounts — RBAC with least privilege
- kube-linter or Polaris in CI — fail on security and best practice violations
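A minimal CI step for the last rule, shown as a GitHub Actions sketch (the manifests/ path is a placeholder, and kube-linter is assumed to be installed on the runner):

```yaml
# .github/workflows/k8s-lint.yml (sketch)
name: k8s-lint
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint Kubernetes manifests
        run: kube-linter lint manifests/   # non-zero exit fails the build
```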