AI Rules for Infrastructure Teams

Infrastructure: The Foundation Everything Runs On

Infrastructure teams manage the cloud resources that applications depend on: compute (EC2, GCE, AKS), networking (VPCs, subnets, load balancers, DNS), storage (S3, GCS, EBS), databases (RDS, Cloud SQL, DynamoDB), and supporting services (queues, caches, CDNs). The infrastructure team's decisions affect: cost (cloud bills can spiral without governance), reliability (misconfigured networking causes outages), security (exposed resources lead to breaches), and performance (wrong instance types make applications slow).

Infrastructure AI rules encode: cloud architecture patterns (Well-Architected Framework), cost optimization (right-sizing, reserved instances, spot/preemptible), networking standards (VPC layout, security groups, CIDR allocation), and operational standards (tagging, monitoring, backup, DR). The AI generating infrastructure code must follow these patterns, or it risks creating expensive, insecure, or unreliable infrastructure.

The infrastructure team produces AI rules at two levels: infrastructure-specific rules (Terraform module conventions, networking standards, resource naming) and organization-wide rules (cost tagging required on all resources, encryption at rest required, public access blocked by default).

Cloud Architecture Patterns

Well-Architected Framework (AWS, Google Cloud, and Azure each publish a version): the versions share five core pillars that infrastructure AI rules should encode (naming varies slightly by provider, and the AWS version adds a sixth pillar, Sustainability). Operational Excellence: infrastructure as code, automated deployment, monitoring. Security: encryption, least privilege, network segmentation. Reliability: multi-AZ, auto-scaling, backup and DR. Performance Efficiency: right instance types, caching, CDN. Cost Optimization: right-sizing, reserved capacity, tagging.

Networking foundation: AI rule: 'VPC design: /16 CIDR for the VPC, one /24 subnet per tier in each availability zone. Public subnets: load balancers and NAT gateways. Private subnets: application servers and databases. No direct internet access for private subnets; outbound traffic goes through NAT gateways. Security groups: deny all inbound by default, allow only required ports from required sources. The AI generates networking with defense-in-depth: multiple layers of access control.'
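
The CIDR layout in this rule can be checked mechanically before any Terraform is written. The following is a minimal sketch, assuming a 10.0.0.0/16 VPC and two example AZ names; the function name plan_subnets and the "public first, then private" ordering are illustrative choices, not part of the rule.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, azs: list[str]) -> dict[str, dict[str, str]]:
    """Carve one public and one private /24 per AZ out of a /16 VPC."""
    vpc = ipaddress.ip_network(vpc_cidr)
    if vpc.prefixlen != 16:
        raise ValueError("expected a /16 VPC CIDR")
    # subnets() yields the 256 possible /24 blocks in ascending order.
    blocks = vpc.subnets(new_prefix=24)
    plan = {}
    for az in azs:
        # Assumption: allocate the public block first, then the private one.
        plan[az] = {"public": str(next(blocks)), "private": str(next(blocks))}
    return plan


# Hypothetical VPC range and AZ names, used only for illustration.
plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
for az, tiers in plan.items():
    print(az, tiers)
```

Allocating blocks sequentially keeps the plan deterministic, so the same inputs always produce the same subnet layout and the remaining /24 blocks stay free for later tiers.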

Multi-AZ by default: AI rule: 'All production workloads: deployed across at least 2 availability zones. Load balancers: multi-AZ. Databases: multi-AZ replicas. Auto-scaling groups: span multiple AZs. Single-AZ is acceptable only for development environments. The AI generates multi-AZ configurations for all production resources.'
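
A rule like this is easy to enforce as a review check. The sketch below assumes workloads are described by a small, hypothetical Workload record; it treats every environment other than "dev" as requiring at least two distinct AZs, which is one reading of "single-AZ is acceptable only for development".

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    environment: str              # assumed values: "dev", "staging", "prod"
    availability_zones: list[str]


def multi_az_violations(workloads: list[Workload]) -> list[str]:
    """Return names of non-dev workloads deployed to fewer than 2 distinct AZs."""
    return [
        w.name
        for w in workloads
        if w.environment != "dev" and len(set(w.availability_zones)) < 2
    ]


# Hypothetical inventory for illustration.
workloads = [
    Workload("api", "prod", ["us-east-1a", "us-east-1b"]),
    Workload("db", "prod", ["us-east-1a"]),
    Workload("cache", "prod", ["us-east-1a", "us-east-1a"]),  # same AZ twice
    Workload("sandbox", "dev", ["us-east-1a"]),
]
violations = multi_az_violations(workloads)
print(violations)
```

Counting distinct AZs (via set) matters: two instances in the same zone do not survive a zone outage.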

⚠️ No Direct Internet Access for Private Subnets

A database with a public IP in a subnet that routes to an internet gateway: reachable from the entire internet. One misconfigured security group: anyone can connect. The rule: private subnets have no internet gateway route, and resources in them get no public IPs. Outbound internet (for updates, API calls): through a NAT gateway. Inbound: only from load balancers or application servers in the same VPC, as allowed by security groups. The AI must never generate public IPs or internet gateway routes for databases, caches, or application servers.
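
The private-subnet rule can be expressed as a small lint over a network description. This is a sketch under stated assumptions: the Route and Subnet records are hypothetical, and the check relies on the AWS convention that internet gateway IDs start with "igw-" while NAT gateway IDs start with "nat-".

```python
from dataclasses import dataclass


@dataclass
class Route:
    destination: str   # e.g. "0.0.0.0/0"
    target: str        # gateway ID; AWS prefixes: "igw-" or "nat-"


@dataclass
class Subnet:
    name: str
    tier: str                      # assumed values: "public" or "private"
    routes: list[Route]
    assigns_public_ip: bool = False


def exposure_findings(subnet: Subnet) -> list[str]:
    """Flag private subnets that route to an internet gateway or hand out public IPs."""
    if subnet.tier != "private":
        return []
    findings = []
    if any(r.target.startswith("igw-") for r in subnet.routes):
        findings.append(f"{subnet.name}: route to an internet gateway")
    if subnet.assigns_public_ip:
        findings.append(f"{subnet.name}: assigns public IPs")
    return findings


# Hypothetical subnets and gateway IDs for illustration.
bad = Subnet("db-private-a", "private", [Route("0.0.0.0/0", "igw-0abc123")], True)
good = Subnet("app-private-a", "private", [Route("0.0.0.0/0", "nat-0def456")])
print(exposure_findings(bad))
print(exposure_findings(good))
```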

Cost Optimization and Resource Management

Resource tagging: AI rule: 'Every cloud resource must have tags: team (owning team), service (which service uses it), environment (dev/staging/prod), cost-center (for billing allocation), and managed-by (terraform/manual). Resources without tags: flagged for review. The AI generates tags on every resource it creates. Tagging enables: cost allocation per team, unused resource identification, and automated governance.'
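
The "flagged for review" step can be sketched as a function that reports which required tags a resource lacks. The tag keys come from the rule above; the function name and the choice to treat blank values as missing are assumptions.

```python
REQUIRED_TAGS = ("team", "service", "environment", "cost-center", "managed-by")


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return required tag keys that are absent or blank, in rule order."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


# Hypothetical resource tags for illustration.
resource_tags = {"team": "payments", "service": "checkout", "environment": "prod"}
print(missing_tags(resource_tags))
```

A check like this could run in CI against planned resources, so untagged infrastructure is caught before it is created rather than during a cost review.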

Right-sizing: AI rule: 'Start with smaller instances and scale up based on monitoring data. Do not default to large instances. Development: smallest viable instance. Staging: match production instance type at reduced count. Production: size based on load testing results. The AI should generate resource specifications with comments explaining the sizing rationale.'

Cost alerts and budgets: AI rule: 'Set budget alerts at 80% and 100% of expected monthly cost per team/service. Notify the owning team automatically when spend approaches or exceeds the budget. Generate cost anomaly detection: alert when daily spend increases by more than 20% compared to the trailing 7-day average. This catches runaway costs (forgotten resources, auto-scaling gone wrong) before the bill arrives.'
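
Both thresholds in this rule reduce to simple arithmetic. The sketch below uses the 80%/100% budget thresholds and the 20% over trailing 7-day average test from the rule; the function names and the decision to skip anomaly checks until seven days of history exist are assumptions.

```python
def budget_alerts(spend_to_date: float, monthly_budget: float,
                  thresholds: tuple[float, ...] = (0.8, 1.0)) -> list[float]:
    """Return the budget thresholds (as fractions) that spend has reached."""
    return [t for t in thresholds if spend_to_date >= t * monthly_budget]


def is_cost_anomaly(prior_daily_spend: list[float], today: float,
                    window: int = 7, max_increase: float = 0.20) -> bool:
    """True when today's spend exceeds the trailing average by more than max_increase.

    prior_daily_spend is ordered oldest first and excludes today.
    """
    if len(prior_daily_spend) < window:
        return False  # assumption: not enough history to judge
    baseline = sum(prior_daily_spend[-window:]) / window
    return today > baseline * (1 + max_increase)


# Hypothetical figures for illustration.
print(budget_alerts(8_500, 10_000))         # 80% threshold reached
print(is_cost_anomaly([100.0] * 7, 125.0))  # 25% above the 7-day average
```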

💡 Tags Are Your Cost Control Mechanism

Without tags: the monthly cloud bill is one big number. No one knows which team, service, or environment is responsible. With tags: filter costs by team (Team A: $5,000/mo, Team B: $12,000/mo), by service (payments: $3,000, analytics: $8,000), and by environment (prod: $15,000, dev: $2,000). Untagged resources: highlighted for cleanup. Tags turn an opaque bill into an actionable cost report. The AI must add tags to every resource it creates.

Disaster Recovery and Operations

Disaster recovery tiers: AI rule: 'Define the DR tier per service based on business criticality. Backup and Restore (RPO: hours, RTO: hours): daily backups, restore when needed. Pilot Light (RPO: minutes, RTO: tens of minutes): core components such as replicated data always running in the DR region, everything else provisioned and scaled up when needed. Warm Standby (RPO: seconds, RTO: minutes): scaled-down but fully functional version running in the DR region. Active-Active (RPO: near zero, RTO: near zero): full infrastructure serving traffic in both regions. The DR tier determines the infrastructure the AI generates.'
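
Tier selection can be sketched as "pick the cheapest tier whose worst-case RPO and RTO fit the service's targets". The numeric bounds below are illustrative assumptions that turn the rule's qualitative labels (hours, minutes, seconds) into seconds; they are not figures from any provider, and a real team would substitute its own measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrTier:
    name: str
    rpo_s: int   # assumed worst-case data loss window, in seconds
    rto_s: int   # assumed worst-case recovery time, in seconds


# Ordered cheapest first. All bounds are illustrative assumptions.
TIERS = [
    DrTier("backup-and-restore", rpo_s=24 * 3600, rto_s=24 * 3600),
    DrTier("pilot-light", rpo_s=3600, rto_s=3600),
    DrTier("warm-standby", rpo_s=60, rto_s=900),
    DrTier("active-active", rpo_s=0, rto_s=0),  # near zero, approximated as 0
]


def choose_tier(rpo_target_s: int, rto_target_s: int) -> DrTier:
    """Return the cheapest tier that meets both targets."""
    for tier in TIERS:
        if tier.rpo_s <= rpo_target_s and tier.rto_s <= rto_target_s:
            return tier
    raise ValueError("targets must be non-negative")


# Hypothetical service target: lose at most 5 minutes of data, recover within 30 minutes.
print(choose_tier(rpo_target_s=300, rto_target_s=1800).name)
```

Ordering the list by cost means the first match is also the least expensive option, which keeps the selection aligned with the cost optimization rules earlier in this guide.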

Backup standards: AI rule: 'All databases: automated daily backups with 30-day retention. Critical databases: point-in-time recovery enabled (RPO: minutes). All backups: encrypted at rest. Backup restoration: tested monthly (untested backups are not backups). Cross-region backup copies: for critical services. The AI generates backup configuration for every database and storage resource it creates.'

Infrastructure drift detection: AI rule: 'Monitor for infrastructure drift (manual changes that make real resources deviate from the Terraform configuration and state). Tools: Terraform plan in CI (a non-empty plan when no code changes are pending indicates drift), AWS Config or cloud-native drift detection (detect unplanned changes as they happen). Alert on drift. Remediate by applying Terraform (overwrite manual changes) or updating Terraform to match (incorporate intentional manual changes). The AI generates drift detection alongside IaC.'
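
One way to run the "Terraform plan in CI" check is the -detailed-exitcode flag of terraform plan, which exits with 0 when the plan is empty, 1 on error, and 2 when the plan contains changes. The sketch below wraps that in Python; the function names and the "drift" label are assumptions, and exit code 2 only means drift when the branch has no unapplied configuration changes.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a status label."""
    if code == 0:
        return "in-sync"   # plan succeeded, no changes
    if code == 2:
        return "drift"     # plan succeeded, changes present
    return "error"         # 1 (or anything unexpected): the plan itself failed


def check_drift(workdir: str) -> str:
    """Run a non-interactive plan in workdir and classify the result.

    Assumes the terraform CLI is installed and the directory is initialized.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)


# The classifier can be exercised without terraform installed.
print(classify_plan_exit(2))
```

A scheduled CI job could call check_drift on the main branch and send an alert whenever the result is not "in-sync".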

ℹ️ Untested Backups Are Not Backups

A backup that has never been restored might be: corrupted, missing data, incompatible with the current schema, or stored in a format that takes 12 hours to restore (when the RTO is 1 hour). Monthly restore testing verifies: the backup is complete, the restore process works, and the recovery time meets the RTO. Document the test results — the infrastructure team and auditors both need evidence that DR works.
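
The monthly restore test produces exactly the evidence described above, and it can be recorded and evaluated in a structured way. This is a minimal sketch: the RestoreTest record, the finding codes, and the 31-day staleness limit are assumptions chosen to match the monthly cadence.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RestoreTest:
    database: str
    tested_on: date
    restore_minutes: float   # measured time to a usable restored copy
    rto_minutes: float       # recovery time objective for this service
    data_verified: bool      # restored data passed completeness checks


def restore_findings(test: RestoreTest, today: date, max_age_days: int = 31) -> list[str]:
    """Return problem codes for a restore test record (empty list means healthy)."""
    findings = []
    if (today - test.tested_on).days > max_age_days:
        findings.append("stale-test")
    if test.restore_minutes > test.rto_minutes:
        findings.append("rto-exceeded")
    if not test.data_verified:
        findings.append("data-unverified")
    return findings


# Hypothetical record: a 12-hour restore against a 1-hour RTO.
record = RestoreTest("orders-db", date(2024, 5, 1), 720, 60, True)
print(restore_findings(record, today=date(2024, 5, 15)))
```

Keeping these records over time gives the infrastructure team and auditors a history of restore durations to compare against the RTO.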

Infrastructure AI Rules Summary

Summary of AI rules for infrastructure teams managing cloud resources and networking.

Well-Architected: operational excellence, security, reliability, performance, cost optimization
Networking: VPC /16, subnets /24, public/private separation, security groups deny-by-default
Multi-AZ: all production across 2+ AZs. Single-AZ only for development
Tagging: team, service, environment, cost-center, managed-by on every resource
Right-sizing: start small, scale based on monitoring. Comments explaining sizing rationale
Cost alerts: 80%/100% budget alerts. 20% daily anomaly detection. Per-team cost allocation
DR: tier per service (backup/pilot/warm/active-active). Monthly backup restore testing
Drift detection: Terraform plan in CI. Cloud-native drift monitoring. Alert and remediate