
AI Coding for ML Engineers

ML engineers bridge research and production: training pipelines, model optimization, inference serving, and monitoring. This guide covers AI coding rules for ML engineering: reproducible training, efficient inference, model lifecycle management, and production monitoring conventions.

5 min read·July 15, 2025

ML systems have more moving parts than traditional software. AI rules track every part — data version, config, model artifact — so reproducibility is the default.

Reproducible training, efficient inference, model lifecycle management, and production monitoring

Why ML Engineers Need AI Coding Rules

You are an ML engineer. You are not a data scientist (you do not spend your days exploring data) and you are not a software engineer (you do not build user-facing features). You build the systems that take ML models from research to production: training pipelines, inference servers, feature stores, model registries, and monitoring dashboards. Your code runs at scale (training on 8 GPUs, serving 10,000 requests per second) and must be reproducible (retrain the same model and get the same results). Without AI rules, each ML engineer builds pipelines differently, and training code that works on one engineer's machine fails on the cluster because of a hardcoded path or an unpinned dependency.

With AI rules: the AI generates production-grade ML engineering code by default. AI rule: 'All training scripts use the standard TrainingConfig dataclass for hyperparameters. All experiments log to MLflow with run tags (model, dataset, engineer). All model artifacts include config, metrics, requirements.txt, and a data fingerprint. All inference servers use the InferenceService base class with standard health and predict endpoints.' Every AI-generated pipeline: reproducible. Every model: traceable. Every inference server: consistent.

The ML engineering-specific benefit: ML systems have more moving parts than traditional software. Training depends on: data version, code version, library versions, hardware configuration, random seeds, and hyperparameters. If any of these change untracked: reproducibility is lost. AI rules: ensure every moving part is tracked by default. The ML engineer: focuses on model improvement, not on chasing down why last week's training run cannot be reproduced.

How AI Rules Standardize Training Pipelines

Training configuration: AI rule: 'All hyperparameters defined in a TrainingConfig dataclass (not scattered across argparse arguments). Config serialized to YAML and logged with every training run. Config includes: model architecture, optimizer, learning rate schedule, batch size, max epochs, early stopping patience, and data paths. No magic numbers in training code — all values from config.' The AI: generates self-documenting training scripts. The ML engineer: reproduces any experiment by loading its config file.
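A minimal sketch of the TrainingConfig pattern described above. The field names and defaults are illustrative, not from the source; the article's rule serializes to YAML, but JSON is used here to keep the sketch dependency-free (swap in `yaml.safe_dump`/`yaml.safe_load` in practice).

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TrainingConfig:
    # All hyperparameters in one place -- no magic numbers in training code.
    # Field names/defaults are hypothetical examples.
    model: str = "resnet50"
    optimizer: str = "adamw"
    learning_rate: float = 3e-4
    lr_schedule: str = "cosine"
    batch_size: int = 64
    max_epochs: int = 50
    early_stopping_patience: int = 5
    data_path: str = "/data/v2.3.1/"


def to_json(config: TrainingConfig) -> str:
    # Serialized and logged with every run (YAML in the article's rule).
    return json.dumps(asdict(config), indent=2, sort_keys=True)


def from_json(serialized: str) -> TrainingConfig:
    # Reproduce any experiment by loading its logged config.
    return TrainingConfig(**json.loads(serialized))
```

Because the config round-trips losslessly, reproducing a run is just "load the file, run the script" rather than reconstructing settings from git history.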

Data versioning: AI rule: 'All training data referenced by version (DVC hash, S3 version ID, or dataset registry version). Training scripts log the data version with the experiment. Never reference data by mutable paths (no /data/latest/ — use /data/v2.3.1/). Data preprocessing scripts: versioned alongside training scripts (a model is only reproducible if its data processing is also reproducible).' The AI: generates data-version-aware training code. The ML engineer: can trace any model's predictions back to the exact data it was trained on.
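A sketch of the data-versioning rule, assuming a hypothetical immutable layout like `/data/v2.3.1/`. The helper names and the root path are illustrative; in practice the version would come from DVC, an S3 version ID, or a dataset registry.

```python
import hashlib
from pathlib import Path

# Hypothetical versioned layout: /data/v2.3.1/, never /data/latest/.
DATA_ROOT = Path("/data")


def dataset_path(version: str) -> Path:
    # Reject mutable aliases so every run pins an immutable data version.
    if version in ("latest", "current"):
        raise ValueError("reference data by immutable version, not a mutable alias")
    return DATA_ROOT / f"v{version}"


def data_fingerprint(file_bytes: bytes) -> str:
    # SHA-256 fingerprint of the data, logged with the experiment for lineage.
    return hashlib.sha256(file_bytes).hexdigest()[:16]
```

Logging the fingerprint alongside the version catches the case where a "versioned" path was silently overwritten.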

Distributed training conventions: AI rule: 'All training scripts support single-GPU and multi-GPU execution without code changes (use PyTorch DistributedDataParallel). Training scripts detect hardware automatically (GPU count, GPU memory). Batch size scales with GPU count (effective batch = batch_per_gpu * gpu_count). Gradient accumulation for large effective batch sizes on limited hardware.' The AI: generates hardware-adaptive training code. The ML engineer: runs the same script on a laptop (1 GPU) and the cluster (8 GPUs) without modification. The principle: training pipeline rules solve the reproducibility crisis in ML. A model that cannot be reproduced cannot be improved (you do not know what changed), cannot be audited (you cannot explain its behavior), and cannot be trusted (you cannot verify its performance). AI rules make reproducibility the default, not a post-hoc effort.
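The batch-scaling arithmetic above can be sketched as a small planner. This is an assumption-laden simplification: in a real script `gpu_count` would come from `torch.cuda.device_count()`, and the function name is hypothetical.

```python
def batch_plan(target_effective_batch: int, batch_per_gpu: int, gpu_count: int) -> dict:
    # Effective batch scales with GPU count (batch_per_gpu * gpu_count);
    # gradient accumulation covers the gap when hardware alone cannot
    # reach the target effective batch size.
    per_step = batch_per_gpu * gpu_count
    accum_steps = max(1, -(-target_effective_batch // per_step))  # ceiling division
    return {
        "per_step": per_step,
        "accum_steps": accum_steps,
        "effective_batch": per_step * accum_steps,
    }
```

The same target effective batch is reached on a 1-GPU laptop (more accumulation steps) and an 8-GPU node (one step), which is what lets one script run unmodified on both.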

💡 TrainingConfig Dataclass Solves the 'Cannot Reproduce' Problem

ML engineer tries to reproduce last month's training run. Questions: What learning rate? What batch size? What data version? What early stopping patience? If hyperparameters were scattered across argparse arguments, environment variables, and hardcoded values: the answers are lost. AI rule: 'All hyperparameters in a TrainingConfig dataclass, serialized to YAML, logged with every run.' With one dataclass: every training run is fully specified. Reproduction: load the YAML, run the script. No archaeology through git history. No asking 'what settings did you use?' One pattern: solves the reproducibility crisis.

AI Rules for Efficient Inference Serving

Inference server structure: AI rule: 'All inference servers extend InferenceService with three endpoints: GET /health (returns model version and status), POST /predict (accepts batch inputs, returns predictions with latencies), GET /metrics (Prometheus format). Input validation with pydantic models. Output includes prediction, confidence, model version, and latency_ms.' The AI: generates consistent inference servers. The platform team: deploys any model using the same infrastructure. The monitoring system: works across all models because the endpoint contract is identical.
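A framework-agnostic sketch of the InferenceService contract described above. The method names and response fields follow the article's rule, but the class body is an assumption: a real service would mount `health`/`predict`/`metrics` on HTTP routes and validate inputs with pydantic.

```python
import time
from abc import ABC, abstractmethod


class InferenceService(ABC):
    # Assumed base-class shape: subclasses implement _predict_batch;
    # the base class provides the standard health/predict/metrics contract.
    def __init__(self, model_version: str):
        self.model_version = model_version
        self.prediction_count = 0

    @abstractmethod
    def _predict_batch(self, inputs: list) -> list:
        ...

    def health(self) -> dict:
        # GET /health: model version and status.
        return {"status": "ok", "model_version": self.model_version}

    def predict(self, inputs: list) -> list:
        # POST /predict: batch inputs in, per-item responses out,
        # each tagged with model version and latency.
        start = time.perf_counter()
        predictions = self._predict_batch(inputs)
        latency_ms = round((time.perf_counter() - start) * 1000, 2)
        self.prediction_count += len(inputs)
        return [
            {"prediction": p, "model_version": self.model_version, "latency_ms": latency_ms}
            for p in predictions
        ]

    def metrics(self) -> str:
        # GET /metrics: Prometheus exposition format (simplified to one counter).
        return f"predictions_total {self.prediction_count}"
```

Because every model subclasses the same base, the platform team's deployment and monitoring tooling never needs per-model special cases.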

Batching and optimization: AI rule: 'Inference servers support dynamic batching (collect requests within a 10ms window, batch-predict, return individual responses). Model loaded once at startup, not per-request. Use ONNX Runtime or TorchScript for optimized inference. GPU memory: pre-allocate based on maximum batch size. Maximum inference latency: 100ms p99 for real-time, 5s p99 for batch.' The AI: generates latency-aware inference code. The ML engineer: does not need to implement batching from scratch — the rules encode the pattern.
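The core of dynamic batching, stripped of the timing machinery: many waiting requests become one model call, and results fan back out by request ID. This sketch omits the 10ms collection window (which needs an event loop or timer thread) and uses hypothetical names.

```python
def batch_predict(pending: dict, predict_fn) -> dict:
    # pending: {request_id: input} -- requests collected during the window.
    # One model call serves every waiting request, then results are
    # fanned back out to their individual callers by request_id.
    request_ids = list(pending)
    predictions = predict_fn([pending[rid] for rid in request_ids])
    return dict(zip(request_ids, predictions))
```

In a real server, a timer flushes `pending` every ~10ms (or when it reaches the maximum batch size), so no request waits longer than the window before its batch runs.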

Model loading and caching: AI rule: 'Models loaded from the model registry by version tag (not from local file paths). Model caching: keep the current version and the previous version in memory (for instant rollback). Model updates: blue-green deployment (load new model, verify health, switch traffic). Never hot-swap models in-place (risk of serving with a partially loaded model).' The AI: generates safe model loading with rollback capability. The production system: updates models without downtime or serving errors. The principle: inference serving is where ML meets the SLA. A model is only useful if it serves predictions within the latency budget. AI rules encode the latency budget, batching strategy, and deployment pattern so every model meets the SLA from day one.
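A sketch of the keep-two-versions cache with blue-green deployment semantics. The class and method names are illustrative; `healthy` stands in for the verification step (load the new model, run health checks) before traffic switches.

```python
class ModelCache:
    # Keeps the current and previous model versions in memory so rollback
    # is an in-memory pointer swap, not a fresh load from the registry.
    def __init__(self):
        self.current = None
        self.previous = None

    def deploy(self, model, healthy: bool) -> bool:
        # Blue-green: only switch traffic after the new model verifies healthy.
        if not healthy:
            return False
        self.previous, self.current = self.current, model
        return True

    def rollback(self) -> bool:
        # Instant rollback to the cached previous version.
        if self.previous is None:
            return False
        self.current, self.previous = self.previous, self.current
        return True
```

An unhealthy deploy never touches `current`, which is what makes this safer than hot-swapping a model in-place.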

ℹ️ Dynamic Batching Keeps Per-Request Latency Flat at 27x Throughput

Single prediction: 10ms on GPU (most of that is GPU overhead, not computation). Batch of 32 predictions: 12ms on GPU (the computation parallelizes, only 2ms marginal). Without dynamic batching: 32 sequential requests take 320ms total. With dynamic batching (collect 10ms window, batch-predict): 32 requests take 12ms total. Nearly the same per-request latency (12ms vs 10ms). 27x better throughput. AI rule: 'All inference servers support dynamic batching.' One pattern: transforms GPU utilization from 3% (single predictions) to 80%+ (batched predictions).
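The callout's arithmetic, worked through as a one-liner (the function name is illustrative):

```python
def throughput_gain(single_ms: float, batch_ms: float, batch_size: int) -> float:
    # Sequential cost: batch_size * single_ms. Batched cost: one batch_ms call.
    return (batch_size * single_ms) / batch_ms
```

With the callout's numbers, 32 requests at 10ms each (320ms sequential) versus one 12ms batched call gives roughly a 27x throughput gain.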

AI Rules for Model Monitoring and Lifecycle

Prediction monitoring: AI rule: 'All prediction endpoints log: input hash, prediction, confidence, latency, model version, and timestamp. Log sampling: 100% for low-traffic models (<100 rps), 10% for high-traffic models (>100 rps). Anomaly detection: alert when prediction distribution shifts more than 2 standard deviations from the training distribution (data drift indicator).' The AI: generates monitored inference servers. The ML engineer: detects model degradation before it affects business metrics.
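A minimal sketch of the 2-standard-deviation drift check, assuming scalar prediction values. Real drift detection usually compares full distributions (e.g. KS test or population stability index); the mean-shift check below is the simplest version of the rule's threshold.

```python
from statistics import mean, stdev


def drifted(train_sample: list, live_sample: list, threshold: float = 2.0) -> bool:
    # Alert when the live prediction mean shifts more than `threshold`
    # training-set standard deviations from the training mean.
    mu, sigma = mean(train_sample), stdev(train_sample)
    return abs(mean(live_sample) - mu) > threshold * sigma
```

In production this would run over the sampled prediction logs on a schedule, with an alert (and potentially a retraining trigger) on a `True` result.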

Model lifecycle management: AI rule: 'Models in the registry have lifecycle stages: development, staging, production, archived. Promotion requires: passing all validation tests, latency benchmark within SLA, and approval from the model owner. Rollback: automated on error rate exceeding 2x baseline. Archival: models not served for 90 days are archived (removed from active serving infrastructure but retained in the registry for audit).' The AI: generates lifecycle-aware model code with proper stage metadata. The ML platform: manages hundreds of models with consistent lifecycle policies.
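The promotion gate and automated-rollback condition from the rule, sketched as two predicates. Function names and the boolean inputs are illustrative; in practice the inputs would come from the CI validation suite, the latency benchmark, and an approval record in the registry.

```python
STAGES = ["development", "staging", "production", "archived"]


def can_promote(stage: str, tests_passed: bool, latency_ok: bool, approved: bool) -> bool:
    # Promotion gate: validation tests + latency SLA + model-owner approval.
    # Only development and staging models have a next stage to promote to.
    return stage in ("development", "staging") and tests_passed and latency_ok and approved


def should_rollback(error_rate: float, baseline_error_rate: float) -> bool:
    # Automated rollback when the error rate exceeds 2x baseline.
    return error_rate > 2 * baseline_error_rate
```

Encoding the gates as code (rather than a checklist) is what lets a platform apply the same lifecycle policy uniformly across hundreds of models.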

A/B testing and canary deployment: AI rule: 'New models deployed as canary (5% traffic) for 24 hours before full rollout. A/B tests compare models on the same traffic split with statistical significance testing (minimum 1000 predictions per variant before evaluation). A/B test results logged to MLflow with the experiment ID linking the training run to the serving evaluation.' The AI: generates production-safe model deployment code. The ML engineer: validates models in production without risking full production traffic. The principle: model monitoring is not mere observability; it is the core feedback loop of ML engineering. A model without monitoring improves only when someone manually checks performance. A model with monitoring improves continuously, because degradation is detected automatically and triggers retraining. AI rules close the feedback loop by default.
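Two small pieces of the canary rule, sketched with illustrative names: routing 5% of traffic to the canary, and gating evaluation on the minimum sample size. Real systems route deterministically (e.g. by hashing a user ID) so a user sees a consistent variant; plain random routing is used here for brevity.

```python
import random


def route(canary_fraction: float = 0.05) -> str:
    # Send ~5% of traffic to the canary during the 24h bake period.
    return "canary" if random.random() < canary_fraction else "stable"


def ready_to_evaluate(n_control: int, n_canary: int, minimum: int = 1000) -> bool:
    # Significance testing only starts once each variant has enough
    # predictions (1000 per variant, per the rule).
    return n_control >= minimum and n_canary >= minimum
```

Gating on sample size before running the significance test avoids declaring a winner on noise from the first few hundred predictions.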

⚠️ A Model Without Monitoring Degrades Silently Until Someone Notices

Model deployed Monday. Performs well. Data distribution shifts gradually (seasonal change, new user demographics, feature drift). By Friday: model accuracy dropped 15%. Nobody noticed because there was no monitoring. Monday meeting: 'Why did conversion drop last week?' The model degraded — but the alert never fired because there was no alert. AI rule: 'Prediction logging with drift detection. Alert when distribution shifts > 2 standard deviations.' With monitoring: degradation detected within hours, not weeks. Retraining triggered automatically. The silent failure: eliminated by one monitoring rule.

ML Engineer Quick Reference for AI Coding

Quick reference for ML engineers using AI coding tools.

  • Core benefit: AI rules ensure reproducible training, consistent inference, and automated monitoring across all models
  • Training config: dataclass with all hyperparameters, serialized to YAML, logged with every experiment run
  • Data versioning: DVC hashes or registry versions, never mutable paths — full data lineage for every model
  • Distributed training: auto-detect hardware, scale batch size, gradient accumulation — same script on 1 or 8 GPUs
  • Inference: InferenceService base class, standard health/predict/metrics endpoints, dynamic batching
  • Model loading: registry-based versioning, blue-green deployment, previous version cached for instant rollback
  • Monitoring: prediction logging with drift detection, automated rollback on error rate spike, 24h canary deploys
  • Lifecycle: development → staging → production → archived, promotion requires validation and approval