ML Systems: Code + Data + Models
ML engineering differs from traditional software engineering: the output is not just code — it is code + data + trained models. Traditional software: the code is the artifact. Deploy the same code, get the same behavior. ML systems: the model is the artifact. The same training code with different data produces a different model with different behavior. AI rules for ML engineering must govern: the code (training scripts, serving infrastructure), the data (datasets, feature pipelines), and the models (versioning, evaluation, deployment).
The ML engineering stack: experiment tracking (MLflow, Weights & Biases, Neptune), feature stores (Feast, Tecton, Hopsworks), training orchestration (Kubeflow, SageMaker, Vertex AI), model registry (MLflow Model Registry, SageMaker Model Registry), and serving infrastructure (TensorFlow Serving, Triton, BentoML). AI rule: 'Detect the project's ML stack. Generate code that follows the stack's conventions. MLflow project: log parameters, metrics, and artifacts with the MLflow API. Kubeflow project: generate pipeline components with the Kubeflow SDK.'
The core ML engineering AI rules: every experiment must be tracked and reproducible, every model must be versioned and evaluable, every feature must be defined once and reused (feature store), and every deployment must be monitored for model drift.
Experiment Tracking and Reproducibility
Experiment tracking: every training run must log: parameters (hyperparameters, data version, feature set), metrics (loss, accuracy, F1, AUC — per epoch and final), artifacts (trained model, evaluation plots, confusion matrix), environment (Python version, library versions, GPU type), and git commit (which code produced this model). AI rule: 'Every training script: log all parameters, metrics, and artifacts to the experiment tracker. Never run training without tracking. The AI generates mlflow.log_param(), mlflow.log_metric(), and mlflow.log_artifact() calls alongside training code.'
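As a sketch of what a tracked run must capture, here is a minimal stand-in tracker (in an MLflow project these calls map one-to-one onto `mlflow.log_param()`, `mlflow.log_metric()`, and `mlflow.log_artifact()`; the point is what gets logged, not the backend):

```python
import platform
import subprocess
from dataclasses import dataclass, field

@dataclass
class RunLog:
    """Minimal stand-in for an experiment tracker backend."""
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)    # name -> [(step, value), ...]
    artifacts: list = field(default_factory=list)

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value, step=0):
        self.metrics.setdefault(key, []).append((step, value))

    def log_artifact(self, path):
        self.artifacts.append(path)

run = RunLog()
# Parameters: hyperparameters, data version, feature set
run.log_param("learning_rate", 0.01)
run.log_param("data_version", "v2025-01")
# Environment: capture programmatically, never by hand
run.log_param("python_version", platform.python_version())
# Git commit: which code produced this model
try:
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True).stdout.strip()
except OSError:
    commit = ""
run.log_param("git_commit", commit or "unknown")
# Metrics: logged per epoch, not only the final value
for epoch, loss in enumerate([0.9, 0.5, 0.3]):
    run.log_metric("loss", loss, step=epoch)
# Artifacts: the model file, evaluation plots, confusion matrix
run.log_artifact("model.pkl")
```

Everything the rule names (parameters, metrics, artifacts, environment, git commit) appears as an explicit call, so a reviewer can verify the run is fully tracked.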
Reproducibility: given the same code, data, and random seed, training should produce the same model. AI rule: 'Set random seeds for all sources of randomness (Python random, NumPy, PyTorch/TensorFlow). Pin all library versions (requirements.txt or poetry.lock). Log the data version (hash of the dataset or version from the data versioning tool). The AI generates reproducibility setup at the start of every training script.' Note that some GPU kernels are nondeterministic by default; bit-exact reproducibility may also require the framework's deterministic mode (e.g. torch.use_deterministic_algorithms(True)), at some performance cost.
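The reproducibility setup the rule describes can be sketched as one function called at the top of every training script; NumPy and PyTorch are seeded only if installed, so the sketch runs anywhere:

```python
import os
import random

def set_seed(seed: int) -> None:
    """Seed every source of randomness we can reach."""
    random.seed(seed)
    # Only affects subprocesses launched after this point
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

# Same seed, same draw: the minimal reproducibility check
set_seed(42)
first = random.random()
set_seed(42)
again = random.random()
```

The seed itself should be logged as a parameter in the experiment tracker, since it is an input to training like any hyperparameter.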
Experiment comparison: the experiment tracker should enable comparing runs. AI rule: 'Log metrics consistently across experiments: same metric names, same evaluation datasets, same evaluation methodology. This enables: comparing models in the experiment tracker UI, automated model selection (pick the model with the best metric on the eval set), and trend analysis (are models improving over time?).'
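Consistent metric names make automated model selection a one-liner. A sketch (with MLflow this would be a `mlflow.search_runs()` query ordered by the metric; the run IDs and metric name here are illustrative):

```python
# All runs log the same metric name ("eval_auc") on the same eval set,
# so they are directly comparable.
runs = [
    {"run_id": "r1", "metrics": {"eval_auc": 0.81}},
    {"run_id": "r2", "metrics": {"eval_auc": 0.86}},
    {"run_id": "r3", "metrics": {"eval_auc": 0.84}},
]

def best_run(runs: list[dict], metric: str) -> dict:
    """Pick the run with the best value for `metric` on the shared eval set."""
    return max(runs, key=lambda r: r["metrics"][metric])

winner = best_run(runs, "eval_auc")
```

If runs logged different metric names or used different eval sets, this comparison would be meaningless, which is exactly why the rule demands consistency.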
A data scientist trains 50 models in a Jupyter notebook and finds the best one, but does not track parameters. Three months later: which hyperparameters produced that model? What data version was used? What library versions? Without experiment tracking: the knowledge is lost. The model cannot be reproduced. The AI must generate experiment tracking calls in every training script — this is the ML equivalent of version control for code.
Feature Engineering and Model Versioning
Feature store: features should be defined once and reused across training and serving. Training-serving skew (features computed differently in training vs serving) is a major source of ML bugs. AI rule: 'Define features in the feature store (Feast entity + feature view). Training: read features from the feature store's offline store (historical data). Serving: read features from the online store (real-time). Same feature definition for both. Never compute features differently in training and serving scripts.'
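A minimal pure-Python illustration of the define-once principle (a real project would express this as a Feast feature view; the feature name and fields here are hypothetical):

```python
def days_since_last_order(now_ts: int, last_order_ts: int) -> float:
    """Single source of truth for the feature. Both the offline
    (training) path and the online (serving) path call this."""
    return (now_ts - last_order_ts) / 86400.0

# Offline/training path: compute over historical rows
def build_training_features(rows: list[dict], as_of_ts: int) -> list[float]:
    return [days_since_last_order(as_of_ts, r["last_order_ts"]) for r in rows]

# Online/serving path: same function, one entity at a time
def serve_feature(entity: dict, now_ts: int) -> float:
    return days_since_last_order(now_ts, entity["last_order_ts"])

rows = [{"last_order_ts": 0}, {"last_order_ts": 86400}]
offline = build_training_features(rows, 2 * 86400)
online = serve_feature(rows[0], 2 * 86400)
```

Because both paths import the same function, there is no second implementation to drift out of sync; a feature store generalizes this by storing the definition and materializing it to offline and online stores.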
Model versioning: every trained model is registered in the model registry with: version number, training experiment ID (link to experiment tracker), evaluation metrics, model artifact location, and status (staging, production, archived). AI rule: 'After training: register the model in the registry. Include: experiment link, metrics, and the training data version. Promote to staging for evaluation. Promote to production after evaluation passes. Archive old versions but do not delete (for rollback).'
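The registry contract can be sketched in a few lines (MLflow's Model Registry exposes the same concepts as versions, stages, and transitions; the experiment IDs and artifact URIs below are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ModelVersion:
    version: int
    experiment_id: str    # link back to the tracked training run
    metrics: dict
    data_version: str     # which data produced this model
    artifact_uri: str
    status: str = "staging"   # staging -> production -> archived

class ModelRegistry:
    """Minimal sketch of a model registry."""
    def __init__(self):
        self.versions = []

    def register(self, **kwargs) -> ModelVersion:
        mv = ModelVersion(version=len(self.versions) + 1, **kwargs)
        self.versions.append(mv)
        return mv

    def promote(self, version: int) -> None:
        # Archive (never delete) the current production model for rollback
        for mv in self.versions:
            if mv.status == "production":
                mv.status = "archived"
        self.versions[version - 1].status = "production"

registry = ModelRegistry()
v1 = registry.register(experiment_id="exp-1", metrics={"auc": 0.81},
                       data_version="d1", artifact_uri="s3://models/1")
registry.promote(1)
v2 = registry.register(experiment_id="exp-2", metrics={"auc": 0.86},
                       data_version="d2", artifact_uri="s3://models/2")
registry.promote(2)
```

Note that promotion archives the previous production version rather than deleting it, so rollback is always a status change away.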
Model evaluation: before deploying a new model, evaluate against the current production model on a held-out test set. AI rule: 'Model promotion to production requires: evaluation on the standard test set, comparison against the current production model (is the new model better?), and evaluation on fairness metrics (does the model perform equitably across demographic groups?). The AI generates evaluation scripts that run as part of the promotion workflow.'
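The promotion checks can be expressed as a single gate function. This is a sketch; the metric name, group structure, and the 0.05 fairness-gap threshold are illustrative assumptions, not prescriptions:

```python
def promotion_gate(candidate: dict, production: dict,
                   fairness_gap_limit: float = 0.05) -> tuple[bool, str]:
    """Candidate must beat production on the shared test set AND
    perform equitably across demographic groups."""
    # Check 1: is the new model better than production?
    if candidate["eval_auc"] <= production["eval_auc"]:
        return False, "candidate does not beat production"
    # Check 2: per-group performance must stay close to overall performance
    gaps = [abs(group_auc - candidate["eval_auc"])
            for group_auc in candidate["group_auc"].values()]
    if max(gaps) > fairness_gap_limit:
        return False, "fairness gap exceeds limit"
    return True, "promote"

candidate = {"eval_auc": 0.86, "group_auc": {"group_a": 0.85, "group_b": 0.84}}
production = {"eval_auc": 0.81}
ok, reason = promotion_gate(candidate, production)
```

Running this gate in the promotion workflow makes the evaluation criteria explicit and auditable instead of living in someone's head.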
Training-serving skew is the #1 cause of ML models performing worse in production than in evaluation. The model was trained with features computed one way (batch, historical data), but served with features computed a different way (real-time, different code path). Feature stores solve this: define the feature once, compute it consistently for both training (offline) and serving (online). The AI should always use the feature store for feature access, never compute features ad-hoc in training or serving scripts.
Model Serving and Monitoring
Model serving patterns: real-time (API endpoint for individual predictions, latency < 100ms), batch (score a dataset on a schedule, latency in minutes/hours), and streaming (process events in real-time, scoring as events arrive). AI rule: 'Match the serving pattern to the use case. Real-time: use a model server (TensorFlow Serving, Triton, BentoML) with the production model loaded. Batch: use the training framework to score the dataset. Streaming: embed the model in the stream processor (Flink, Spark Streaming).'
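All three patterns serve the same registered model artifact; only the wrapper around it changes. A schematic (the Model class is a placeholder for whatever the registry serves):

```python
class Model:
    """Placeholder for a model loaded from the model registry."""
    def predict(self, features: list[float]) -> float:
        return sum(features)  # stand-in scoring logic

model = Model()

# Real-time: one request in, one prediction out (behind an API endpoint)
def handle_request(payload: dict) -> float:
    return model.predict(payload["features"])

# Batch: score an entire dataset on a schedule
def score_batch(dataset: list[dict]) -> list[float]:
    return [model.predict(row["features"]) for row in dataset]

# Streaming: score events one at a time as they arrive
def on_event(event: dict, sink: list) -> None:
    sink.append(model.predict(event["features"]))
```

In production the real-time wrapper is a model server (TensorFlow Serving, Triton, BentoML) and the streaming wrapper runs inside the stream processor, but the invariant holds: one model version, three delivery shapes.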
Model monitoring: deployed models degrade over time as the data distribution shifts (model drift). AI rule: 'Monitor in production: prediction distribution (are predictions shifting?), feature distribution (are input features changing?), ground truth comparison (when labels are available, compare predictions to actuals), and latency/throughput (performance metrics). Alert when drift exceeds thresholds. Automated retraining: trigger when drift is detected.'
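One common drift statistic is the Population Stability Index (PSI) between a baseline sample and a production sample; a minimal sketch (the 0.2 alert threshold in the docstring is a common rule of thumb, not a universal constant):

```python
import math

def psi(expected: list[float], actual: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline distribution and a
    production distribution. Rule of thumb: PSI > 0.2 signals drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def hist(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        # eps avoids log(0) for empty bins
        return [c / len(xs) + eps for c in counts]

    p, q = hist(expected), hist(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

baseline = [i / 100 for i in range(100)]        # training-time distribution
same = psi(baseline, baseline)                   # no drift
shifted = psi(baseline, [x + 0.5 for x in baseline])  # distribution moved
```

Computing PSI per feature and per prediction stream, then alerting when it crosses the threshold, is the concrete form of the monitoring rule above.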
A/B testing: compare a new model against the current model in production with real traffic. AI rule: 'A/B testing infrastructure: route a percentage of traffic to the new model (canary), measure business metrics (click-through rate, conversion, revenue — not just ML metrics), and promote or rollback based on business impact. The AI generates the traffic routing configuration and metric collection alongside the model deployment.'
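The traffic-routing half of this can be sketched as a deterministic hash split (the user-ID format and 10% canary share are illustrative):

```python
import hashlib

def assign_variant(user_id: str, canary_pct: int = 10) -> str:
    """Deterministic traffic split: the same user always hits the
    same model, so metrics are not polluted by flip-flopping."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < canary_pct else "production"

assignments = [assign_variant(f"user-{i}") for i in range(1000)]
share = assignments.count("candidate") / len(assignments)  # roughly 0.10
```

Each prediction is then logged with its variant label so that downstream business metrics (click-through, conversion, revenue) can be attributed to the model version that produced them, which is what the promote-or-rollback decision runs on.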
A new model has 2% better accuracy on the test set. Should it be deployed? Not necessarily. What matters: does it improve business metrics (conversion rate, revenue, user satisfaction)? A model with better accuracy might recommend products that users click but do not buy (high accuracy, low revenue). A/B testing with business metrics is the only way to know if a model improvement translates to business value. The AI should generate A/B test configurations that measure business metrics, not just ML metrics.
ML Engineering AI Rules Summary
Summary of AI rules for ML engineering teams building and deploying machine learning systems.
- Experiment tracking: log params, metrics, artifacts, environment, and git commit for every run
- Reproducibility: random seeds, pinned versions, data versioning. Same inputs = same model
- Feature store: define features once. Same definition for training and serving. Prevent training-serving skew
- Model registry: version, experiment link, metrics, status (staging/production/archived)
- Evaluation: compare against production model. Fairness metrics. Must pass before promotion
- Serving: real-time (model server), batch (scheduled scoring), streaming (embedded in processor)
- Monitoring: prediction drift, feature drift, ground truth comparison. Alert on threshold breach
- A/B testing: canary traffic, business metrics (not just ML metrics), promote or rollback