# Best Practices
Follow industry best practices for experiment tracking, reproducibility, model management, monitoring, and avoiding common ML mistakes.
## Experiment Tracking
Track every experiment systematically to compare results and reproduce successes:
- Log all hyperparameters, metrics, and artifacts for every run.
- Use tools like MLflow, Weights & Biases, or Neptune.ai.
- Tag experiments with meaningful names and descriptions.
- Save the exact code version (git commit hash) used for each experiment.
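The points above can be sketched tool-agnostically. This is a minimal, illustrative run logger (the run name, parameters, and metric values are made up); in practice a tracker such as MLflow or Weights & Biases replaces the JSONL file:

```python
# Per-run experiment logging sketch: every run records its hyperparameters,
# metrics, and git commit hash to an append-only JSONL file.
import json
import subprocess
import time
from pathlib import Path

def current_git_commit():
    """Return the current git commit hash, or 'unknown' outside a repo."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError, OSError):
        return "unknown"

def log_run(name, params, metrics, log_file="runs.jsonl"):
    """Append one experiment record so runs stay comparable later."""
    record = {
        "name": name,
        "timestamp": time.time(),
        "git_commit": current_git_commit(),
        "params": params,
        "metrics": metrics,
    }
    with Path(log_file).open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

run = log_run(
    "baseline-lr-0.01",
    params={"learning_rate": 0.01, "batch_size": 64},
    metrics={"val_accuracy": 0.87},
)
```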
## Reproducibility

Fix every source of randomness and pin dependencies so results can be repeated exactly:

```python
import random

import numpy as np
import torch

def set_seed(seed=42):
    """Set all random seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN operations.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

# Pin exact library versions:
# pip freeze > requirements.txt
```
## Feature Store
## Model Registry

A model registry tracks model versions through their lifecycle:

```python
import mlflow

# Register a model from a completed run
mlflow.register_model("runs:/<run_id>/model", "ProductionClassifier")

# Transition model stages
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="ProductionClassifier",
    version=3,
    stage="Production",
)
```
## A/B Testing Models
Before fully replacing a model in production, run A/B tests:
- Route a small percentage of traffic to the new model (canary deployment).
- Compare business metrics (not just ML metrics) between old and new models.
- Ensure statistical significance before declaring a winner.
- Monitor for unexpected side effects on downstream systems.
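The significance check in the steps above can be sketched with a two-proportion z-test, stdlib only. The traffic counts and conversion rates below are made up for illustration:

```python
# Compare conversion rates of the old and new model under H0: equal rates.
import math

def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for a difference in proportions."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Old model serves 95% of traffic; new model gets a 5% canary slice.
z, p = two_proportion_z_test(success_a=1900, total_a=19000,  # old: 10.0% CVR
                             success_b=115, total_b=1000)    # new: 11.5% CVR
significant = p < 0.05
```

Here the observed lift is not yet significant at the 5% level, so the canary should keep running rather than be declared a winner.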
## Monitoring Drift

Production models degrade as the world changes; watch for three kinds of drift:
| Type | What Changes | Detection |
|---|---|---|
| Data Drift | Input feature distributions | KS test, PSI, distribution plots |
| Concept Drift | Relationship between X and y | Monitor prediction accuracy over time |
| Label Drift | Target variable distribution | Track target distribution changes |
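The PSI check from the table can be sketched with NumPy alone; the data here is synthetic, and the 0.1 / 0.25 thresholds are common rules of thumb rather than hard limits:

```python
# Population Stability Index (PSI) between a baseline sample and a live
# sample of a single feature, binned on the baseline distribution.
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Return the PSI between baseline ('expected') and live ('actual')."""
    # Bin edges come from the baseline (training-time) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) and division by zero for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct)
                        * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
shifted = rng.normal(0.5, 1.0, 10_000)   # simulated data drift

psi_same = population_stability_index(baseline, baseline)
psi_drift = population_stability_index(baseline, shifted)
# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
```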
## Common Mistakes
- Data leakage: Using test data information during training (e.g., fitting scaler on full dataset before splitting).
- Not stratifying: Random splits can create unbalanced folds for imbalanced datasets.
- Optimizing the wrong metric: Accuracy on imbalanced data is misleading.
- No baseline model: Always compare against a simple baseline (majority class, mean prediction).
- Ignoring feature importance: Understanding which features matter leads to better models.
- Not versioning data: Code versioning without data versioning breaks reproducibility.
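Two of the points above, stratified splits and baseline comparison, can be sketched together with scikit-learn on synthetic data (the dataset and thresholds are made up for illustration):

```python
# Stratify the split and always compare against a trivial baseline.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
# Imbalanced synthetic target (~19% positives), driven mostly by feature 0.
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 1.0).astype(int)

# stratify=y keeps the class ratio identical in train and test folds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
# The model only adds value if it clearly beats the majority-class baseline.
```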
## Frequently Asked Questions
**When should I use deep learning instead of traditional ML?**

Use traditional ML (sklearn, XGBoost) for tabular data, small-to-medium datasets, and when interpretability matters. Use deep learning for images, text, audio, and very large datasets where feature engineering is impractical.
**How often should I retrain models in production?**

It depends on how fast your data distribution changes. E-commerce recommendations may need daily retraining, while fraud detection might retrain weekly. Monitor performance metrics and retrain when they degrade below your threshold.
**Should I use PyTorch or TensorFlow?**

PyTorch is more popular in research and increasingly in industry due to its Pythonic API and dynamic graphs. TensorFlow has stronger deployment tools (TF Serving, TFLite). For new projects, PyTorch is generally recommended unless you need specific TF ecosystem features.
**How do I handle imbalanced datasets?**

Options include: oversampling the minority class (SMOTE), undersampling the majority class, using class weights in the loss function, using stratified sampling for train/test splits, or optimizing for F1/AUC instead of accuracy.
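One of these options, class weights in the loss, can be sketched with scikit-learn's `class_weight="balanced"`, which weights classes inversely to their frequency. The imbalanced dataset below is synthetic:

```python
# Effect of class weights on an imbalanced binary problem (~10% positives).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
# Noisy threshold on feature 0 produces a rare positive class.
y = (X[:, 0] + 0.7 * rng.normal(size=2000) > 1.6).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

plain = LogisticRegression().fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# The weighted model trades precision for recall on the minority class;
# evaluate with F1 rather than accuracy.
f1_plain = f1_score(y_test, plain.predict(X_test))
f1_weighted = f1_score(y_test, weighted.predict(X_test))
```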
**What is data leakage and how do I prevent it?**

Data leakage occurs when information from the test set "leaks" into training. Common causes: fitting preprocessors on the full dataset, using future data for prediction, or including target-derived features. Prevention: always split data first, use sklearn Pipelines, and think carefully about what information would be available at prediction time.
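The prevention advice can be sketched as follows: split first, then let a Pipeline fit the scaler on training data only, so no test-set statistics leak into training (the data here is synthetic):

```python
# Leakage-safe preprocessing with a scikit-learn Pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

# 1. Split BEFORE any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. The pipeline fits the scaler on X_train only; at predict time it
#    reuses those training statistics to transform X_test.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
```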
Lilly Tech Systems