Model Training
Master experiment tracking, hyperparameter tuning, distributed training, and model versioning for production ML.
Experiment Tracking
Experiment tracking is the practice of recording everything about your ML experiments: parameters, metrics, code, data, and artifacts. Without it, you can't compare runs, reproduce results, or collaborate effectively.
MLflow Tracking
```python
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("customer-churn")

with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters
    params = {"n_estimators": 100, "max_depth": 10, "min_samples_split": 5}
    mlflow.log_params(params)

    # Train model
    model = RandomForestClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Log artifacts (plots, data samples, etc.)
    mlflow.log_artifact("confusion_matrix.png")
```
Weights & Biases (W&B)
```python
import wandb

wandb.init(project="customer-churn", name="rf-baseline", config={
    "n_estimators": 100,
    "max_depth": 10,
    "min_samples_split": 5,
})

# Training loop (for deep learning)
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader)
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "val_accuracy": val_acc,
        "epoch": epoch,
    })

wandb.finish()
```
| Tool | Type | Key Strength | Best For |
|---|---|---|---|
| MLflow | Open-source | Full lifecycle, model registry | Teams wanting open-source, self-hosted |
| W&B | SaaS / Self-hosted | Visualization, collaboration | Deep learning teams, research |
| Neptune | SaaS | Metadata management, scalability | Large teams, production ML |
| CometML | SaaS / Self-hosted | Code tracking, diff comparison | Teams wanting code reproducibility |
Hyperparameter Tuning
Systematically search for the best hyperparameter configuration:
Optuna
```python
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 15),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    return scores.mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print(f"Best F1: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
```
Ray Tune
```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

search_space = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([16, 32, 64, 128]),
    "hidden_size": tune.choice([64, 128, 256]),
    "num_layers": tune.randint(1, 5),
}

scheduler = ASHAScheduler(max_t=100, grace_period=10, reduction_factor=2)

result = tune.run(
    train_model,
    config=search_space,
    num_samples=50,
    scheduler=scheduler,
    resources_per_trial={"cpu": 2, "gpu": 1},
)
```
Distributed Training
When datasets or models are too large for a single machine, distributed training splits the work across multiple GPUs or nodes.
- Data parallelism: Same model on each GPU, different data batches. Gradients are averaged.
- Model parallelism: Different parts of the model on different GPUs. For models that don't fit on one GPU.
- Pipeline parallelism: Model split into stages, micro-batches flow through the pipeline.
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def train(rank, world_size):
    setup(rank, world_size)
    model = MyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)
    for epoch in range(num_epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = ddp_model(batch)  # assumes the model's forward returns the loss
            loss.backward()          # DDP averages gradients across ranks here
            optimizer.step()
    dist.destroy_process_group()

# Launch: torchrun --nproc_per_node=4 train.py
```
GPU Management
Efficient GPU management is crucial for cost and performance:
- GPU utilization monitoring: Use `nvidia-smi`, GPU dashboards, or tools like `gpustat`.
- Mixed precision training: Use FP16/BF16 to reduce memory and increase throughput (often a 2-3x speedup).
- Gradient accumulation: Simulate larger batch sizes without more GPU memory.
- GPU scheduling: Use Kubernetes GPU operators or SLURM for multi-user GPU clusters.
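Gradient accumulation from the list above can be sketched in a few lines of PyTorch. The linear model, the random batches, and the `accum_steps` value are illustrative placeholders; the point is that `backward()` adds into `.grad`, so scaling each loss by `accum_steps` makes four small batches behave like one batch four times the size:

```python
import torch
import torch.nn as nn

# Toy stand-ins: a linear model and random mini-batches
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
accum_steps = 4  # effective batch size = accum_steps * per-step batch size

batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated grads average
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per accum_steps batches
        optimizer.zero_grad()
```

The same loop shape works with mixed precision: wrap the forward pass in `torch.autocast` and step through a `GradScaler`, leaving the accumulation logic unchanged.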
Model Registry
A model registry is a central hub for managing model versions, stages, and metadata:
1. Register: After training, register the model with its metrics, parameters, and artifacts.
2. Version: Each registration creates a new version. Track lineage back to data and code.
3. Stage: Move models through stages: None → Staging → Production → Archived.
4. Approve: Require human or automated approval before promotion to production.
Reproducibility
Pin every source of randomness and your environment so runs can be repeated exactly:

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Set all random seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

# Also pin your environment:
# pip freeze > requirements.txt
# or use: conda env export > environment.yml
```