Intermediate

Model Training

Master experiment tracking, hyperparameter tuning, distributed training, and model versioning for production ML.

Experiment Tracking

Experiment tracking is the practice of recording everything about your ML experiments: parameters, metrics, code, data, and artifacts. Without it, you can't compare runs, reproduce results, or collaborate effectively.

MLflow Tracking

Python — Experiment tracking with MLflow
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("customer-churn")

with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters
    params = {"n_estimators": 100, "max_depth": 10, "min_samples_split": 5}
    mlflow.log_params(params)

    # Train model (X_train, y_train, X_test, y_test are assumed to be prepared beforehand)
    model = RandomForestClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Log artifacts (plots, data samples, etc.)
    mlflow.log_artifact("confusion_matrix.png")

Weights & Biases (W&B)

Python — Experiment tracking with W&B
import wandb

wandb.init(project="customer-churn", name="rf-baseline", config={
    "n_estimators": 100,
    "max_depth": 10,
    "min_samples_split": 5,
})

# Training loop (for deep learning); num_epochs, train_one_epoch, and evaluate are assumed defined
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader)

    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "val_accuracy": val_acc,
        "epoch": epoch,
    })

wandb.finish()

Tool    | Type               | Key Strength                     | Best For
MLflow  | Open-source        | Full lifecycle, model registry   | Teams wanting open-source, self-hosted
W&B     | SaaS / Self-hosted | Visualization, collaboration     | Deep learning teams, research
Neptune | SaaS               | Metadata management, scalability | Large teams, production ML
CometML | SaaS / Self-hosted | Code tracking, diff comparison   | Teams wanting code reproducibility

Hyperparameter Tuning

Systematically search for the best hyperparameter configuration:

Optuna

Python — Hyperparameter tuning with Optuna
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 15),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
    }

    model = GradientBoostingClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    return scores.mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print(f"Best F1: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

Ray Tune

Python — Distributed tuning with Ray Tune
from ray import tune
from ray.tune.schedulers import ASHAScheduler

search_space = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([16, 32, 64, 128]),
    "hidden_size": tune.choice([64, 128, 256]),
    "num_layers": tune.randint(1, 5),
}

scheduler = ASHAScheduler(max_t=100, grace_period=10, reduction_factor=2)

result = tune.run(
    train_model,
    config=search_space,
    num_samples=50,
    scheduler=scheduler,
    resources_per_trial={"cpu": 2, "gpu": 1},
)

Distributed Training

When datasets or models are too large for a single machine, distributed training splits the work across multiple GPUs or nodes.

💡
Types of parallelism:
  • Data parallelism: Same model on each GPU, different data batches. Gradients are averaged.
  • Model parallelism: Different parts of the model on different GPUs. For models that don't fit on one GPU.
  • Pipeline parallelism: Model split into stages, micro-batches flow through the pipeline.
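The gradient-averaging step in data parallelism can be sketched with NumPy: each "GPU" computes a gradient on its own shard, and, for equal-sized shards, the all-reduced average equals the full-batch gradient. The linear model and squared-error loss here are illustrative, not part of any particular framework.

Python — Data-parallel gradient averaging (NumPy sketch)

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w = rng.normal(size=3)

def grad(Xb, yb, w):
    # Gradient of mean squared error for a linear model.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Each "GPU" computes a gradient on its own shard; all-reduce averages them.
shard_grads = [grad(Xs, ys, w) for Xs, ys in zip(np.split(X, 4), np.split(y, 4))]
avg_grad = np.mean(shard_grads, axis=0)

# With equal-sized shards, the averaged gradient matches the full-batch gradient.
assert np.allclose(avg_grad, grad(X, y, w))
```

This equivalence is why synchronous data parallelism produces the same update as training with one large batch.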
Python — PyTorch Distributed Data Parallel
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment;
    # init_process_group picks them up automatically.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # MyModel, train_loader, and num_epochs are assumed defined elsewhere.
    model = MyModel().to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)

    for epoch in range(num_epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = ddp_model(batch)  # assumes the model's forward returns the loss
            loss.backward()          # DDP all-reduces (averages) gradients here
            optimizer.step()

    dist.destroy_process_group()

# Launch: torchrun --nproc_per_node=4 train.py

GPU Management

Efficient GPU management is crucial for cost and performance:

  • GPU utilization monitoring: Use nvidia-smi, GPU dashboards, or tools like gpustat.
  • Mixed precision training: Use FP16/BF16 to cut memory use and raise throughput (often a 2-3x speedup on modern GPUs).
  • Gradient accumulation: Simulate larger batch sizes without more GPU memory.
  • GPU scheduling: Use Kubernetes GPU operators or SLURM for multi-user GPU clusters.
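Gradient accumulation from the list above can be sketched in a few lines of PyTorch. This is a minimal CPU sketch; the linear model, data, learning rate, and accumulation factor are all illustrative.

Python — Gradient accumulation (sketch)

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4   # one optimizer step per 4 micro-batches
micro_batch = 4   # effective batch size = accum_steps * micro_batch = 16

data, targets = torch.randn(16, 4), torch.randn(16, 1)
before = model.weight.detach().clone()  # snapshot to verify the update below

optimizer.zero_grad()
for i, (x, y) in enumerate(zip(data.split(micro_batch), targets.split(micro_batch))):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()   # scale so summed grads equal the mean
    if (i + 1) % accum_steps == 0:
        optimizer.step()              # update once per accum_steps micro-batches
        optimizer.zero_grad()
```

After one pass over the 16 samples, the model has taken exactly one optimizer step, as if it had seen a single batch of 16, while only ever holding 4 samples' activations in memory.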

Model Registry

A model registry is a central hub for managing model versions, stages, and metadata:

  1. Register

    After training, register the model with its metrics, parameters, and artifacts.

  2. Version

    Each registration creates a new version. Track lineage back to data and code.

  3. Stage

    Move models through stages: None → Staging → Production → Archived.

  4. Approve

    Require human or automated approval before promotion to production.
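The four steps above map roughly onto MLflow's model registry API. This is a hedged sketch, not a runnable script: the churn-classifier name, the <run_id> placeholder, and the sqlite URI are illustrative, and the registry requires a database-backed tracking store.

Python — Registering and promoting a model (MLflow sketch)

```python
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("sqlite:///mlflow.db")  # registry needs a DB-backed store
client = MlflowClient()

# 1-2. Register: creates version 1 from a logged run's model artifact;
#      re-registering under the same name bumps the version.
version = mlflow.register_model("runs:/<run_id>/model", "churn-classifier")

# 3. Stage: move the new version through the lifecycle.
#    (Recent MLflow releases deprecate stages in favor of aliases.)
client.transition_model_version_stage(
    name="churn-classifier", version=version.version, stage="Staging"
)

# 4. Approve/promote: after validation passes, point the production alias at it.
client.set_registered_model_alias("churn-classifier", "production", version.version)
```

The promotion call would typically sit behind whatever human or automated approval gate the team requires.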

Reproducibility

Python — Ensuring reproducibility
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Set all random seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

# Also pin your environment:
# pip freeze > requirements.txt
# or use: conda env export > environment.yml
Note: Perfect reproducibility across different hardware (e.g., different GPU models) is extremely difficult due to floating-point non-determinism in parallel operations. Document the hardware used for each experiment.