
Best Practices & Checklist

Everything you need to review before launching a training pipeline in production. This lesson consolidates the entire course into actionable checklists, cost optimization strategies, debugging guides, and answers to the most common questions engineers have about production ML training.

Production Training Pipeline Checklist

Use this checklist every time you build or modify a training pipeline. Each item maps to a lesson in this course.

Data Layer

  • Data validation runs before training — schema checks, volume checks, drift detection (Lesson 2)
  • Data is versioned — every training run links to a specific, immutable dataset snapshot
  • Splits are deterministic and time-aware — no future data leaking into training set
  • Class imbalance is handled — class weights, focal loss, or oversampling strategy in place
  • Data pipeline has monitoring — alerts on volume drops, schema changes, or missing data
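The data-layer checks above can be sketched as plain functions. This is a minimal, dependency-free illustration — in practice you would use a tool like Great Expectations or TFDV, and the threshold values here are placeholders to tune for your data:

```python
# Minimal pre-training data checks: schema, volume, and drift.
# Thresholds are illustrative placeholders, not recommendations.

def check_schema(rows, required):
    """Every row must contain every required column with the right type."""
    for row in rows:
        for col, typ in required.items():
            assert col in row, f"missing column: {col}"
            assert isinstance(row[col], typ), f"{col} is not {typ.__name__}"

def check_volume(rows, expected, tolerance=0.5):
    """Alert if today's batch is far smaller than the expected volume."""
    assert len(rows) >= expected * tolerance, (
        f"volume drop: got {len(rows)}, expected ~{expected}"
    )

def check_drift(values, ref_mean, max_shift=0.2):
    """Crude drift check: mean shift relative to a reference window."""
    mean = sum(values) / len(values)
    assert abs(mean - ref_mean) <= max_shift * abs(ref_mean), (
        f"drift: mean moved from {ref_mean:.3f} to {mean:.3f}"
    )

rows = [{"amount": 12.5, "label": 0}, {"amount": 9.1, "label": 1}]
check_schema(rows, {"amount": float, "label": int})
check_volume(rows, expected=2)
check_drift([r["amount"] for r in rows], ref_mean=10.0)
```

Running these before the training step — and failing loudly — is what turns "garbage in, garbage out" into a blocked pipeline instead of a bad model.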

Training Layer

  • Training is reproducible — git commit, data version, environment, and seeds are logged (Lesson 1)
  • Checkpointing is enabled — save every N epochs/steps so you can resume after failures
  • Distributed training is configured correctly — DDP for single-node, FSDP/DeepSpeed for large models (Lesson 3)
  • Mixed precision is enabled — BF16 or FP16 for 2x throughput improvement
  • Gradient clipping is set — prevents training instability from gradient explosions
  • Learning rate schedule includes warmup — especially important for distributed training
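Warmup plus decay can be written as a simple multiplier function. This is a hand-rolled sketch with illustrative step counts — in PyTorch you would typically pass such a function to `torch.optim.lr_scheduler.LambdaLR` or use a library scheduler:

```python
def lr_lambda(step, warmup_steps=1000, total_steps=10000):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear warmup: ramp from 0 up to the base LR
        return step / warmup_steps
    # Linear decay from the base LR down to 0 after warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Halfway through warmup, end of warmup, end of training
print(lr_lambda(500), lr_lambda(1000), lr_lambda(10000))  # 0.5 1.0 0.0
```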

Tracking Layer

  • All experiments are tracked — params, metrics, artifacts logged to MLflow/W&B (Lesson 4)
  • Model registry is configured — staging/production stages with promotion gates
  • Evaluation compares against baseline — every new model is tested against current production model
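"Compares against baseline" can be as simple as a gate that refuses promotion unless the candidate beats the production model on every tracked metric. A minimal sketch — the metric names and margin are invented for illustration:

```python
def should_promote(candidate, baseline, min_gain=0.0):
    """Promote only if the candidate beats the current production model
    on every tracked metric (higher is better) by at least min_gain."""
    return all(candidate[m] >= baseline[m] + min_gain for m in baseline)

baseline = {"accuracy": 0.91, "f1": 0.88}
candidate = {"accuracy": 0.93, "f1": 0.89}
print(should_promote(candidate, baseline))  # True
```

In MLflow or W&B, the same comparison runs against metrics pulled from the registry's current production version rather than hard-coded dicts.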

Infrastructure Layer

  • GPU scheduling uses a batch scheduler — Kueue or Volcano, not default K8s scheduler (Lesson 5)
  • Resource quotas are set per team — guaranteed allocation + borrowing + preemption
  • Cost tracking is in place — per-team, per-job GPU cost reporting
  • Spot instances are used for experiments — with checkpointing every 15-30 minutes

Deployment Layer

  • CI/CD pipeline automates the full lifecycle — from data validation to deployment (Lesson 6)
  • Quality gates block bad models — accuracy, latency, fairness, model size checks
  • Rollback is tested and automated — previous model version loads in under 60 seconds
  • Canary deployment is configured — 5% traffic test before full rollout
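The gate list above maps directly to a set of boolean checks a CI job can run before deployment. A sketch with made-up thresholds — in a real pipeline these values come from your eval jobs and model registry:

```python
def run_quality_gates(metrics):
    """Return the list of failed gates; an empty list means safe to deploy."""
    gates = {
        "accuracy":       metrics["accuracy"] >= 0.90,     # absolute floor
        "p99_latency_ms": metrics["p99_latency_ms"] <= 50,  # serving budget
        "fairness_gap":   metrics["fairness_gap"] <= 0.05,  # max group disparity
        "model_size_mb":  metrics["model_size_mb"] <= 500,  # deploy target limit
    }
    return [name for name, passed in gates.items() if not passed]

failed = run_quality_gates({
    "accuracy": 0.93, "p99_latency_ms": 38,
    "fairness_gap": 0.02, "model_size_mb": 240,
})
print(failed)  # [] -> all gates passed
```

The CI step then fails the build if the returned list is non-empty, which is what "quality gates block bad models" means in practice.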

Cost Optimization Strategies

Training costs can spiral quickly. Here are the most impactful cost reduction strategies, ranked by typical savings.

| Strategy | Typical Savings | Effort | Trade-offs |
| --- | --- | --- | --- |
| Spot/preemptible instances | 60-70% | Low (add checkpointing) | Jobs may be interrupted; need good resume logic |
| Mixed precision (BF16) | 40-50% (2x throughput) | Low (one line of code) | Rare numerical issues with some architectures |
| Right-size GPU selection | 30-50% | Medium (profiling) | Requires understanding memory/compute needs |
| Gradient accumulation | 20-40% (fewer GPUs) | Low | Slower per step; equivalent to a larger batch size |
| Early stopping | 20-50% (fewer epochs) | Low | Must monitor correctly to avoid underfitting |
| Efficient data loading | 10-30% (less GPU idle time) | Medium | Requires profiling DataLoader bottlenecks |
| Reserved instances | 30-40% (1-year commit) | None (procurement) | Capital commitment; less flexibility |
```python
# Quick wins: cost optimization in training code

# 1. Mixed precision - ~2x throughput almost for free.
# Note: GradScaler is only needed for FP16; BF16 has enough dynamic
# range that you can skip loss scaling entirely.
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    loss = model(**batch).loss

# 2. Gradient accumulation - large effective batch on fewer GPUs
accumulation_steps = 4  # Simulates 4x the batch size
for i, batch in enumerate(loader):
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# 3. torch.compile - 10-30% speedup with no model-code changes
model = torch.compile(model, mode="max-autotune")

# 4. Efficient DataLoader - keep the GPU from idling on data
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # Match available CPU cores
    pin_memory=True,          # Faster CPU->GPU transfer
    prefetch_factor=2,        # Prefetch 2 batches per worker
    persistent_workers=True,  # Don't respawn workers each epoch
)

# 5. Checkpoint periodically for spot-instance resilience.
# Note: torch.save cannot write to s3:// URLs directly - save to a
# local path, then sync to durable storage (boto3, aws s3 cp, fsspec).
if step % 500 == 0:
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, f"checkpoints/run-{run_id}/step-{step}.pt")
```

Debugging Training Failures

Training failures are inevitable. Here is a systematic debugging guide for the most common issues.

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Loss is NaN | Learning rate too high, bad data (NaN/Inf in features), or numeric overflow | Reduce LR by 10x, add data validation, enable gradient clipping, check for division by zero |
| Loss plateaus immediately | Learning rate too low, frozen layers, or broken data pipeline returning the same batch | Increase LR, verify requires_grad=True, add data logging to confirm batch variety |
| CUDA OOM | Batch size too large, model too large, or memory leak | Reduce batch size, use gradient accumulation, enable FSDP, check for tensor accumulation in lists |
| DDP hangs | One rank crashed silently, or mismatched forward pass (some ranks skip a branch) | Set NCCL_DEBUG=INFO, add a timeout to dist.barrier(), ensure all ranks execute the same operations |
| Training slow | Data loading bottleneck, GPU underutilized, or excessive logging | Profile with torch.profiler, increase num_workers, reduce logging frequency |
| Val loss goes up | Overfitting, data leakage in splits, or LR too high late in training | Add regularization (dropout, weight decay), verify split integrity, add LR decay |
```python
# Debug toolkit - add to every training script

# 1. Detect NaN/Inf early
torch.autograd.set_detect_anomaly(True)  # Slow, but pinpoints the NaN source

# 2. Monitor GPU memory
def log_gpu_memory(step):
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    max_allocated = torch.cuda.max_memory_allocated() / 1e9
    print(f"[Step {step}] GPU Memory: allocated={allocated:.1f}GB, "
          f"reserved={reserved:.1f}GB, peak={max_allocated:.1f}GB")

# 3. Profile training bottlenecks
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(loader):
        if step >= 10:
            break
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()

# View with: tensorboard --logdir=./profiler_logs
```
💡 The 80/20 rule of debugging: 80% of training failures come from data issues (wrong format, NaN values, label errors), not model or infrastructure issues. Always check your data first. Add a sanity check that trains on 100 samples — if the model cannot overfit a tiny dataset, the problem is in your code, not your data or hyperparameters.
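The tiny-dataset sanity check works with any framework. Here is the idea in dependency-free Python with a one-parameter model — if even this cannot drive training loss to near zero on a handful of points, the training loop itself is broken:

```python
# Sanity check: a working training loop must overfit a tiny dataset.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # perfectly learnable: y = 2x

w, lr = 0.0, 0.01
for _ in range(500):
    # Gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
assert loss < 1e-6, "loop cannot overfit - suspect the code, not the data"
print(f"w={w:.4f}")  # converges to ~2.0
```

The same pattern applies to a real model: take ~100 samples, disable augmentation and regularization, and confirm training loss collapses to near zero before debugging anything else.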

Frequently Asked Questions

How often should I retrain my model?

It depends on how fast your data distribution changes. For recommendation systems and ad models, daily or weekly retraining is common because user behavior shifts rapidly. For fraud detection, weekly to monthly is typical. For NLP models on stable domains, monthly to quarterly. The right answer is: retrain when your monitoring shows performance degradation. Start with weekly, measure drift, and adjust. Never retrain more often than your data meaningfully changes.
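"Retrain when monitoring shows degradation" can be encoded as a small trigger policy. A sketch — the metric names and thresholds are placeholders for whatever your monitoring stack actually reports:

```python
from datetime import datetime, timedelta

def should_retrain(last_trained, now, metric_drop, drift_score,
                   max_age=timedelta(days=7)):
    """Retrain on schedule, on performance degradation, or on drift."""
    return (
        now - last_trained > max_age  # scheduled weekly retrain
        or metric_drop > 0.02         # >2 pt drop vs. training-time eval
        or drift_score > 0.1          # distribution shift detected
    )

print(should_retrain(datetime(2026, 1, 1), datetime(2026, 1, 3),
                     metric_drop=0.035, drift_score=0.0))  # True (metric drop)
```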

DDP vs FSDP vs DeepSpeed — which should I use?

DDP if your model fits on a single GPU (most models under 3B parameters on A100 80GB). It is the simplest and most efficient option. FSDP if you are using PyTorch and your model does not fit on one GPU. It is now the recommended PyTorch-native solution. DeepSpeed if you need ZeRO-Offload (CPU offloading), have very large models (70B+), or are using the HuggingFace ecosystem which has deep DeepSpeed integration. In practice, FSDP and DeepSpeed ZeRO-2/3 achieve similar performance for most workloads.

How do I handle checkpointing for spot instances?

Save checkpoints to durable storage (S3, GCS) every 15-30 minutes. Each checkpoint should include: model state, optimizer state, learning rate scheduler state, current epoch/step, random number generator states, and the data loader position. On restart, load the latest checkpoint and resume. With FSDP/DeepSpeed, use their built-in distributed checkpointing which handles sharded state across GPUs. Budget 2-5 minutes per checkpoint for large models, and ensure your checkpoint frequency is shorter than the average spot reclamation interval (~2 hours on AWS).
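The resume side of the pattern is worth sketching too. Here it is with stdlib `pickle` and a local path so the snippet is self-contained; with PyTorch you would use `torch.save`/`torch.load` (and the distributed checkpoint APIs of FSDP/DeepSpeed) against durable storage:

```python
import os
import pickle
import random

CKPT = "/tmp/ckpt_demo.pkl"

def save_checkpoint(step, model_state, optimizer_state):
    state = {
        "step": step,
        "model": model_state,
        "optimizer": optimizer_state,
        "rng": random.getstate(),  # so shuffling resumes identically
    }
    with open(CKPT, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint():
    if not os.path.exists(CKPT):
        return None  # fresh start
    with open(CKPT, "rb") as f:
        return pickle.load(f)

save_checkpoint(500, {"w": 1.23}, {"momentum": 0.9})
ckpt = load_checkpoint()
start_step = ckpt["step"] + 1 if ckpt else 0
if ckpt:
    random.setstate(ckpt["rng"])
print(start_step)  # 501
```

The key habit is that the training loop always starts with "try to load a checkpoint" — a fresh run and a spot-interrupted run then share the same code path.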

What is the most cost-effective GPU for training in 2026?

For fine-tuning and models under 7B: NVIDIA L4 or A10G on spot instances — best price/performance for smaller workloads. For full training of 7B-13B models: A100 80GB remains the sweet spot due to mature ecosystem and competitive spot pricing. For large-scale training (70B+): H100 80GB is worth the premium due to 3x throughput over A100. For budget-constrained teams: AMD MI250X on cloud providers that support it (30-40% cheaper than equivalent NVIDIA). Always benchmark your specific workload — raw TFLOPS do not tell the whole story.

How do I prevent data leakage in my pipeline?

The most common data leakage sources: (1) Temporal leakage — using future data to predict the past. Always split by time, not randomly. (2) Target leakage — a feature that directly encodes the label (e.g., including "refund_amount" when predicting if a transaction is fraudulent). Remove features derived from the target. (3) Group leakage — same user/entity appearing in both train and test sets. Use GroupKFold or StratifiedGroupKFold. (4) Preprocessing leakage — fitting normalizers or encoders on the full dataset including test data. Always fit on train only, transform on test. Add assertions in your pipeline that verify no overlap between splits.
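The group-leakage and temporal-leakage checks are cheap to assert directly. A minimal sketch using plain sets — in scikit-learn, `GroupKFold` enforces the same group property for you at split time:

```python
def assert_no_group_leakage(train_groups, test_groups):
    """Fail loudly if any user/entity appears in both splits."""
    overlap = set(train_groups) & set(test_groups)
    assert not overlap, f"group leakage: {sorted(overlap)} in both splits"

def assert_time_split(train_times, test_times):
    """Fail if any training example is newer than the oldest test example."""
    assert max(train_times) < min(test_times), "temporal leakage in splits"

assert_no_group_leakage(["u1", "u2"], ["u3", "u4"])
assert_time_split([1, 2, 3], [4, 5])
```

Run these as pipeline assertions after every split, not as one-off notebook checks — leakage tends to reappear whenever the split logic changes.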

Should I use Airflow, Kubeflow, Prefect, or Dagster?

Use Airflow if your company already uses it for data engineering and you want one orchestrator for everything. It works but is not ML-native. Use Kubeflow Pipelines if you are all-in on Kubernetes and GCP (Vertex AI), and need built-in GPU scheduling. The learning curve is steep. Use Prefect if you want the fastest time to production with the least complexity — best for small to medium ML teams. Use Dagster if you care about data lineage and asset-centric workflows — best when your ML pipeline is tightly coupled with data engineering. Start with the simplest option that meets your needs.

How do I calculate my training cost per model?

Formula: (num_GPUs x GPU_hourly_rate x training_hours) + (storage_GB x storage_rate) + (data_transfer_GB x transfer_rate). Example: Fine-tuning Llama-3-8B on 8x A100 spot instances for 4 hours = 8 x $1.10/hr x 4h = $35.20 compute + ~$2 storage = ~$37 total. For full pretraining of a 70B model: 256x H100 for 2 weeks = 256 x $3.75/hr x 336h = $322,560. Always track actual costs per run in your experiment tracker and compare against estimates. Include failed runs in your total cost calculation — they typically add 20-30% overhead.
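The formula above as a small helper, reproducing the two worked examples (the hourly rates are the illustrative figures from the text, not current quotes):

```python
def training_cost(num_gpus, gpu_hourly_rate, training_hours,
                  storage_gb=0.0, storage_rate=0.0,
                  transfer_gb=0.0, transfer_rate=0.0):
    """Compute + storage + data transfer, per the formula in the text."""
    compute = num_gpus * gpu_hourly_rate * training_hours
    return compute + storage_gb * storage_rate + transfer_gb * transfer_rate

# Fine-tuning: 8x A100 spot for 4 hours at $1.10/hr
print(round(training_cost(8, 1.10, 4), 2))      # 35.2

# Pretraining: 256x H100 for 2 weeks (336 h) at $3.75/hr
print(round(training_cost(256, 3.75, 336), 2))  # 322560.0
```

Logging this estimate alongside each run in your experiment tracker makes the 20-30% failed-run overhead visible instead of hidden in the cloud bill.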

Course Summary

You have completed the Designing ML Training Pipelines course. Here is a recap of the key concepts from each lesson:

| Lesson | Key Takeaway |
| --- | --- |
| 1. Pipeline Architecture | Move from notebooks to automated pipelines with 6 stages. Choose your orchestrator (Airflow, Kubeflow, Prefect, Dagster) based on team and infrastructure. |
| 2. Data Preparation | Validate data before training (Great Expectations, TFDV). Use time-based splits. Handle imbalance with class weights or focal loss. |
| 3. Distributed Training | Use DDP for models that fit on one GPU, FSDP/DeepSpeed for larger models. Expect 85-95% scaling efficiency on single-node NVLink. |
| 4. Experiment Tracking | Track every run with MLflow or W&B. Use the model registry for staging-to-production promotion. Never skip this step. |
| 5. GPU Scheduling | Use Kueue or Volcano for batch scheduling. Implement fair-share with quotas and preemption. Track GPU costs per team and job. |
| 6. CI/CD for ML | Automate retraining with multiple triggers (schedule, drift, code change). Quality gates block bad models. Always have a tested rollback plan. |
| 7. Best Practices | Follow the production checklist. Optimize costs with spot instances, mixed precision, and right-sized GPUs. Debug systematically — check data first. |
💡 Keep learning: ML training infrastructure evolves rapidly. New hardware (H100, MI300X, Trainium), new parallelism strategies (ring attention, expert parallelism), and new orchestration tools emerge constantly. The fundamentals in this course — reproducibility, validation, distributed training, experiment tracking, scheduling, and automation — remain constant even as implementations change.