Best Practices & Checklist
Everything you need to review before launching a training pipeline in production. This lesson consolidates the entire course into actionable checklists, cost optimization strategies, debugging guides, and answers to the most common questions engineers have about production ML training.
Production Training Pipeline Checklist
Use this checklist every time you build or modify a training pipeline. Each item maps to a lesson in this course.
Data Layer
- Data validation runs before training — schema checks, volume checks, drift detection (Lesson 2)
- Data is versioned — every training run links to a specific, immutable dataset snapshot
- Splits are deterministic and time-aware — no future data leaking into training set
- Class imbalance is handled — class weights, focal loss, or oversampling strategy in place
- Data pipeline has monitoring — alerts on volume drops, schema changes, or missing data
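The deterministic, time-aware split from the checklist can be sketched as a small helper. This is a minimal illustration with hypothetical field names (`timestamp`), not a prescribed API; real pipelines would operate on DataFrames or dataset snapshots.

```python
# Sketch: deterministic, time-aware split - no future data leaks into training.
# Field names (ts_key) are illustrative.
def time_based_split(records, train_end, val_end, ts_key="timestamp"):
    """Split chronologically: train < train_end <= val < val_end <= test."""
    train = [r for r in records if r[ts_key] < train_end]
    val = [r for r in records if train_end <= r[ts_key] < val_end]
    test = [r for r in records if r[ts_key] >= val_end]
    # Determinism check: every record lands in exactly one split
    assert len(train) + len(val) + len(test) == len(records)
    return train, val, test

records = [{"timestamp": t} for t in range(100)]
train, val, test = time_based_split(records, train_end=70, val_end=85)
```

Because the split depends only on fixed cutoff timestamps (not on random shuffling), re-running it on the same snapshot always reproduces the same splits.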
Training Layer
- Training is reproducible — git commit, data version, environment, and seeds are logged (Lesson 1)
- Checkpointing is enabled — save every N epochs/steps so you can resume after failures
- Distributed training is configured correctly — DDP for single-node, FSDP/DeepSpeed for large models (Lesson 3)
- Mixed precision is enabled — BF16 or FP16 for 2x throughput improvement
- Gradient clipping is set — prevents training instability from gradient explosions
- Learning rate schedule includes warmup — especially important for distributed training
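The warmup item above can be made concrete. Here is a minimal sketch of a linear-warmup plus cosine-decay schedule as a pure function; the `base_lr`, `warmup_steps`, and `total_steps` values are illustrative defaults, not recommendations.

```python
import math

# Sketch: linear warmup followed by cosine decay.
# All hyperparameter defaults here are illustrative.
def lr_at_step(step, base_lr=3e-4, warmup_steps=1000, total_steps=100_000):
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear ramp from 0 to base_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```

In PyTorch you would typically wire a function like this into `torch.optim.lr_scheduler.LambdaLR`, and pair it with the gradient-clipping item by calling `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` between `loss.backward()` and `optimizer.step()`.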
Tracking Layer
- All experiments are tracked — params, metrics, artifacts logged to MLflow/W&B (Lesson 4)
- Model registry is configured — staging/production stages with promotion gates
- Evaluation compares against baseline — every new model is tested against current production model
Infrastructure Layer
- GPU scheduling uses a batch scheduler — Kueue or Volcano, not default K8s scheduler (Lesson 5)
- Resource quotas are set per team — guaranteed allocation + borrowing + preemption
- Cost tracking is in place — per-team, per-job GPU cost reporting
- Spot instances are used for experiments — with checkpointing every 15-30 minutes
Deployment Layer
- CI/CD pipeline automates the full lifecycle — from data validation to deployment (Lesson 6)
- Quality gates block bad models — accuracy, latency, fairness, model size checks
- Rollback is tested and automated — previous model version loads in under 60 seconds
- Canary deployment is configured — 5% traffic test before full rollout
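A quality gate like the one described above is usually just a function that compares a candidate's metrics against the production baseline and a set of budgets. The thresholds below (0.2 pp accuracy tolerance, 100 ms p99, 500 MB size) are illustrative assumptions, not values from this course.

```python
# Sketch: quality gates that block bad models from promotion.
# All thresholds here are illustrative assumptions.
def passes_quality_gates(candidate, baseline):
    """Return (ok, reasons); promotion proceeds only if ok is True."""
    reasons = []
    if candidate["accuracy"] < baseline["accuracy"] - 0.002:
        reasons.append("accuracy regressed vs production baseline")
    if candidate["p99_latency_ms"] > 100:
        reasons.append("p99 latency above 100 ms budget")
    if candidate["model_size_mb"] > 500:
        reasons.append("model exceeds 500 MB size budget")
    return len(reasons) == 0, reasons
```

Returning the failure reasons, not just a boolean, makes the CI/CD logs actionable when a model is blocked.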
Cost Optimization Strategies
Training costs can spiral quickly. Here are the most impactful cost reduction strategies, ranked by typical savings.
| Strategy | Typical Savings | Effort | Trade-offs |
|---|---|---|---|
| Spot/preemptible instances | 60-70% | Low (add checkpointing) | Jobs may be interrupted; need good resume logic |
| Mixed precision (BF16) | 40-50% (2x throughput) | Low (one line of code) | Rare numerical issues with some architectures |
| Right-size GPU selection | 30-50% | Medium (profiling) | Requires understanding memory/compute needs |
| Gradient accumulation | 20-40% (fewer GPUs) | Low | Slower per-step; equivalent to larger batch size |
| Early stopping | 20-50% (fewer epochs) | Low | Must monitor correctly to avoid underfitting |
| Efficient data loading | 10-30% (less GPU idle time) | Medium | Requires profiling DataLoader bottlenecks |
| Reserved instances | 30-40% (1-year commit) | None (procurement) | Capital commitment; less flexibility |
```python
# Quick wins: cost optimization in training code
import torch
from torch.utils.data import DataLoader

# 1. Mixed precision - roughly 2x throughput
# Note: BF16 autocast does not need a GradScaler; a GradScaler is only
# required when training in FP16.
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    loss = model(**batch).loss

# 2. Gradient accumulation - large effective batch on fewer GPUs
accumulation_steps = 4  # Simulates 4x the batch size
for i, batch in enumerate(loader):
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# 3. torch.compile - often 10-30% speedup with minimal code changes
model = torch.compile(model, mode="max-autotune")

# 4. Efficient DataLoader - prevent the GPU idling while waiting for data
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # Match available CPU cores
    pin_memory=True,          # Faster CPU->GPU transfer
    prefetch_factor=2,        # Prefetch 2 batches per worker
    persistent_workers=True,  # Don't respawn workers each epoch
)

# 5. Checkpoint periodically for spot-instance resilience
# (torch.save cannot write to an s3:// URI directly - save locally,
#  then upload to durable storage with your cloud SDK, e.g. boto3.)
if step % 500 == 0:
    local_path = f"checkpoints/run-{run_id}/step-{step}.pt"
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, local_path)
```
Debugging Training Failures
Training failures are inevitable. Here is a systematic debugging guide for the most common issues.
| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss is NaN | Learning rate too high, bad data (NaN/Inf in features), or numeric overflow | Reduce LR by 10x, add data validation, enable gradient clipping, check for division by zero |
| Loss plateaus immediately | Learning rate too low, frozen layers, or broken data pipeline returning same batch | Increase LR, verify requires_grad=True, add data logging to confirm batch variety |
| CUDA OOM | Batch size too large, model too large, or memory leak | Reduce batch size, use gradient accumulation, enable FSDP, check for tensor accumulation in lists |
| DDP hangs | One rank crashed silently, or mismatched forward pass (some ranks skip a branch) | Set NCCL_DEBUG=INFO, add timeout to dist.barrier(), ensure all ranks execute same operations |
| Training slow | Data loading bottleneck, GPU underutilized, or excessive logging | Profile with torch.profiler, increase num_workers, reduce logging frequency |
| Val loss goes up | Overfitting, data leakage in splits, or LR too high late in training | Add regularization (dropout, weight decay), verify split integrity, add LR decay |
```python
# Debug toolkit - add to every training script
import torch
from torch.profiler import (profile, ProfilerActivity,
                            tensorboard_trace_handler)

# 1. Detect NaN/Inf early (slow - enable only while debugging)
torch.autograd.set_detect_anomaly(True)

# 2. Monitor GPU memory
def log_gpu_memory(step):
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    peak = torch.cuda.max_memory_allocated() / 1e9
    print(f"[Step {step}] GPU memory: allocated={allocated:.1f}GB, "
          f"reserved={reserved:.1f}GB, peak={peak:.1f}GB")

# 3. Profile training bottlenecks
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(loader):
        if step >= 10:
            break
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()

# View with: tensorboard --logdir=./profiler_logs
```
Frequently Asked Questions
How often should I retrain my model?
It depends on how fast your data distribution changes. For recommendation systems and ad models, daily or weekly retraining is common because user behavior shifts rapidly. For fraud detection, weekly to monthly is typical. For NLP models on stable domains, monthly to quarterly. The right answer is: retrain when your monitoring shows performance degradation. Start with weekly, measure drift, and adjust. Never retrain more often than your data meaningfully changes.
DDP vs FSDP vs DeepSpeed — which should I use?
DDP if your model fits on a single GPU (most models under 3B parameters on A100 80GB). It is the simplest and most efficient option. FSDP if you are using PyTorch and your model does not fit on one GPU. It is now the recommended PyTorch-native solution. DeepSpeed if you need ZeRO-Offload (CPU offloading), have very large models (70B+), or are using the HuggingFace ecosystem which has deep DeepSpeed integration. In practice, FSDP and DeepSpeed ZeRO-2/3 achieve similar performance for most workloads.
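The decision rule above can be sketched as a rough heuristic. The constants here are assumptions for illustration: ~16 bytes per parameter covers FP16 weights and gradients plus FP32 Adam states in mixed precision, and the 3B/70B cut points echo this answer's guidance, not hard limits.

```python
# Sketch: rule-of-thumb strategy selector. The 16 bytes/param estimate
# (FP16 weights + grads + FP32 Adam states) and the thresholds are
# heuristics, not hard limits - always validate against your workload.
def pick_strategy(param_count_b, gpu_mem_gb=80, needs_cpu_offload=False):
    if needs_cpu_offload or param_count_b >= 70:
        return "deepspeed"  # ZeRO-Offload / very large models
    # Leave ~10% headroom for activations and fragmentation
    fits_one_gpu = param_count_b * 16 <= gpu_mem_gb * 0.9
    return "ddp" if fits_one_gpu else "fsdp"
```

With an 80 GB GPU this heuristic keeps models up to roughly 3-4B parameters on DDP and pushes larger ones to FSDP, which matches the guidance above.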
How do I handle checkpointing for spot instances?
Save checkpoints to durable storage (S3, GCS) every 15-30 minutes. Each checkpoint should include: model state, optimizer state, learning rate scheduler state, current epoch/step, random number generator states, and the data loader position. On restart, load the latest checkpoint and resume. With FSDP/DeepSpeed, use their built-in distributed checkpointing, which handles sharded state across GPUs. Budget 2-5 minutes per checkpoint for large models, and make sure the interval between checkpoints is shorter than the average time between spot reclamations (~2 hours on AWS).
What is the most cost-effective GPU for training in 2026?
For fine-tuning and models under 7B: NVIDIA L4 or A10G on spot instances — best price/performance for smaller workloads. For full training of 7B-13B models: A100 80GB remains the sweet spot due to mature ecosystem and competitive spot pricing. For large-scale training (70B+): H100 80GB is worth the premium due to 3x throughput over A100. For budget-constrained teams: AMD MI250X on cloud providers that support it (30-40% cheaper than equivalent NVIDIA). Always benchmark your specific workload — raw TFLOPS do not tell the whole story.
How do I prevent data leakage in my pipeline?
The most common data leakage sources: (1) Temporal leakage — using future data to predict the past. Always split by time, not randomly. (2) Target leakage — a feature that directly encodes the label (e.g., including "refund_amount" when predicting if a transaction is fraudulent). Remove features derived from the target. (3) Group leakage — same user/entity appearing in both train and test sets. Use GroupKFold or StratifiedGroupKFold. (4) Preprocessing leakage — fitting normalizers or encoders on the full dataset including test data. Always fit on train only, transform on test. Add assertions in your pipeline that verify no overlap between splits.
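The split-integrity assertions suggested above might look like this. Field names (`user_id`, `timestamp`) are illustrative; this sketch covers the group and temporal cases, while preprocessing leakage is prevented by fitting transformers on the train split only.

```python
# Sketch: split-integrity assertions for a pipeline. Field names are
# illustrative; covers group leakage and temporal leakage.
def assert_no_leakage(train, test, group_key="user_id", ts_key="timestamp"):
    train_groups = {r[group_key] for r in train}
    test_groups = {r[group_key] for r in test}
    # Group leakage: the same entity must not appear in both splits
    assert not (train_groups & test_groups), "group leakage between splits"
    # Temporal leakage: no training record may postdate the test period start
    assert max(r[ts_key] for r in train) <= min(r[ts_key] for r in test), \
        "temporal leakage: training data overlaps test period"
```

Running this on every pipeline execution turns silent leakage into a loud failure before any GPU time is spent.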
Should I use Airflow, Kubeflow, Prefect, or Dagster?
Use Airflow if your company already uses it for data engineering and you want one orchestrator for everything. It works but is not ML-native. Use Kubeflow Pipelines if you are all-in on Kubernetes and GCP (Vertex AI), and need built-in GPU scheduling. The learning curve is steep. Use Prefect if you want the fastest time to production with the least complexity — best for small to medium ML teams. Use Dagster if you care about data lineage and asset-centric workflows — best when your ML pipeline is tightly coupled with data engineering. Start with the simplest option that meets your needs.
How do I calculate my training cost per model?
Formula: (num_GPUs x GPU_hourly_rate x training_hours) + (storage_GB x storage_rate) + (data_transfer_GB x transfer_rate). Example: Fine-tuning Llama-3-8B on 8x A100 spot instances for 4 hours = 8 x $1.10/hr x 4h = $35.20 compute + ~$2 storage = ~$37 total. For full pretraining of a 70B model: 256x H100 for 2 weeks = 256 x $3.75/hr x 336h = $322,560. Always track actual costs per run in your experiment tracker and compare against estimates. Include failed runs in your total cost calculation — they typically add 20-30% overhead.
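The formula above as a function, with the failed-run overhead folded in as a multiplier. The rates are the example's own assumptions, not quoted cloud prices.

```python
# Sketch: training cost per model. Rates come from the caller; the 25%
# default overhead is the midpoint of the 20-30% failed-run range above.
def training_cost(num_gpus, gpu_hourly_rate, hours,
                  storage_gb=0.0, storage_rate=0.0,
                  transfer_gb=0.0, transfer_rate=0.0,
                  failed_run_overhead=0.25):
    compute = num_gpus * gpu_hourly_rate * hours
    base = compute + storage_gb * storage_rate + transfer_gb * transfer_rate
    return base * (1 + failed_run_overhead)

# Fine-tuning example from the text: 8x A100 spot at $1.10/hr for 4h
cost = training_cost(8, 1.10, 4, failed_run_overhead=0.0)  # ~$35.20 compute
```

Logging this number alongside each run in your experiment tracker makes the estimate-vs-actual comparison a query instead of a spreadsheet exercise.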
Course Summary
You have completed the Designing ML Training Pipelines course. Here is a recap of the key concepts from each lesson:
| Lesson | Key Takeaway |
|---|---|
| 1. Pipeline Architecture | Move from notebooks to automated pipelines with 6 stages. Choose your orchestrator (Airflow, Kubeflow, Prefect, Dagster) based on team and infrastructure. |
| 2. Data Preparation | Validate data before training (Great Expectations, TFDV). Use time-based splits. Handle imbalance with class weights or focal loss. |
| 3. Distributed Training | Use DDP for models that fit on one GPU, FSDP/DeepSpeed for larger models. Expect 85-95% scaling efficiency on single-node NVLink. |
| 4. Experiment Tracking | Track every run with MLflow or W&B. Use the model registry for staging-to-production promotion. Never skip this step. |
| 5. GPU Scheduling | Use Kueue or Volcano for batch scheduling. Implement fair-share with quotas and preemption. Track GPU costs per team and job. |
| 6. CI/CD for ML | Automate retraining with multiple triggers (schedule, drift, code change). Quality gates block bad models. Always have a tested rollback plan. |
| 7. Best Practices | Follow the production checklist. Optimize costs with spot instances, mixed precision, and right-sized GPUs. Debug systematically — check data first. |
Lilly Tech Systems