GPU Programming Best Practices
Practical guidance for optimizing GPU performance, debugging CUDA issues, managing memory, and deploying GPU workloads in production.
Performance Optimization Checklist
Enable Mixed Precision
Use AMP (Automatic Mixed Precision) for ~2x speedup and 50% memory reduction. This is the single highest-impact optimization.
Use torch.compile
Compile your model for automatic kernel fusion and optimization. Use mode="max-autotune" for best results.
Optimize Data Loading
Use num_workers > 0, pin_memory=True, and persistent_workers=True in DataLoader. Data loading should not be the bottleneck (see the combined sketch after this checklist).
Maximize GPU Utilization
Use the largest batch size that fits in memory. Check GPU utilization with nvidia-smi; aim for >90%.
Profile Before Optimizing
Use PyTorch Profiler or Nsight Systems to identify actual bottlenecks. Don't guess.
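To make the first three items concrete, here is a minimal PyTorch sketch of a training setup that combines AMP, torch.compile, and a tuned DataLoader. The toy model, dataset, batch size, and learning rate are placeholders for illustration, not values prescribed by this guide.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")

# Toy stand-ins so the sketch runs end to end; swap in your own model and dataset.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
train_dataset = TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,)))
loss_fn = nn.CrossEntropyLoss()

model = torch.compile(model, mode="max-autotune")  # kernel fusion + autotuned kernels

loader = DataLoader(
    train_dataset,
    batch_size=256,           # use the largest batch size that fits in memory
    num_workers=4,            # overlap data loading with GPU compute
    pin_memory=True,          # page-locked host memory for faster host-to-device copies
    persistent_workers=True,  # keep workers alive across epochs
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()  # FP16 needs loss scaling; BF16 does not

for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):  # AMP region
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

For the last item, profiling a handful of steps with PyTorch Profiler (reusing the names defined above) looks roughly like this:

```python
from torch.profiler import profile, ProfilerActivity

# Profile a few steps to find the real bottleneck before changing anything.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (inputs, targets) in enumerate(loader):
        if step >= 5:
            break
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        loss = loss_fn(model(inputs), targets)
        loss.backward()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```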
Memory Optimization
| Technique | Memory Savings | Speed Impact |
|---|---|---|
| Mixed precision (FP16) | ~50% | Faster (Tensor Cores) |
| Gradient checkpointing | 60-80% | ~30% slower (recompute) |
| Gradient accumulation | Linear with steps | Minimal |
| torch.no_grad() for eval | ~50% (no grad graph) | Faster |
| In-place operations | Variable | No overhead |
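Gradient checkpointing and gradient accumulation both change the training loop itself, so a hedged sketch may help; the module names, sizes, and accumulation factor below are illustrative assumptions, not recommendations from the table.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")

# Toy stand-ins; replace with your own modules and data.
expensive_block = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).to(device)
head = nn.Linear(512, 10).to(device)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,))),
                    batch_size=64)
optimizer = torch.optim.AdamW(list(expensive_block.parameters()) + list(head.parameters()), lr=3e-4)

accumulation_steps = 4  # effective batch size = 64 * 4, with only 64 samples resident at once

for step, (inputs, targets) in enumerate(loader):
    inputs, targets = inputs.to(device), targets.to(device)

    # Gradient checkpointing: activations of `expensive_block` are not stored;
    # they are recomputed during backward (extra compute traded for memory).
    hidden = checkpoint(expensive_block, inputs, use_reentrant=False)

    # Gradient accumulation: scale the loss so accumulated gradients average correctly.
    loss = loss_fn(head(hidden), targets) / accumulation_steps
    loss.backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```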
Common CUDA Errors
- CUDA out of memory: Reduce batch size, enable gradient checkpointing, use mixed precision, or move to a bigger GPU.
- CUDA device-side assert: Usually caused by invalid tensor indices (e.g., label out of range). Set CUDA_LAUNCH_BLOCKING=1 to get the exact error line.
- CUBLAS error: Often caused by mismatched tensor shapes or NaN values. Check input dimensions carefully.
- NCCL timeout: In distributed training, usually means one GPU crashed or network is slow. Check all processes are running.
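As a small illustration of the debugging workflow for the first two errors, the sketch below sets CUDA_LAUNCH_BLOCKING before CUDA is initialized and handles an out-of-memory failure explicitly; train_one_epoch is a hypothetical stand-in for your own training entry point.

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before CUDA initializes, so set it before
# importing torch (or export it in the shell that launches the script).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch


def train_one_epoch():
    ...  # hypothetical training loop; replace with your own


try:
    train_one_epoch()
except torch.cuda.OutOfMemoryError:
    # Typical first response: free cached blocks, then retry with a smaller
    # batch size, gradient checkpointing, or mixed precision enabled.
    torch.cuda.empty_cache()
    raise
```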
Frequently Asked Questions
Which GPU should I buy?
For learning and small models: RTX 4070 (12GB). For serious training: RTX 4090 (24GB) or A6000 (48GB). For large models: H100 (80GB HBM3). Cloud GPUs (Lambda, RunPod, AWS) are often more cost-effective than buying hardware.
Should I use FP16 or BF16?
BF16 (bfloat16) is preferred on Ampere and newer GPUs (A100, H100, RTX 30/40 series). It has the same range as FP32 (avoiding overflow issues) with reduced precision. FP16 requires gradient scaling to prevent underflow. If your GPU supports BF16, use it.
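One way to act on this is to pick the autocast dtype from the hardware at runtime; a minimal sketch:

```python
import torch

# Prefer BF16 where the GPU supports it; otherwise fall back to FP16 with loss scaling.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)  # scaling only needed for FP16

with torch.autocast(device_type="cuda", dtype=amp_dtype):
    ...  # forward pass and loss computation go here
```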
How do I debug GPU memory leaks?
Use torch.cuda.memory_summary() to see detailed allocation info. Check for tensors that are accidentally kept in scope (e.g., appending losses to a list without .item()). The PyTorch memory snapshot tool can trace allocations to source code lines.
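A short sketch of these checks follows; the snapshot hooks shown are the ones documented for recent PyTorch releases, so treat the exact names as version-dependent.

```python
import torch

print(torch.cuda.memory_summary())  # per-device allocator statistics

# Common leak: storing the loss tensor keeps its whole autograd graph alive.
losses = []
# losses.append(loss)         # holds the graph in GPU memory
# losses.append(loss.item())  # stores a plain Python float instead

# Record allocation history and dump it for the PyTorch memory-viz tool.
torch.cuda.memory._record_memory_history()
# ... run a few training steps here ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```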