GPU Programming Best Practices

Practical guidance for optimizing GPU performance, debugging CUDA issues, managing memory, and deploying GPU workloads in production.

Performance Optimization Checklist

  1. Enable Mixed Precision

    Use AMP (Automatic Mixed Precision) for up to ~2x speedup and roughly 50% memory reduction. This is the single highest-impact optimization.

  2. Use torch.compile

    Compile your model for automatic kernel fusion and optimization. mode="max-autotune" gives the best runtime performance at the cost of longer compilation time.

  3. Optimize Data Loading

    Use num_workers > 0, pin_memory=True, and persistent_workers=True in DataLoader. Data loading should not be the bottleneck.

  4. Maximize GPU Utilization

    Use the largest batch size that fits in memory. Check GPU utilization with nvidia-smi — aim for >90%.

  5. Profile Before Optimizing

    Use PyTorch Profiler or Nsight Systems to identify actual bottlenecks. Don't guess.
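
Items 1-3 of the checklist can be sketched in a single training step. The model, dataset, and hyperparameters below are illustrative placeholders, not part of the course; the sketch falls back to CPU (with AMP disabled) so it runs anywhere:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
# Item 2: torch.compile would wrap the model for kernel fusion, e.g.
#   model = torch.compile(model, mode="max-autotune")

# Item 3: keep data loading off the critical path.
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=64, num_workers=2,
                    pin_memory=use_amp, persistent_workers=True)

opt = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # Item 1: AMP loss scaling
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, enabled=use_amp):  # mixed-precision forward
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()  # scaled backward prevents FP16 gradient underflow
    scaler.step(opt)
    scaler.update()
```

On a CUDA device, GradScaler and autocast are active; on CPU both become no-ops, so the same loop works in either environment.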

Memory Optimization

Common memory-saving techniques, with typical savings and speed impact:

  • Mixed precision (FP16): ~50% savings; faster (uses Tensor Cores).
  • Gradient checkpointing: 60-80% savings; ~30% slower (activations are recomputed).
  • Gradient accumulation: savings scale linearly with accumulation steps; minimal speed impact.
  • torch.no_grad() for eval: ~50% savings (no gradient graph); faster.
  • In-place operations: variable savings; no overhead.
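
Gradient accumulation is the simplest of these to sketch. Here we simulate an effective batch of 64 with four micro-batches of 16; the model and data are illustrative placeholders:

```python
import torch
from torch import nn

model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # effective batch = accum_steps * micro-batch size

opt.zero_grad(set_to_none=True)
for _ in range(accum_steps):
    x = torch.randn(16, 8)
    # Divide by accum_steps so the accumulated grads equal the full-batch average.
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()  # .grad buffers accumulate across micro-batches
opt.step()  # one optimizer update per effective batch
```

Only one micro-batch's activations are alive at a time, which is why memory savings scale with the number of accumulation steps.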

Common CUDA Errors

  • CUDA out of memory: Reduce batch size, enable gradient checkpointing, use mixed precision, or move to a bigger GPU.
  • CUDA device-side assert: Usually caused by invalid tensor indices (e.g., label out of range). Set CUDA_LAUNCH_BLOCKING=1 to get the exact error line.
  • CUBLAS error: Often caused by mismatched tensor shapes or NaN values. Check input dimensions carefully.
  • NCCL timeout: In distributed training, usually means one GPU process crashed or the network is slow. Check that all ranks are still running.
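
For the device-side assert case, a cheap host-side check catches bad labels before they reach the loss kernel. This helper is our own illustration, not a PyTorch API:

```python
import torch

def check_labels(labels: torch.Tensor, num_classes: int) -> None:
    """Fail on the host with a clear message instead of triggering an
    opaque CUDA device-side assert inside the loss kernel."""
    bad = (labels < 0) | (labels >= num_classes)
    if bad.any():
        raise ValueError(
            f"labels outside [0, {num_classes}) at indices "
            f"{bad.nonzero().flatten().tolist()}"
        )

# Label 10 is invalid for 10 classes; this yields a usable traceback,
# where the GPU kernel would only report "device-side assert triggered".
try:
    check_labels(torch.tensor([3, 9, 10]), num_classes=10)
except ValueError as err:
    print(err)
```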

Frequently Asked Questions

Which GPU should I buy?

    For learning and small models: RTX 4070 (12GB). For serious training: RTX 4090 (24GB) or A6000 (48GB). For large models: H100 (80GB HBM3). Cloud GPUs (Lambda, RunPod, AWS) are often more cost-effective than buying hardware.

Should I use FP16 or BF16?

    BF16 (bfloat16) is preferred on Ampere and newer GPUs (A100, H100, RTX 30/40 series). It has the same exponent range as FP32 (avoiding overflow issues) with reduced mantissa precision. FP16 requires gradient scaling to prevent underflow. If your GPU supports BF16, use it.
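
A minimal BF16 autocast sketch, assuming nothing beyond stock PyTorch. It runs on CPU too, where PyTorch also supports bfloat16 autocast; on CUDA, torch.cuda.is_bf16_supported() reports whether the GPU has native BF16 (Ampere or newer):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 4, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = x @ x  # matmul executes in bfloat16 inside the autocast region
```

Because BF16 keeps FP32's exponent range, no GradScaler is needed in the BF16 training path.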

How do I track down a GPU memory leak?

    Use torch.cuda.memory_summary() to see detailed allocation info. Check for tensors that are accidentally kept in scope (e.g., appending losses to a list without .item()). The PyTorch memory snapshot tool can trace allocations to source code lines.
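
The .item() pitfall is easy to demonstrate with a toy model (the model and loop here are placeholders): appending loss tensors keeps each iteration's autograd graph alive, while .item() detaches to a plain float.

```python
import torch

model = torch.nn.Linear(4, 1)
losses_leaky, losses_ok = [], []

for _ in range(3):
    loss = model(torch.randn(2, 4)).pow(2).mean()
    losses_leaky.append(loss)       # retains the autograd graph -> memory grows
    losses_ok.append(loss.item())   # plain Python float -> safe to keep
```

Each tensor in losses_leaky still carries a grad_fn, so every intermediate activation behind it stays allocated until the list is dropped.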

Congratulations! You have completed the GPU Programming for AI course. You now understand CUDA fundamentals, cuDNN, PyTorch GPU optimization, and multi-GPU training. Profile your workloads, use mixed precision, and scale up with DDP!