GPU Programming Best Practices
Practical guidance for optimizing GPU performance, debugging CUDA issues, managing memory, and deploying GPU workloads in production.
Performance Optimization Checklist
Enable Mixed Precision
Use AMP (Automatic Mixed Precision) for ~2x speedup and 50% memory reduction. This is the single highest-impact optimization.
Use torch.compile
Compile your model for automatic kernel fusion and optimization. Use mode="max-autotune" for best results.
Optimize Data Loading
Use num_workers > 0, pin_memory=True, and persistent_workers=True in DataLoader. Data loading should not be the bottleneck (see the combined sketch after this checklist).
Maximize GPU Utilization
Use the largest batch size that fits in memory. Check GPU utilization with nvidia-smi; aim for >90%.
Profile Before Optimizing
Use PyTorch Profiler or Nsight Systems to identify actual bottlenecks. Don't guess.
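To make the first three items concrete, here is a minimal PyTorch sketch of a training setup that combines AMP, torch.compile, and a tuned DataLoader. The toy model, dataset, batch size, and learning rate are placeholders for illustration, not values prescribed by this guide.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")

# Toy stand-ins so the sketch runs end to end; swap in your own model and dataset.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
train_dataset = TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,)))
loss_fn = nn.CrossEntropyLoss()

model = torch.compile(model, mode="max-autotune")  # kernel fusion + autotuned kernels

loader = DataLoader(
    train_dataset,
    batch_size=256,           # use the largest batch size that fits in memory
    num_workers=4,            # overlap data loading with GPU compute
    pin_memory=True,          # page-locked host memory for faster host-to-device copies
    persistent_workers=True,  # keep workers alive across epochs
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()  # FP16 needs loss scaling; BF16 does not

for inputs, targets in loader:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):  # AMP region
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

For the last item, profiling a handful of steps with PyTorch Profiler (reusing the names defined above) looks roughly like this:

```python
from torch.profiler import profile, ProfilerActivity

# Profile a few steps to find the real bottleneck before changing anything.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (inputs, targets) in enumerate(loader):
        if step >= 5:
            break
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        loss = loss_fn(model(inputs), targets)
        loss.backward()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```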
Memory Optimization
| Technique | Memory Savings | Speed Impact |
|---|---|---|
| Mixed precision (FP16) | ~50% | Faster (Tensor Cores) |
| Gradient checkpointing | 60-80% | ~30% slower (recompute) |
| Gradient accumulation | Linear with steps | Minimal |
| torch.no_grad() for eval | ~50% (no grad graph) | Faster |
| In-place operations | Variable | No overhead |
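Gradient checkpointing and gradient accumulation both change the training loop itself, so a hedged sketch may help; the module names, sizes, and accumulation factor below are illustrative assumptions, not recommendations from the table.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")

# Toy stand-ins; replace with your own modules and data.
expensive_block = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).to(device)
head = nn.Linear(512, 10).to(device)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,))),
                    batch_size=64)
optimizer = torch.optim.AdamW(list(expensive_block.parameters()) + list(head.parameters()), lr=3e-4)

accumulation_steps = 4  # effective batch size = 64 * 4, with only 64 samples resident at once

for step, (inputs, targets) in enumerate(loader):
    inputs, targets = inputs.to(device), targets.to(device)

    # Gradient checkpointing: activations of `expensive_block` are not stored;
    # they are recomputed during backward (extra compute traded for memory).
    hidden = checkpoint(expensive_block, inputs, use_reentrant=False)

    # Gradient accumulation: scale the loss so accumulated gradients average correctly.
    loss = loss_fn(head(hidden), targets) / accumulation_steps
    loss.backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```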
Common CUDA Errors
- CUDA out of memory: Reduce batch size, enable gradient checkpointing, use mixed precision, or move to a bigger GPU.
- CUDA device-side assert: Usually caused by invalid tensor indices (e.g., label out of range). Set CUDA_LAUNCH_BLOCKING=1 to get the exact error line.
- CUBLAS error: Often caused by mismatched tensor shapes or NaN values. Check input dimensions carefully.
- NCCL timeout: In distributed training, usually means one GPU crashed or network is slow. Check all processes are running.
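As a small illustration of the debugging workflow for the first two errors, the sketch below sets CUDA_LAUNCH_BLOCKING before CUDA is initialized and handles an out-of-memory failure explicitly; train_one_epoch is a hypothetical stand-in for your own training entry point.

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before CUDA initializes, so set it before
# importing torch (or export it in the shell that launches the script).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch


def train_one_epoch():
    ...  # hypothetical training loop; replace with your own


try:
    train_one_epoch()
except torch.cuda.OutOfMemoryError:
    # Typical first response: free cached blocks, then retry with a smaller
    # batch size, gradient checkpointing, or mixed precision enabled.
    torch.cuda.empty_cache()
    raise
```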
Frequently Asked Questions
Which GPU should I buy?
For learning and small models: RTX 4070 (12GB). For serious training: RTX 4090 (24GB) or A6000 (48GB). For large models: H100 (80GB HBM3). Cloud GPUs (Lambda, RunPod, AWS) are often more cost-effective than buying hardware.
Should I use FP16 or BF16?
BF16 (bfloat16) is preferred on Ampere and newer GPUs (A100, H100, RTX 30/40 series). It has the same range as FP32 (avoiding overflow issues) with reduced precision. FP16 requires gradient scaling to prevent underflow. If your GPU supports BF16, use it.
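One way to act on this is to pick the autocast dtype from the hardware at runtime; a minimal sketch:

```python
import torch

# Prefer BF16 where the GPU supports it; otherwise fall back to FP16 with loss scaling.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)  # scaling only needed for FP16

with torch.autocast(device_type="cuda", dtype=amp_dtype):
    ...  # forward pass and loss computation go here
```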
How do I debug GPU memory leaks?
Use torch.cuda.memory_summary() to see detailed allocation info. Check for tensors that are accidentally kept in scope (e.g., appending losses to a list without .item()). The PyTorch memory snapshot tool can trace allocations to source code lines.
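A short sketch of these checks follows; the snapshot hooks shown are the ones documented for recent PyTorch releases, so treat the exact names as version-dependent.

```python
import torch

print(torch.cuda.memory_summary())  # per-device allocator statistics

# Common leak: storing the loss tensor keeps its whole autograd graph alive.
losses = []
# losses.append(loss)         # holds the graph in GPU memory
# losses.append(loss.item())  # stores a plain Python float instead

# Record allocation history and dump it for the PyTorch memory-viz tool.
torch.cuda.memory._record_memory_history()
# ... run a few training steps here ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```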