Intermediate
PyTorch GPU Acceleration
PyTorch provides seamless GPU acceleration. Learn to move data, use mixed precision, compile models, and profile performance.
Moving to GPU
Python - Basic GPU Usage
import torch

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move tensors to GPU
x = torch.randn(1000, 1000, device=device)
y = torch.randn(1000, 1000, device=device)
z = x @ y  # Matrix multiply on GPU

# Move model to GPU
model = MyModel().to(device)

# Training loop
for inputs, labels in dataloader:
    inputs = inputs.to(device)
    labels = labels.to(device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
Mixed Precision Training
Mixed precision runs most computations in FP16 while keeping numerically sensitive operations in FP32, typically giving around a 2x speedup and roughly halving activation memory:
Python - Automatic Mixed Precision
from torch.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, labels in dataloader:
    inputs = inputs.to(device)
    labels = labels.to(device)

    # Forward pass in FP16 where safe
    with autocast(device_type="cuda"):
        outputs = model(inputs)
        loss = criterion(outputs, labels)

    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
torch.compile
PyTorch 2.0 introduced torch.compile(), which JIT-compiles your model into optimized kernels for significant speedups:
Python - torch.compile
# One-line speedup: compiles and optimizes the model
model = torch.compile(model)

# Modes: "default", "reduce-overhead", "max-autotune"
model = torch.compile(model, mode="max-autotune")

# First call is slow (compilation); subsequent calls are fast
output = model(input_tensor)  # Compiles here
output = model(input_tensor)  # Fast!
GPU Memory Management
- Monitor usage: torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated()
- Gradient checkpointing: Trade compute for memory by recomputing activations during the backward pass
- Clear cache: torch.cuda.empty_cache() releases memory cached by the allocator back to the GPU
- Pin memory: Use pin_memory=True in DataLoader for faster CPU-to-GPU transfers
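As a minimal sketch of gradient checkpointing, torch.utils.checkpoint.checkpoint recomputes a segment's activations during the backward pass instead of storing them. The two-stage model below is purely illustrative (and runs on CPU for simplicity):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy two-stage network (illustrative only)
stage1 = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
stage2 = torch.nn.Linear(64, 10)

x = torch.randn(8, 64, requires_grad=True)

# stage1's intermediate activations are not stored;
# they are recomputed when backward() reaches this segment
h = checkpoint(stage1, x, use_reentrant=False)
out = stage2(h)
out.sum().backward()  # gradients flow as usual, at the cost of extra compute
```

Relatedly, with pin_memory=True on a DataLoader, calling .to(device, non_blocking=True) lets the host-to-device copy overlap with computation.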
Profiling
Python - PyTorch Profiler
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True,
) as prof:
    output = model(input_tensor)

print(prof.key_averages().table(sort_by="cuda_time_total"))

# Export for Chrome trace viewer or TensorBoard
prof.export_chrome_trace("trace.json")
Key takeaway: Use mixed precision for ~2x speedup, torch.compile() for automatic optimization, and the PyTorch profiler to identify bottlenecks. Pin memory in DataLoaders and use gradient checkpointing when memory is tight.
Lilly Tech Systems