Intermediate

PyTorch GPU Acceleration

PyTorch provides seamless GPU acceleration. Learn to move data, use mixed precision, compile models, and profile performance.

Moving to GPU

Python - Basic GPU Usage
import torch

# Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move tensors to GPU
x = torch.randn(1000, 1000, device=device)
y = torch.randn(1000, 1000, device=device)
z = x @ y  # Matrix multiply on GPU

# Move model to GPU (MyModel: any nn.Module)
model = MyModel().to(device)

# Training loop: move each batch to the model's device
for inputs, labels in dataloader:
    inputs = inputs.to(device)
    labels = labels.to(device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
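
The snippet above assumes MyModel, dataloader, criterion, and optimizer already exist. A minimal runnable sketch with stand-ins for those names (a tiny linear model and synthetic data, chosen purely for illustration) looks like:

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for MyModel / criterion / optimizer / dataloader
model = nn.Linear(10, 2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = torch.utils.data.TensorDataset(
    torch.randn(64, 10), torch.randint(0, 2, (64,))
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16)

for inputs, labels in dataloader:
    # Move each batch to the same device as the model
    inputs, labels = inputs.to(device), labels.to(device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The same code runs unchanged on CPU or GPU because every tensor is created on (or moved to) the shared `device` object.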

Mixed Precision Training

Mixed precision runs most operations in FP16 while keeping numerically sensitive ones (such as reductions and the loss computation) in FP32, typically giving up to ~2x speedup and roughly halving activation memory:

Python - Automatic Mixed Precision
from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")

for inputs, labels in dataloader:
    inputs = inputs.to(device)
    labels = labels.to(device)

    # Forward pass under autocast (FP16 where safe, FP32 elsewhere)
    with autocast(device_type="cuda"):
        outputs = model(inputs)
        loss = criterion(outputs, labels)

    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

torch.compile

PyTorch 2.0 introduced torch.compile(), which JIT-compiles your model for significant speedups:

Python - torch.compile
# One-line speedup: compiles and optimizes the model
model = torch.compile(model)

# Modes: "default", "reduce-overhead", "max-autotune"
model = torch.compile(model, mode="max-autotune")

# First call is slow (compilation), subsequent calls are fast
output = model(input_tensor)  # Compiles here
output = model(input_tensor)  # Fast!
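
The compile-once, run-fast pattern can be sketched end to end. The backend="eager" argument below is an assumption made here so the sketch runs without a GPU or compiler toolchain (it captures the graph via TorchDynamo but skips code generation); in real training you would use the default inductor backend, optionally with the modes shown above:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

# backend="eager" captures the graph without generating optimized code,
# so this sketch is portable; drop the argument for real speedups
compiled = torch.compile(model, backend="eager")

x = torch.randn(8, 16)
out = compiled(x)  # first call triggers graph capture (slow)
out = compiled(x)  # subsequent calls reuse the captured graph
```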

GPU Memory Management

  • Monitor usage: torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated()
  • Gradient checkpointing: Trade compute for memory by recomputing activations during backward pass
  • Clear cache: torch.cuda.empty_cache() releases cached, unused memory back to the GPU driver (it does not free tensors PyTorch still references)
  • Pin memory: Use pin_memory=True in DataLoader for faster CPU-to-GPU transfers
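
A minimal sketch combining these techniques (the tiny layer and batch sizes are illustrative only; gradient checkpointing uses torch.utils.checkpoint, and the CUDA memory queries are guarded so the sketch also runs on CPU):

```python
import torch
from torch.utils.checkpoint import checkpoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Monitor usage (CUDA only)
if device.type == "cuda":
    print(f"allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    print(f"peak:      {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")

# Gradient checkpointing: activations inside `block` are recomputed
# during the backward pass instead of being stored in the forward pass
block = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.ReLU(), torch.nn.Linear(128, 128)
).to(device)
x = torch.randn(32, 128, device=device, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()

# Pinned host memory enables faster, asynchronous CPU-to-GPU copies
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 128)),
    batch_size=16,
    pin_memory=(device.type == "cuda"),
)
```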

Profiling

Python - PyTorch Profiler
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True,
) as prof:
    output = model(input_tensor)

print(prof.key_averages().table(sort_by="cuda_time_total"))
# Export for Chrome trace viewer or TensorBoard
prof.export_chrome_trace("trace.json")
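
One caveat when checking profiler results against manual measurements: CUDA kernels launch asynchronously, so wall-clock timing must synchronize the device before reading the clock or it measures only launch overhead. A minimal sketch (sizes and iteration count are arbitrary):

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)

# Synchronize before and after the timed region so the clock
# brackets the actual GPU work, not just the kernel launches
if device.type == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10):
    y = x @ x
if device.type == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"10 matmuls: {elapsed * 1e3:.2f} ms")
```
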

Key takeaway: Use mixed precision for ~2x speedup, torch.compile() for automatic optimization, and the PyTorch profiler to identify bottlenecks. Pin memory in DataLoaders and use gradient checkpointing when memory is tight.