Advanced
Multi-GPU Training
When one GPU isn't enough, distribute training across multiple GPUs to train larger models and process more data in parallel.
DataParallel vs DistributedDataParallel
| Feature | DataParallel (DP) | DistributedDataParallel (DDP) |
|---|---|---|
| Architecture | Single process, multiple threads | Multiple processes, one per GPU |
| Communication | GPU 0 bottleneck (gather/scatter) | AllReduce across all GPUs (balanced) |
| Performance | Poor scaling due to GIL | Near-linear scaling |
| Multi-node | No | Yes |
| Recommendation | Avoid | Always use DDP |
Setting Up DDP
Python - DistributedDataParallel

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train(rank, world_size):
    setup(rank, world_size)
    model = MyModel().to(rank)
    model = DDP(model, device_ids=[rank])

    # Each GPU gets a different data shard
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for inputs, labels in dataloader:
            outputs = model(inputs.to(rank))
            loss = criterion(outputs, labels.to(rank))
            loss.backward()  # Gradients synced via AllReduce
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

# Launch: torchrun --nproc_per_node=4 train.py
```
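When launched with `torchrun`, each process receives its rank and world size through environment variables rather than explicit arguments. A minimal helper for reading them (the name `read_dist_env` is illustrative, not a PyTorch API):

```python
import os

def read_dist_env():
    """Read the environment variables torchrun sets for each process.

    Defaults fall back to single-process values so the script also runs
    without torchrun. Helper name is illustrative, not part of PyTorch.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),              # global rank across all nodes
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # GPU index on this node
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total number of processes
    }
```

With `torchrun --nproc_per_node=4`, the four processes see `LOCAL_RANK` values 0 through 3 and can pass them to `setup()` directly.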
GPU Interconnects
- PCIe: Standard connection, ~32 GB/s per direction (PCIe 4.0 x16). Sufficient for 2-4 GPUs.
- NVLink: High-bandwidth GPU-to-GPU link, ~900 GB/s on H100. Essential for model parallelism.
- NVSwitch: Connects all GPUs in a node with full NVLink bandwidth. Available in DGX systems.
- InfiniBand: High-bandwidth, low-latency networking for multi-node training, ~400 Gb/s.
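These bandwidth figures translate directly into gradient-sync cost. A rough back-of-envelope sketch, assuming a ring AllReduce (which moves about 2·(N−1)/N of the gradient bytes over each link) and ignoring latency and compute overlap; the function and numbers below are illustrative estimates, not measurements:

```python
def allreduce_time_ms(params, bytes_per_param, bandwidth_gbs, num_gpus):
    """Estimate ring-AllReduce time for one gradient sync.

    Ring AllReduce moves roughly 2 * (N - 1) / N of the tensor bytes over
    each link, so time ~= bytes_moved / link_bandwidth. Latency and
    compute overlap are ignored -- a lower bound for intuition only.
    """
    grad_bytes = params * bytes_per_param
    bytes_moved = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return bytes_moved / (bandwidth_gbs * 1e9) * 1e3  # milliseconds

# 1B-parameter model, fp16 gradients (2 bytes/param), 4 GPUs
pcie = allreduce_time_ms(1e9, 2, 32, 4)     # PCIe 4.0 x16: ~32 GB/s -> ~94 ms
nvlink = allreduce_time_ms(1e9, 2, 900, 4)  # H100 NVLink: ~900 GB/s -> ~3.3 ms
```

The ~30x gap is why NVLink matters once gradient sync stops hiding behind compute.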
NCCL Communication
NCCL (NVIDIA Collective Communications Library) provides optimized collective operations for multi-GPU communication:
- AllReduce: Sum gradients across all GPUs and distribute the result — the core operation for data parallelism
- AllGather: Collect data from all GPUs to all GPUs
- ReduceScatter: Reduce and distribute different chunks to different GPUs — used in FSDP/ZeRO
- Broadcast: Send data from one GPU to all others
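The semantics of these collectives can be sketched in plain Python over lists, with one list per simulated GPU. This is purely illustrative of what each rank ends up holding, not how NCCL implements them:

```python
def all_reduce(per_gpu):
    """AllReduce: every rank ends up with the elementwise sum."""
    total = [sum(vals) for vals in zip(*per_gpu)]
    return [list(total) for _ in per_gpu]

def all_gather(per_gpu):
    """AllGather: every rank ends up with everyone's data, concatenated."""
    gathered = [x for chunk in per_gpu for x in chunk]
    return [list(gathered) for _ in per_gpu]

def reduce_scatter(per_gpu):
    """ReduceScatter: rank i receives only the i-th chunk of the sum."""
    total = [sum(vals) for vals in zip(*per_gpu)]
    chunk = len(total) // len(per_gpu)
    return [total[i * chunk:(i + 1) * chunk] for i in range(len(per_gpu))]

# Two "GPUs", each holding a 2-element gradient
grads = [[1.0, 2.0], [3.0, 4.0]]
print(all_reduce(grads))      # every rank holds [4.0, 6.0]
print(reduce_scatter(grads))  # rank 0 holds [4.0], rank 1 holds [6.0]
```

Note that ReduceScatter followed by AllGather reproduces AllReduce, which is exactly how FSDP/ZeRO splits the work across sharded parameters.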
Key takeaway: Always use DistributedDataParallel over DataParallel. DDP provides near-linear scaling by running one process per GPU and using NCCL AllReduce for gradient synchronization. Use torchrun to launch distributed training.
Lilly Tech Systems