Advanced

Multi-GPU Training

When one GPU isn't enough, distribute training across multiple GPUs to train larger models and process more data in parallel.

DataParallel vs DistributedDataParallel

Feature         DataParallel (DP)                   DistributedDataParallel (DDP)
--------------  ----------------------------------  -------------------------------------
Architecture    Single process, multiple threads    Multiple processes, one per GPU
Communication   GPU 0 bottleneck (gather/scatter)   AllReduce across all GPUs (balanced)
Performance     Poor scaling due to GIL             Near-linear scaling
Multi-node      No                                  Yes
Recommendation  Avoid                               Always use DDP
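
What DDP's gradient synchronization does can be sketched conceptually in plain Python, with no GPUs or torch involved (the per-replica gradient values below are invented for illustration): each process computes local gradients on its own data shard, AllReduce sums them across replicas, and every replica applies the identical averaged gradient.

```python
# Conceptual sketch of DDP gradient averaging (CPU-only, illustrative numbers).

def allreduce_mean(grads_per_replica):
    """Sum corresponding gradient entries across replicas, then average."""
    world_size = len(grads_per_replica)
    summed = [sum(vals) for vals in zip(*grads_per_replica)]
    return [g / world_size for g in summed]

# 4 replicas, each holding a local gradient vector for the same 3 parameters
local_grads = [
    [0.4, 0.0, 1.2],
    [0.2, 0.4, 0.8],
    [0.6, 0.4, 1.0],
    [0.0, 0.8, 1.0],
]

avg = allreduce_mean(local_grads)
print(avg)  # every replica ends up with the same averaged gradient
```

Because every replica applies the same averaged gradient, all model copies stay bit-identical after each step, which is what lets DDP behave like large-batch single-GPU training.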

Setting Up DDP

Python - DistributedDataParallel
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train(rank, world_size):
    setup(rank, world_size)

    # MyModel, dataset, criterion, optimizer, num_epochs: defined elsewhere
    model = MyModel().to(rank)
    model = DDP(model, device_ids=[rank])

    # Each GPU gets a different data shard
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for inputs, labels in dataloader:
            outputs = model(inputs.to(rank))
            loss = criterion(outputs, labels.to(rank))
            optimizer.zero_grad()
            loss.backward()  # Gradients synced via AllReduce
            optimizer.step()

    dist.destroy_process_group()

# Launch: torchrun --nproc_per_node=4 train.py
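
When launched with torchrun, each process receives its identity through environment variables rather than function arguments. A short sketch of reading them (the variable names are the ones torchrun actually sets; the defaults here are just so the snippet also runs standalone):

```python
import os

# torchrun sets these for every process it launches:
#   RANK        - global rank across all nodes
#   LOCAL_RANK  - rank within this node (used to pick the GPU)
#   WORLD_SIZE  - total number of processes
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
```

With torchrun, `dist.init_process_group("nccl")` can also be called with no rank/world_size arguments; the default env:// init method reads these variables itself.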

GPU Interconnects

  • PCIe: Standard connection, ~32 GB/s per direction (PCIe 4.0 x16). Sufficient for 2-4 GPUs.
  • NVLink: High-bandwidth GPU-to-GPU link, ~900 GB/s on H100. Essential for model parallelism.
  • NVSwitch: Connects all GPUs in a node with full NVLink bandwidth. Available in DGX systems.
  • InfiniBand: High-bandwidth, low-latency networking for multi-node training, ~400 Gb/s.
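
To see why interconnect bandwidth matters, consider a back-of-the-envelope estimate for one ring AllReduce of a 7B-parameter model's fp16 gradients, using the rough per-link figures quoted above (illustrative only; real runs overlap communication with backward compute):

```python
# Ring AllReduce moves 2*(N-1)/N * S bytes per GPU for a payload of S bytes.

def ring_allreduce_seconds(payload_bytes, num_gpus, link_gbytes_per_s):
    bytes_moved = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return bytes_moved / (link_gbytes_per_s * 1e9)

payload = 7e9 * 2  # 7B parameters, fp16 = 2 bytes each
n = 8              # GPUs in the ring

for name, bw in [("PCIe 4.0 x16", 32), ("NVLink (H100)", 900)]:
    t = ring_allreduce_seconds(payload, n, bw)
    print(f"{name}: ~{t * 1000:.0f} ms per AllReduce")
```

On these assumptions the sync takes hundreds of milliseconds over PCIe but tens of milliseconds over NVLink, which is the gap that makes NVLink essential once communication can no longer hide behind compute.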

NCCL Communication

NCCL (NVIDIA Collective Communications Library) provides optimized collective operations for multi-GPU communication:

  • AllReduce: Sum gradients across all GPUs and distribute the result — the core operation for data parallelism
  • AllGather: Collect data from all GPUs to all GPUs
  • ReduceScatter: Reduce and distribute different chunks to different GPUs — used in FSDP/ZeRO
  • Broadcast: Send data from one GPU to all others
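
These primitives compose: a common AllReduce implementation is a ReduceScatter followed by an AllGather. A CPU-only simulation of that decomposition, with plain lists standing in for GPU buffers:

```python
# Simulate AllReduce as ReduceScatter + AllGather over N "GPUs".

def reduce_scatter(buffers):
    """GPU i ends up owning the reduced (summed) chunk i of the buffer."""
    n = len(buffers)
    chunk = len(buffers[0]) // n
    return [
        [sum(buf[i * chunk + j] for buf in buffers) for j in range(chunk)]
        for i in range(n)
    ]

def all_gather(owned_chunks):
    """Every GPU collects all reduced chunks into the full reduced buffer."""
    full = [x for chunk in owned_chunks for x in chunk]
    return [list(full) for _ in owned_chunks]

buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]  # 2 GPUs, 4 elements each
result = all_gather(reduce_scatter(buffers))
print(result)  # both GPUs now hold [11, 22, 33, 44]
```

This decomposition is also why FSDP/ZeRO pair naturally with ReduceScatter: each rank only needs the reduced gradient chunk for the parameters it owns, so the final AllGather can be skipped.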

Key takeaway: Always use DistributedDataParallel over DataParallel. DDP provides near-linear scaling by running one process per GPU and using NCCL AllReduce for gradient synchronization. Use torchrun to launch distributed training.