Advanced

Multi-GPU Training

When one GPU isn't enough, distribute training across multiple GPUs to train larger models and process more data in parallel.

DataParallel vs DistributedDataParallel

Feature         DataParallel (DP)                   DistributedDataParallel (DDP)
--------------  ----------------------------------  -------------------------------------
Architecture    Single process, multiple threads    Multiple processes, one per GPU
Communication   GPU 0 bottleneck (gather/scatter)   AllReduce across all GPUs (balanced)
Performance     Poor scaling due to GIL             Near-linear scaling
Multi-node      No                                  Yes
Recommendation  Avoid                               Always use DDP
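
What DDP's gradient synchronization does can be sketched conceptually in plain Python, with no GPUs or torch involved (the per-replica gradient values below are invented for illustration): each process computes local gradients on its own data shard, AllReduce sums them across replicas, and every replica applies the identical averaged gradient.

```python
# Conceptual sketch of DDP gradient averaging (CPU-only, illustrative numbers).

def allreduce_mean(grads_per_replica):
    """Sum corresponding gradient entries across replicas, then average."""
    world_size = len(grads_per_replica)
    summed = [sum(vals) for vals in zip(*grads_per_replica)]
    return [g / world_size for g in summed]

# 4 replicas, each holding a local gradient vector for the same 3 parameters
local_grads = [
    [0.4, 0.0, 1.2],
    [0.2, 0.4, 0.8],
    [0.6, 0.4, 1.0],
    [0.0, 0.8, 1.0],
]

avg = allreduce_mean(local_grads)
print(avg)  # every replica ends up with the same averaged gradient
```

Because every replica applies the same averaged gradient, all model copies stay bit-identical after each step, which is what lets DDP behave like large-batch single-GPU training.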

Setting Up DDP

Python - DistributedDataParallel
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train(rank, world_size):
    setup(rank, world_size)

    # MyModel, dataset, criterion, optimizer, num_epochs: defined elsewhere
    model = MyModel().to(rank)
    model = DDP(model, device_ids=[rank])

    # Each GPU gets a different data shard
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for inputs, labels in dataloader:
            outputs = model(inputs.to(rank))
            loss = criterion(outputs, labels.to(rank))
            optimizer.zero_grad()
            loss.backward()  # Gradients synced via AllReduce
            optimizer.step()

    dist.destroy_process_group()

# Launch: torchrun --nproc_per_node=4 train.py
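
When launched with torchrun, each process receives its identity through environment variables rather than function arguments. A short sketch of reading them (the variable names are the ones torchrun actually sets; the defaults here are just so the snippet also runs standalone):

```python
import os

# torchrun sets these for every process it launches:
#   RANK        - global rank across all nodes
#   LOCAL_RANK  - rank within this node (used to pick the GPU)
#   WORLD_SIZE  - total number of processes
rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
```

With torchrun, `dist.init_process_group("nccl")` can also be called with no rank/world_size arguments; the default env:// init method reads these variables itself.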

GPU Interconnects

  • PCIe: Standard connection, ~32 GB/s per direction (PCIe 4.0 x16). Sufficient for 2-4 GPUs.
  • NVLink: High-bandwidth GPU-to-GPU link, ~900 GB/s on H100. Essential for model parallelism.
  • NVSwitch: Connects all GPUs in a node with full NVLink bandwidth. Available in DGX systems.
  • InfiniBand: High-bandwidth, low-latency networking for multi-node training, ~400 Gb/s.
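
To see why interconnect bandwidth matters, consider a back-of-the-envelope estimate for one ring AllReduce of a 7B-parameter model's fp16 gradients, using the rough per-link figures quoted above (illustrative only; real runs overlap communication with backward compute):

```python
# Ring AllReduce moves 2*(N-1)/N * S bytes per GPU for a payload of S bytes.

def ring_allreduce_seconds(payload_bytes, num_gpus, link_gbytes_per_s):
    bytes_moved = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return bytes_moved / (link_gbytes_per_s * 1e9)

payload = 7e9 * 2  # 7B parameters, fp16 = 2 bytes each
n = 8              # GPUs in the ring

for name, bw in [("PCIe 4.0 x16", 32), ("NVLink (H100)", 900)]:
    t = ring_allreduce_seconds(payload, n, bw)
    print(f"{name}: ~{t * 1000:.0f} ms per AllReduce")
```

On these assumptions the sync takes hundreds of milliseconds over PCIe but tens of milliseconds over NVLink, which is the gap that makes NVLink essential once communication can no longer hide behind compute.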

NCCL Communication

NCCL (NVIDIA Collective Communications Library) provides optimized collective operations for multi-GPU communication:

  • AllReduce: Sum gradients across all GPUs and distribute the result — the core operation for data parallelism
  • AllGather: Collect data from all GPUs to all GPUs
  • ReduceScatter: Reduce and distribute different chunks to different GPUs — used in FSDP/ZeRO
  • Broadcast: Send data from one GPU to all others
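
These primitives compose: a common AllReduce implementation is a ReduceScatter followed by an AllGather. A CPU-only simulation of that decomposition, with plain lists standing in for GPU buffers:

```python
# Simulate AllReduce as ReduceScatter + AllGather over N "GPUs".

def reduce_scatter(buffers):
    """GPU i ends up owning the reduced (summed) chunk i of the buffer."""
    n = len(buffers)
    chunk = len(buffers[0]) // n
    return [
        [sum(buf[i * chunk + j] for buf in buffers) for j in range(chunk)]
        for i in range(n)
    ]

def all_gather(owned_chunks):
    """Every GPU collects all reduced chunks into the full reduced buffer."""
    full = [x for chunk in owned_chunks for x in chunk]
    return [list(full) for _ in owned_chunks]

buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]  # 2 GPUs, 4 elements each
result = all_gather(reduce_scatter(buffers))
print(result)  # both GPUs now hold [11, 22, 33, 44]
```

This decomposition is also why FSDP/ZeRO pair naturally with ReduceScatter: each rank only needs the reduced gradient chunk for the parameters it owns, so the final AllGather can be skipped.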

Key takeaway: Always use DistributedDataParallel over DataParallel. DDP provides near-linear scaling by running one process per GPU and using NCCL AllReduce for gradient synchronization. Use torchrun to launch distributed training.