Data Parallelism
Data parallelism is the simplest and most common form of distributed training: replicate the model on each GPU and split the training data across them.
How It Works
Replicate
Copy the entire model to each GPU. All replicas start with identical weights.
Partition Data
Split each training batch across GPUs. With 4 GPUs and batch size 256, each GPU processes 64 samples.
Forward Pass
Each GPU computes the forward pass on its local data shard independently.
Backward Pass
Each GPU computes gradients on its local data. Gradients differ because the data differs.
AllReduce
Average gradients across all GPUs using AllReduce. After this, all GPUs have identical averaged gradients.
Update
Each GPU applies the same averaged gradients, keeping all model replicas synchronized.
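The six steps above can be sketched end-to-end in plain Python, with no GPUs or frameworks. This is an illustrative toy, not real API: a one-parameter model y = w*x with squared-error loss, and the "AllReduce" is a plain average.

```python
def local_gradient(w, shard):
    # dL/dw for L = mean((w*x - y)**2) over this GPU's data shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_gpus, lr=0.5):
    # Steps 1-2: replicate the weight and partition the batch
    replicas = [w] * num_gpus
    shard = len(batch) // num_gpus
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_gpus)]
    # Steps 3-4: each replica computes gradients on its own shard;
    # the gradients differ because the data differs
    grads = [local_gradient(replicas[i], shards[i]) for i in range(num_gpus)]
    # Step 5: "AllReduce" -- every replica gets the same averaged gradient
    avg_grad = sum(grads) / num_gpus
    # Step 6: the identical update keeps all replicas synchronized
    return [w_r - lr * avg_grad for w_r in replicas]

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
replicas = data_parallel_step(1.0, batch, num_gpus=2)
assert all(r == replicas[0] for r in replicas)  # replicas stay in sync
```

The key invariant: replicas only stay identical because every one of them applies the same averaged gradient.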
PyTorch DDP
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
    torch.cuda.set_device(local_rank)

    # Model: same initialization on all ranks
    model = MyModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Data: each rank gets a unique shard
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Different shuffle each epoch
        for batch in loader:
            inputs = batch["input"].cuda(local_rank)
            labels = batch["label"].cuda(local_rank)
            loss = model(inputs, labels)
            loss.backward()  # Gradients synced automatically via AllReduce
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

# Launch: torchrun --nproc_per_node=4 train.py
```
Ring AllReduce
The most common AllReduce implementation uses a ring topology:
- GPUs are arranged in a logical ring
- Each GPU sends a chunk of its gradients to the next GPU in the ring
- After 2(N-1) steps (N = number of GPUs), all GPUs have the complete sum: N-1 reduce-scatter steps build the partial sums, then N-1 allgather steps distribute them
- Per-GPU communication volume is roughly 2x the model size, nearly independent of GPU count, which makes the ring bandwidth-efficient at scale
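A pure-Python simulation of the two phases (reduce-scatter, then allgather) makes the data movement concrete. Each "chunk" is a single number here for simplicity, and all names are illustrative:

```python
def ring_allreduce(vectors):
    """Sum-AllReduce over n workers; vectors[r] is worker r's gradient,
    pre-split into n chunks (one scalar per chunk for simplicity)."""
    n = len(vectors)
    chunks = [list(v) for v in vectors]
    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) % n
    # to its ring neighbor, which adds it to its own copy. After n-1 steps,
    # each worker holds the complete sum for exactly one chunk.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            chunks[(r + 1) % n][c] += chunks[r][c]
    # Phase 2: allgather. The completed chunks circulate around the ring
    # until every worker has every fully reduced chunk.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            chunks[(r + 1) % n][c] = chunks[r][c]
    return chunks

result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
assert result == [[12, 15, 18]] * 3  # every worker holds the full sum
```

Each step moves only 1/N of the gradient per worker, which is why the total traffic per GPU stays close to 2x the model size regardless of N.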
Effective Batch Size
With data parallelism, the effective batch size scales with the number of GPUs:
- Per-GPU batch size: 64 samples
- 4 GPUs: Effective batch size = 256
- Learning rate scaling: linear scaling rule, i.e., multiply the single-GPU learning rate by the number of GPUs
- Warmup: Use learning rate warmup (1-5 epochs) when scaling to large batch sizes to maintain training stability
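As a sketch, the two rules combine as follows (the base LR, warmup length, and helper name are illustrative, not from any framework):

```python
base_lr = 1e-4                    # LR tuned for a single GPU
num_gpus = 4
scaled_lr = base_lr * num_gpus    # linear scaling rule: 4e-4 for 4 GPUs

def lr_at(step, warmup_steps=1000):
    # Linear warmup from near 0 up to the scaled LR, then hold constant;
    # warmup avoids instability from taking large-batch steps at full LR
    if step < warmup_steps:
        return scaled_lr * (step + 1) / warmup_steps
    return scaled_lr

assert lr_at(0) < lr_at(500) < lr_at(999) == scaled_lr
```

In practice the post-warmup schedule is usually a decay (cosine, step, etc.) rather than a constant, but the warmup ramp is the part that matters for large-batch stability.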
Gradient Accumulation
When per-GPU batch size is limited by memory, gradient accumulation simulates larger batches:
```python
accumulation_steps = 4  # Effective batch = 4 * per_gpu_batch * num_gpus

for i, batch in enumerate(loader):
    loss = model(batch) / accumulation_steps  # Scale so gradients average correctly
    loss.backward()                           # Gradients accumulate in .grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
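A quick pure-Python check (no framework; the toy loss and data are illustrative) that accumulating the scaled micro-batch gradients reproduces the full-batch gradient:

```python
def grad(w, data):
    # dL/dw for L = mean((w*x - y)**2)
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

w = 1.0
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
accumulation_steps = 2
micro = len(data) // accumulation_steps

accumulated = 0.0
for i in range(accumulation_steps):
    shard = data[i * micro:(i + 1) * micro]
    # Dividing the loss by accumulation_steps divides its gradient too
    accumulated += grad(w, shard) / accumulation_steps

full = grad(w, data)
assert abs(accumulated - full) < 1e-12  # same gradient as one big batch
```

This equivalence holds exactly for mean-reduced losses with equal-size micro-batches; a drifting batch-size remainder or sum-reduced loss would break it.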