Beginner

Introduction to Distributed Training

Modern AI models are too large and data-hungry to train on a single GPU. Distributed training spreads the work across multiple GPUs and nodes to make large-scale training practical.

Why Distribute?

There are two fundamental reasons to distribute training:

  • Speed: Training GPT-4-scale models on a single GPU would take decades. With thousands of GPUs working in parallel, it takes weeks.
  • Memory: A 70B parameter model in FP16 needs ~140 GB just for weights. With optimizer states and gradients, that's ~1 TB — far beyond any single GPU.
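The arithmetic behind those figures can be sketched in a few lines. The per-parameter byte counts below assume FP16 mixed-precision training with Adam (FP16 weights and gradients, FP32 master weights, two FP32 optimizer moments); they are a common accounting convention, not something the text specifies.

```python
# Back-of-envelope memory estimate for a 70B-parameter model.
# Assumed byte counts per parameter (FP16 mixed precision + Adam):
#   2 (FP16 weights) + 2 (FP16 grads) + 4 (FP32 master weights)
#   + 4 + 4 (Adam first and second moments) = 16 bytes/param
params = 70e9

weights_gb = params * 2 / 1e9               # FP16 weights alone: ~140 GB
total_gb = params * (2 + 2 + 4 + 4 + 4) / 1e9  # full training state: ~1.1 TB

print(round(weights_gb), "GB weights,", round(total_gb), "GB total")
```

Even before activations are counted, the training state alone is an order of magnitude beyond a single GPU's memory.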

Scaling Laws

Research by Kaplan et al. and Hoffmann et al. (Chinchilla) showed that model performance follows predictable scaling laws:

  • Performance improves as a power law with model size, data, and compute
  • Optimal training balances model size with data size (Chinchilla scaling)
  • This drives the need for ever-larger compute budgets, which requires distributed training
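A commonly quoted rule of thumb from the Chinchilla result is roughly 20 training tokens per model parameter; the exact ratio depends on the compute budget, so treat the constant below as an approximation, not a value from the text.

```python
# Chinchilla-style rule of thumb: compute-optimal training uses roughly
# ~20 tokens per parameter (approximate; the paper fits this from data).
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rough compute-optimal token count for a model of n_params parameters."""
    return n_params * tokens_per_param

# A 70B-parameter model would want on the order of 1.4 trillion tokens.
print(chinchilla_optimal_tokens(70e9) / 1e12, "trillion tokens")
```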

Types of Parallelism

  • Data Parallelism: splits the training data across GPUs. Best when the model fits on one GPU and you need faster training.
  • Model (Tensor) Parallelism: splits individual layer weights across GPUs. Best when single layers are too large for one GPU.
  • Pipeline Parallelism: splits the model's layers across GPUs. Best when the model has many layers that map onto pipeline stages.
  • ZeRO / FSDP: shards optimizer states, gradients, and weights across GPUs. Best for memory-efficient data parallelism.
  • Expert Parallelism: splits MoE experts across GPUs. Best for Mixture-of-Experts models.
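The core idea of data parallelism can be shown without any GPUs: each worker computes gradients on its own shard of data, then an AllReduce averages them so every replica applies the same update. This is a pure-Python stand-in for what NCCL does in hardware.

```python
# Toy sketch of data-parallel gradient averaging (the AllReduce step).
# Each inner list is one worker's gradient vector for the same parameters.
def allreduce_mean(per_worker_grads):
    """Average gradients elementwise; every worker receives the result."""
    n = len(per_worker_grads)
    avg = [sum(g[i] for g in per_worker_grads) / n
           for i in range(len(per_worker_grads[0]))]
    return [list(avg) for _ in range(n)]  # each worker holds an identical copy

grads = [[1.0, 2.0], [3.0, 4.0]]   # gradients from 2 workers
print(allreduce_mean(grads))       # both workers now see [2.0, 3.0]
```

After this step every replica holds identical averaged gradients, which is what keeps the model copies in sync.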

Communication Overhead

The fundamental challenge of distributed training is communication. GPUs must share information (gradients, activations) during training:

  • AllReduce: Sum gradients from all GPUs and distribute the result. Cost scales with model size, not GPU count (with ring AllReduce).
  • Point-to-point: Send activations between pipeline stages. Latency-sensitive.
  • Bandwidth: NVLink (~900 GB/s) >> InfiniBand (~400 Gb/s, i.e. ~50 GB/s) >> Ethernet (~100 Gb/s, ~12.5 GB/s). Note the mixed units: gigabytes versus gigabits per second. Interconnect speed is often the bottleneck.
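The claim that ring AllReduce cost scales with model size rather than GPU count follows from the standard ring analysis: each GPU sends 2(N-1)/N times the payload, which approaches 2x the payload as N grows. The formula is the textbook result, not something stated in this section.

```python
# Per-GPU bytes sent in a ring AllReduce: 2 * (N - 1) / N * payload.
# As N grows this approaches 2x the payload, so communication cost is
# governed by model size, not by the number of GPUs.
def ring_allreduce_bytes_per_gpu(payload_bytes: float, n_gpus: int) -> float:
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

grad_bytes = 140e9  # FP16 gradients of a 70B-parameter model (~140 GB)
for n in (8, 64, 1024):
    print(n, "GPUs:", ring_allreduce_bytes_per_gpu(grad_bytes, n) / 1e9, "GB per GPU")
```

Going from 8 to 1024 GPUs increases per-GPU traffic by only ~14%, which is why ring AllReduce scales well.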

Key Tools

  1. PyTorch DDP

    Built-in data parallelism. The simplest starting point for multi-GPU training.

  2. PyTorch FSDP

    Fully Sharded Data Parallel. Shards model parameters across GPUs for memory efficiency.

  3. DeepSpeed

    Microsoft's library with ZeRO optimizer stages, offloading, and advanced optimizations.

  4. Megatron-LM

    NVIDIA's framework for tensor and pipeline parallelism in LLM training.

  5. torchrun / accelerate

    Launch utilities that handle process group setup and distributed initialization.
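A minimal DDP training loop, the first tool above, looks roughly like the sketch below. The model and data are toy placeholders; it defaults to a single CPU process with the gloo backend so it runs standalone, whereas a real job would use `torchrun` (which sets the environment variables itself) and the `nccl` backend on GPUs.

```python
# Minimal PyTorch DDP sketch. Normally launched via:
#   torchrun --nproc_per_node=N train.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets these env vars; the defaults let the sketch run
    # as a single CPU process for demonstration.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group("gloo")      # use "nccl" on GPUs

    model = DDP(nn.Linear(10, 1))        # DDP AllReduces gradients automatically
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(5):                   # toy loop with random data
        x = torch.randn(32, 10)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                  # gradient AllReduce overlaps with backward
        opt.step()

    dist.destroy_process_group()
    return float(loss)

loss_value = main()
print("final loss:", loss_value)
```

The same loop scales to many GPUs unchanged: DDP hooks into `backward()` to average gradients across all processes, so each replica applies an identical update.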

Key takeaway: Distributed training is essential for modern AI. Start with data parallelism (DDP) for most use cases. As models grow beyond single-GPU memory, graduate to FSDP or DeepSpeed ZeRO. The key challenge is minimizing communication overhead.