Distributed Training Questions
Distributed training is the core competency for AI infrastructure engineers. These 10 questions cover data parallelism, model parallelism, communication primitives, modern frameworks, and fault tolerance — the topics that appear in virtually every AI infrastructure interview.
Q1: Explain data parallelism. How does it work and what are its limitations?
Answer: Data parallelism is the simplest distributed training strategy. Each GPU holds a complete copy of the model, and the training data is split across GPUs. Each GPU processes its mini-batch independently, then all GPUs synchronize gradients before updating weights.
Steps per iteration:
- Each GPU receives a different mini-batch of data
- Each GPU performs forward pass independently
- Each GPU computes gradients independently
- AllReduce: Average gradients across all GPUs
- Each GPU updates its local model weights with averaged gradients
Limitations:
- Model must fit on single GPU: Every GPU holds a full copy of model weights, gradients, and optimizer states. Mixed-precision training with Adam needs roughly 16 bytes per parameter (FP16 weights and gradients plus FP32 optimizer states), so a 70B model requires on the order of 1.1 TB, far exceeding single GPU memory.
- Communication overhead scales with model size: AllReduce must transfer gradient data proportional to model parameters. For large models, this becomes the bottleneck.
- Diminishing returns at scale: As you add more GPUs, the communication-to-computation ratio increases. At some point, GPUs spend more time waiting for gradient synchronization than computing.
- Batch size scaling: More GPUs = larger effective batch size. Very large batch sizes can degrade model quality unless you use learning rate warmup and scaling rules (linear scaling rule: double batch size = double learning rate).
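The per-iteration steps above can be simulated in a few lines of pure Python. This is a toy sketch with illustrative names (no real GPUs or collective library involved): averaging plays the role of AllReduce.

```python
# Toy simulation of one data-parallel training step.
# Each "GPU" holds a full copy of a small weight vector; gradients are
# averaged across replicas (a stand-in for AllReduce) before every
# replica applies the same update, keeping all copies in sync.

def local_gradient(weights, batch):
    # Illustrative gradient of a least-squares loss: grad = 2 * (w - x)
    return [2.0 * (w - x) for w, x in zip(weights, batch)]

def data_parallel_step(replicas, batches, lr=0.1):
    # 1-3. Each replica computes its gradient on its own mini-batch.
    grads = [local_gradient(w, b) for w, b in zip(replicas, batches)]
    # 4. "AllReduce": element-wise average of the gradients.
    n = len(replicas)
    avg = [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]
    # 5. Every replica applies the same averaged gradient, so all
    #    model copies stay identical.
    return [[w - lr * g for w, g in zip(rep, avg)] for rep in replicas]

replicas = [[0.0, 0.0] for _ in range(4)]   # 4 GPUs, identical init
batches = [[1.0, 2.0], [3.0, 2.0], [1.0, 4.0], [3.0, 4.0]]
new = data_parallel_step(replicas, batches)
assert all(r == new[0] for r in new)        # replicas remain in sync
```

The final assertion captures the key invariant of data parallelism: because every replica sees the same averaged gradient, the model copies never diverge.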
Q2: What is AllReduce and how does the ring AllReduce algorithm work?
Answer: AllReduce is a collective communication operation where each participant starts with a local value, and all participants end up with the global aggregate (typically sum or average). In distributed training, it is used to average gradients across all GPUs.
Ring AllReduce: The most common implementation. N GPUs are arranged in a logical ring. The algorithm proceeds in two phases:
Phase 1 — Reduce-Scatter: Each GPU splits its gradient buffer into N chunks. In N-1 steps, each GPU sends one chunk to the next GPU in the ring and receives one chunk from the previous GPU, accumulating (adding) the received data. After N-1 steps, each GPU has the fully reduced version of one chunk.
Phase 2 — AllGather: In N-1 more steps, each GPU sends its fully reduced chunk around the ring so every GPU ends up with all N fully reduced chunks.
Why ring AllReduce is efficient:
- Total data transferred per GPU: 2 x (N-1)/N x data_size. This approaches 2x data_size regardless of N — it scales well.
- Each GPU only communicates with its two neighbors, maximizing bandwidth utilization.
- The algorithm saturates the interconnect bandwidth, achieving near-theoretical throughput on NVLink.
Alternatives: Tree AllReduce (lower latency for small messages), recursive halving-doubling (good for power-of-2 GPU counts), and NCCL's adaptive algorithms that choose the best strategy based on message size and topology.
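The two phases can be simulated in pure Python over lists (a toy sketch; real implementations move GPU buffers via NCCL, and all sends in a step happen concurrently, which the snapshot of `sends` models):

```python
# Toy ring AllReduce. Each of the N "GPUs" starts with its own buffer of
# N chunks (one scalar per chunk here); afterwards every GPU holds the
# element-wise sum across all buffers.

def ring_allreduce(buffers):
    n = len(buffers)
    buf = [list(b) for b in buffers]
    # Phase 1 - reduce-scatter: in step s, GPU g sends chunk (g - s) mod n
    # to its right neighbor, which accumulates it into its own copy.
    for step in range(n - 1):
        sends = [((g + 1) % n, (g - step) % n, buf[g][(g - step) % n])
                 for g in range(n)]
        for dst, c, val in sends:
            buf[dst][c] += val
    # Now GPU g owns the fully reduced chunk (g + 1) mod n.
    # Phase 2 - allgather: circulate the reduced chunks around the ring
    # until every GPU has all of them.
    for step in range(n - 1):
        sends = [((g + 1) % n, (g + 1 - step) % n, buf[g][(g + 1 - step) % n])
                 for g in range(n)]
        for dst, c, val in sends:
            buf[dst][c] = val
    return buf

out = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
assert all(b == [12, 15, 18] for b in out)   # column sums everywhere
```

Counting transfers: each GPU sends (N-1) chunks of size data_size/N in each phase, which is exactly the 2 x (N-1)/N x data_size figure above.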
Q3: What is NCCL and why is it important for distributed training?
Answer: NCCL (NVIDIA Collective Communications Library, pronounced "nickel") is NVIDIA's optimized library for multi-GPU and multi-node collective communication. It provides AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter operations.
Why NCCL matters:
- Topology-aware: NCCL detects the GPU interconnect topology (NVLink, NVSwitch, PCIe, InfiniBand) and automatically selects the optimal communication algorithm. It knows that intra-node NVLink is faster than inter-node InfiniBand and routes accordingly.
- Hierarchical AllReduce: For multi-node training, NCCL first does intra-node AllReduce over NVLink, then inter-node AllReduce over InfiniBand, then broadcasts results within nodes. This minimizes expensive cross-network transfers.
- Overlap with computation: NCCL operations run as CUDA kernels on their own streams, which lets frameworks such as PyTorch DDP overlap gradient AllReduce with backward-pass computation instead of serializing the two.
- GDR (GPUDirect RDMA): Enables direct GPU-to-GPU data transfer across nodes without going through CPU memory. Reduces latency and increases bandwidth for inter-node communication.
Common NCCL debugging:
- NCCL_DEBUG=INFO: Enable verbose logging to see topology detection, algorithm selection, and ring/tree construction.
- NCCL_IB_DISABLE=1: Force TCP instead of InfiniBand for debugging network issues.
- NCCL_P2P_DISABLE=1: Disable peer-to-peer (NVLink) transfers for debugging inter-GPU communication.
- NCCL timeout errors usually indicate a straggler GPU (hardware fault, thermal throttling) or a network issue, not an NCCL bug.
Q4: Explain model parallelism: tensor parallelism vs pipeline parallelism.
Answer: Model parallelism splits the model across GPUs instead of replicating it. Two main approaches:
Tensor Parallelism (TP): Splits individual layers across GPUs. For a transformer, the attention and FFN weight matrices are partitioned column-wise or row-wise across GPUs. Each GPU computes a portion of the layer output, then GPUs exchange results via AllReduce/AllGather.
- Pro: Low memory per GPU (weights split evenly). Simple to implement with Megatron-LM patterns.
- Con: Requires high-bandwidth interconnect (NVLink) because every layer requires communication. Not practical across nodes with InfiniBand — too much latency.
- Typical usage: TP=2, 4, or 8 within a single NVLink-connected node.
Pipeline Parallelism (PP): Splits the model by layers across GPUs. GPU 0 gets layers 1–10, GPU 1 gets layers 11–20, etc. Micro-batches are pipelined through stages to minimize bubble overhead.
- Pro: Only sends activations between stages (smaller than AllReduce). Works across nodes with InfiniBand because communication is point-to-point and less frequent.
- Con: Pipeline bubble — GPUs are idle while waiting for their stage's input. Bubble ratio = (P-1)/(M+P-1) where P = pipeline stages, M = micro-batches. Mitigated by using many micro-batches.
- Typical usage: PP=2, 4, 8 across nodes.
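The bubble formula above is easy to check numerically. A small illustrative helper (not part of any framework API):

```python
# Pipeline bubble ratio (P-1)/(M+P-1): the fraction of each iteration
# the pipeline spends idle, for P stages and M micro-batches.

def bubble_ratio(stages, micro_batches):
    return (stages - 1) / (micro_batches + stages - 1)

# More micro-batches shrink the bubble:
print(round(bubble_ratio(4, 4), 3))    # 3/7  -> 0.429
print(round(bubble_ratio(4, 32), 3))   # 3/35 -> 0.086
```

This is why pipeline parallelism wants M >> P: with M = 8P the bubble is already under 12 percent.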
3D Parallelism (DP + TP + PP): Modern large model training uses all three. Example: training a 175B model on 512 GPUs with DP=64, TP=4, PP=2. TP within NVLink nodes, PP across nodes, DP across all groups.
Q5: What is DeepSpeed and how do its ZeRO optimization stages work?
Answer: DeepSpeed is Microsoft's open-source library for efficient distributed training. Its flagship feature is ZeRO (Zero Redundancy Optimizer), which eliminates memory redundancy in data parallelism.
The problem ZeRO solves: In standard data parallelism, every GPU stores a full copy of model parameters, gradients, and optimizer states. For a 7B model in mixed precision with Adam: 14 GB FP16 params + 14 GB FP16 gradients + 84 GB optimizer states (FP32 master weights, momentum, and variance, 12 bytes per parameter) = 112 GB per GPU. With 8 GPUs, total memory used = 896 GB, but unique data = 112 GB. That is 8x redundancy.
ZeRO stages:
- ZeRO-1 (Optimizer State Partitioning): Shards optimizer states across GPUs. Each GPU stores 1/N of optimizer states. Memory reduction: ~4x for Adam. Communication: same as standard data parallelism.
- ZeRO-2 (+ Gradient Partitioning): Also shards gradients. Each GPU only stores gradients for its optimizer state partition. Memory reduction: ~8x. Communication: slightly more than ZeRO-1 (reduce-scatter instead of AllReduce for gradients).
- ZeRO-3 (+ Parameter Partitioning): Also shards model parameters. GPUs gather parameters on-demand during forward and backward passes, then discard them. Memory reduction: linear with GPU count. Communication: significantly more — AllGather for params during forward and backward.
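The stage-by-stage savings can be estimated with a small helper. This sketch assumes the ZeRO paper's mixed-precision Adam accounting (2 bytes FP16 weights + 2 bytes FP16 gradients + 12 bytes FP32 optimizer states per parameter); the function name is illustrative, not a DeepSpeed API:

```python
# Per-GPU memory (GB) for a model of `params` parameters under each
# ZeRO stage. Each stage shards one more component across `n_gpus`:
# stage 1 shards optimizer states, stage 2 adds gradients, stage 3
# adds parameters.

def zero_memory_gb(params, n_gpus, stage=0):
    p, g, o = 2 * params, 2 * params, 12 * params   # bytes per component
    if stage >= 1:
        o /= n_gpus
    if stage >= 2:
        g /= n_gpus
    if stage >= 3:
        p /= n_gpus
    return (p + g + o) / 1e9

for s in range(4):
    print(f"ZeRO-{s}: {zero_memory_gb(7e9, 8, s):.2f} GB/GPU")
# Stage 0 needs 112 GB/GPU for a 7B model; stage 3 drops to 14 GB/GPU.
```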
ZeRO-Offload/Infinity: Extends ZeRO-3 by offloading optimizer states, gradients, or parameters to CPU memory or NVMe SSDs. Enables training a 1T parameter model on a single DGX node at the cost of slower training speed.
When to use each stage: ZeRO-1 for most cases (free memory savings, no speed penalty). ZeRO-2 when ZeRO-1 is insufficient. ZeRO-3 only when the model cannot fit any other way — the communication overhead is significant.
Q6: What is FSDP and how does it compare to DeepSpeed ZeRO?
Answer: FSDP (Fully Sharded Data Parallelism) is PyTorch's native implementation of ZeRO-3-style training. It shards model parameters, gradients, and optimizer states across GPUs and gathers them on-demand.
How FSDP works:
- Model is wrapped in FSDP, which flattens and shards parameters across GPUs
- During forward pass: AllGather to reconstruct full parameters for each FSDP unit, compute, then discard non-local parameters
- During backward pass: AllGather parameters again, compute gradients, ReduceScatter to shard gradients
- Each GPU updates only its shard of optimizer states and parameters
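The gather-compute-discard cycle can be mimicked with plain lists (a toy sketch; real FSDP shards flattened torch tensors, and all names here are illustrative):

```python
# Toy simulation of FSDP's shard -> AllGather -> discard cycle for one
# FSDP unit. Each "GPU" persistently stores only its shard; the full
# parameter buffer exists only transiently during compute.

def shard(flat_params, n_gpus):
    # Pad so the flat buffer divides evenly, then split into n shards.
    per = -(-len(flat_params) // n_gpus)               # ceil division
    padded = flat_params + [0.0] * (per * n_gpus - len(flat_params))
    return [padded[i * per:(i + 1) * per] for i in range(n_gpus)]

def forward_with_allgather(shards, n_real):
    # AllGather: every GPU reconstructs the full parameter buffer...
    full = [p for s in shards for p in s][:n_real]
    out = sum(full)        # stand-in for the unit's actual computation
    # ...then the non-local parameters are discarded again (here `full`
    # simply goes out of scope); only `shards` persist between calls.
    return out

params = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = shard(params, 4)            # each GPU keeps ceil(5/4) = 2 values
assert [len(s) for s in shards] == [2, 2, 2, 2]
```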
FSDP vs DeepSpeed ZeRO:
| Feature | PyTorch FSDP | DeepSpeed ZeRO |
|---|---|---|
| Integration | Native PyTorch, minimal code changes | Requires DeepSpeed launcher and config |
| Flexibility | ZeRO-3 equivalent only | ZeRO-1, 2, 3, Offload, Infinity |
| Sharding strategy | FULL_SHARD, SHARD_GRAD_OP, NO_SHARD | Stage 1, 2, 3 with fine-grained control |
| CPU offloading | Supported but less mature | Well-tested ZeRO-Offload and ZeRO-Infinity |
| Ecosystem | Works with any PyTorch code, HuggingFace native | Broader feature set: sparse attention, compression |
Recommendation: Use FSDP for pure PyTorch projects where simplicity matters. Use DeepSpeed when you need ZeRO-1/2 (less overhead than full sharding), CPU/NVMe offloading, or advanced features like sparse attention and communication compression.
Q7: How do you handle fault tolerance in large-scale distributed training?
Answer: At scale (hundreds to thousands of GPUs), hardware failures are expected, not exceptional. A 1,000-GPU cluster with 0.1% daily failure rate per GPU will see ~1 GPU failure per day. Fault tolerance must be designed into the system.
Checkpointing strategies:
- Periodic checkpointing: Save model state, optimizer state, and RNG state every N steps. Trade-off: frequent checkpointing (every 100 steps) wastes I/O bandwidth; infrequent (every 10,000 steps) wastes compute on recovery. Typical: every 500–1,000 steps or 15–30 minutes.
- Asynchronous checkpointing: Snapshot state to CPU memory in the background while training continues. Write to persistent storage asynchronously. Libraries: PyTorch DCP (Distributed Checkpoint), Nebula (Azure).
- Sharded checkpointing: Each GPU saves its shard independently. Avoids bottleneck of gathering full state on one GPU. Enables resume with different GPU count.
Elastic training:
- PyTorch Elastic (TorchElastic): Automatically detects failed workers, restarts them, and resumes training from the latest checkpoint. Supports scaling up/down without restarting the entire job.
- In-memory redundancy: Replicate optimizer state across multiple GPUs so that a single GPU failure does not require loading from disk. Faster recovery at the cost of memory.
- Heartbeat monitoring: Each worker sends periodic heartbeats. If a heartbeat is missed, the coordinator marks the worker as failed and triggers recovery.
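At its core, the heartbeat check reduces to a timestamp comparison. A minimal illustrative checker (names are made up, not a TorchElastic API):

```python
# Coordinator-side heartbeat check: workers record their last-heartbeat
# timestamp; anything older than `timeout` seconds is declared failed
# and would trigger recovery (restart worker, resume from checkpoint).

def failed_workers(last_heartbeat, now, timeout=30.0):
    return sorted(w for w, t in last_heartbeat.items() if now - t > timeout)

beats = {"worker-0": 100.0, "worker-1": 95.0, "worker-2": 60.0}
print(failed_workers(beats, now=100.0))   # worker-2 missed its window
```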
Infrastructure-level fault tolerance:
- Health checks before launching training: verify GPU ECC errors, NVLink status, InfiniBand link health, NCCL connectivity tests
- Automatic GPU replacement: detect faulty GPUs via DCGM, drain the node, replace or disable the GPU, re-add to scheduler
- Network redundancy: dual-rail InfiniBand (two independent networks) so a single switch failure does not halt training
Q8: What is gradient accumulation and when should you use it?
Answer: Gradient accumulation simulates a larger batch size by accumulating gradients over multiple forward-backward passes before performing a weight update. Instead of updating weights every mini-batch, you update every K mini-batches.
Why it matters:
- Memory constraint: When the desired batch size does not fit in GPU memory, gradient accumulation lets you achieve the same effective batch size with a smaller per-GPU batch size.
- Communication reduction: In data parallelism, you only need to AllReduce gradients once per weight update, not once per micro-batch. With K accumulation steps, communication is reduced by K.
- Example: You want effective batch size 1024, have 8 GPUs, and each GPU can fit batch size 16. Per-GPU effective: 1024/8 = 128. Accumulation steps: 128/16 = 8. Each GPU does 8 forward-backward passes, accumulates gradients, then does one AllReduce and weight update.
Important details:
- Loss must be scaled by 1/K (or equivalently, use mean reduction and do not scale)
- Batch normalization statistics are computed per micro-batch, not per effective batch — this can affect convergence. Use Group Normalization or Layer Normalization for large models.
- Learning rate should be scaled relative to effective batch size, not micro-batch size
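The accumulate-then-step pattern, including the 1/K scaling, in toy pure-Python form (the loss and all names are illustrative):

```python
# Gradient accumulation: K micro-batch gradients are summed, each
# scaled by 1/K (equivalent to taking the mean), and the optimizer
# step happens once per K micro-batches instead of once per batch.

def accumulate_and_step(w, micro_batches, lr=0.1):
    k = len(micro_batches)
    acc = 0.0
    for x in micro_batches:
        grad = 2.0 * (w - x)   # gradient of an illustrative (w - x)^2 loss
        acc += grad / k        # scale by 1/K == mean over micro-batches
    return w - lr * acc        # one weight update per K micro-batches

# Matches a single big-batch step on the concatenated data:
print(accumulate_and_step(0.0, [1.0, 2.0, 3.0, 2.0]))
```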
Q9: How do you scale distributed training from 8 GPUs to 1,000+ GPUs? What changes?
Answer: Scaling from 8 to 1,000+ GPUs introduces qualitative changes, not just quantitative ones:
Communication becomes the bottleneck:
- At 8 GPUs (single node), NVLink provides 900 GB/s. Communication is nearly free.
- At 1,000 GPUs (125 nodes), inter-node communication crosses InfiniBand at 400 Gbps (50 GB/s) per node. AllReduce time grows with the number of nodes.
- Solution: Use 3D parallelism. TP within nodes (NVLink-fast), PP across a few nodes (point-to-point, moderate bandwidth needed), DP across all node groups (AllReduce, but only within DP groups, not globally).
Gradient compression and overlap:
- Overlap AllReduce with backward computation: start AllReduce for layer N's gradients while computing layer N-1's gradients
- Gradient compression (1-bit Adam, PowerSGD) reduces communication volume by 10–100x at the cost of slight convergence differences
Batch size and learning rate:
- 1,000 GPUs with batch 32 per GPU = effective batch 32,000. Very large batches can destabilize training.
- Use linear scaling rule (scale LR proportionally to batch size) with warmup period
- LARS/LAMB optimizers designed specifically for large-batch training
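The linear scaling rule plus warmup can be written as a tiny schedule function (a sketch assuming a reference batch size of 256; not any framework's scheduler API):

```python
# Linear scaling rule with warmup: the target LR scales with the
# effective batch size relative to a reference batch, and ramps up
# linearly over `warmup_steps` to avoid early-training instability
# at very large batch sizes.

def lr_at_step(step, base_lr, batch_size, ref_batch=256, warmup_steps=1000):
    target = base_lr * batch_size / ref_batch       # linear scaling rule
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps   # linear warmup
    return target

print(lr_at_step(0, 0.1, 32768))      # tiny LR at the first step
print(lr_at_step(5000, 0.1, 32768))   # full scaled LR: 0.1 * 128 = 12.8
```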
Reliability and operations:
- At 1,000 GPUs, expect 1–3 hardware failures per day. Elastic training and checkpointing are mandatory.
- Need a job scheduler (Slurm, Kubernetes + Volcano) to manage GPU allocation across teams
- Pre-flight health checks on all GPUs before launching multi-day training runs
Q10: What is the difference between synchronous and asynchronous distributed training?
Answer:
Synchronous training (standard approach): All GPUs compute gradients, then wait for AllReduce to complete before updating weights. Every GPU has identical model weights at all times.
- Pro: Mathematically equivalent to single-GPU training with larger batch size. Deterministic. No staleness.
- Con: Slowest GPU determines iteration time (straggler problem). All GPUs blocked during AllReduce.
Asynchronous training: Each GPU computes gradients and updates a shared parameter server without waiting for other GPUs. GPUs may read stale weights.
- Pro: No straggler problem. Higher GPU utilization. Scales to very large clusters.
- Con: Gradient staleness degrades convergence. Model weights on different GPUs diverge. Harder to debug and reproduce.
In practice: Synchronous training dominates modern large-scale training (GPT-4, Llama, etc.) because:
- NVLink and InfiniBand have made communication fast enough that AllReduce overhead is manageable
- Convergence guarantees are critical when a training run costs millions of dollars
- Straggler mitigation (backup workers, health checks, elastic training) addresses synchronous training's weakness
Asynchronous SGD is still used for very large-scale recommendation models (e.g., Meta's DLRM) where the embedding tables are too large for all-reduce and parameter servers are used for sparse updates.
Lilly Tech Systems