Distributed Training Questions
Distributed training is the core competency for AI infrastructure engineers. These 10 questions cover data parallelism, model parallelism, communication primitives, modern frameworks, and fault tolerance — the topics that appear in virtually every AI infrastructure interview.
Q1: Explain data parallelism. How does it work and what are its limitations?
Answer: Data parallelism is the simplest distributed training strategy. Each GPU holds a complete copy of the model, and the training data is split across GPUs. Each GPU processes its mini-batch independently, then all GPUs synchronize gradients before updating weights.
Steps per iteration:
- Each GPU receives a different mini-batch of data
- Each GPU performs forward pass independently
- Each GPU computes gradients independently
- AllReduce: Average gradients across all GPUs
- Each GPU updates its local model weights with averaged gradients
Limitations:
- Model must fit on single GPU: Every GPU holds a full copy of model weights, gradients, and optimizer states. Mixed-precision training with Adam needs roughly 16 bytes per parameter (FP16 weights and gradients plus FP32 optimizer states), so a 70B model requires on the order of 1.1 TB, far exceeding single GPU memory.
- Communication overhead scales with model size: AllReduce must transfer gradient data proportional to model parameters. For large models, this becomes the bottleneck.
- Diminishing returns at scale: As you add more GPUs, the communication-to-computation ratio increases. At some point, GPUs spend more time waiting for gradient synchronization than computing.
- Batch size scaling: More GPUs = larger effective batch size. Very large batch sizes can degrade model quality unless you use learning rate warmup and scaling rules (linear scaling rule: double batch size = double learning rate).
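The per-iteration steps above can be simulated in a few lines of pure Python. This is a toy sketch with illustrative names (no real GPUs or collective library involved): averaging plays the role of AllReduce.

```python
# Toy simulation of one data-parallel training step.
# Each "GPU" holds a full copy of a small weight vector; gradients are
# averaged across replicas (a stand-in for AllReduce) before every
# replica applies the same update, keeping all copies in sync.

def local_gradient(weights, batch):
    # Illustrative gradient of a least-squares loss: grad = 2 * (w - x)
    return [2.0 * (w - x) for w, x in zip(weights, batch)]

def data_parallel_step(replicas, batches, lr=0.1):
    # 1-3. Each replica computes its gradient on its own mini-batch.
    grads = [local_gradient(w, b) for w, b in zip(replicas, batches)]
    # 4. "AllReduce": element-wise average of the gradients.
    n = len(replicas)
    avg = [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]
    # 5. Every replica applies the same averaged gradient, so all
    #    model copies stay identical.
    return [[w - lr * g for w, g in zip(rep, avg)] for rep in replicas]

replicas = [[0.0, 0.0] for _ in range(4)]   # 4 GPUs, identical init
batches = [[1.0, 2.0], [3.0, 2.0], [1.0, 4.0], [3.0, 4.0]]
new = data_parallel_step(replicas, batches)
assert all(r == new[0] for r in new)        # replicas remain in sync
```

The final assertion captures the key invariant of data parallelism: because every replica sees the same averaged gradient, the model copies never diverge.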
Q2: What is AllReduce and how does the ring AllReduce algorithm work?
Answer: AllReduce is a collective communication operation where each participant starts with a local value, and all participants end up with the global aggregate (typically sum or average). In distributed training, it is used to average gradients across all GPUs.
Ring AllReduce: The most common implementation. N GPUs are arranged in a logical ring. The algorithm proceeds in two phases:
Phase 1 — Reduce-Scatter: Each GPU splits its gradient buffer into N chunks. In N-1 steps, each GPU sends one chunk to the next GPU in the ring and receives one chunk from the previous GPU, accumulating (adding) the received data. After N-1 steps, each GPU has the fully reduced version of one chunk.
Phase 2 — AllGather: In N-1 more steps, each GPU sends its fully reduced chunk around the ring so every GPU ends up with all N fully reduced chunks.
Why ring AllReduce is efficient:
- Total data transferred per GPU: 2 x (N-1)/N x data_size. This approaches 2x data_size regardless of N — it scales well.
- Each GPU only communicates with its two neighbors, maximizing bandwidth utilization.
- The algorithm saturates the interconnect bandwidth, achieving near-theoretical throughput on NVLink.
Alternatives: Tree AllReduce (lower latency for small messages), recursive halving-doubling (good for power-of-2 GPU counts), and NCCL's adaptive algorithms that choose the best strategy based on message size and topology.
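The two phases can be simulated in pure Python over lists (a toy sketch; real implementations move GPU buffers via NCCL, and all sends in a step happen concurrently, which the snapshot of `sends` models):

```python
# Toy ring AllReduce. Each of the N "GPUs" starts with its own buffer of
# N chunks (one scalar per chunk here); afterwards every GPU holds the
# element-wise sum across all buffers.

def ring_allreduce(buffers):
    n = len(buffers)
    buf = [list(b) for b in buffers]
    # Phase 1 - reduce-scatter: in step s, GPU g sends chunk (g - s) mod n
    # to its right neighbor, which accumulates it into its own copy.
    for step in range(n - 1):
        sends = [((g + 1) % n, (g - step) % n, buf[g][(g - step) % n])
                 for g in range(n)]
        for dst, c, val in sends:
            buf[dst][c] += val
    # Now GPU g owns the fully reduced chunk (g + 1) mod n.
    # Phase 2 - allgather: circulate the reduced chunks around the ring
    # until every GPU has all of them.
    for step in range(n - 1):
        sends = [((g + 1) % n, (g + 1 - step) % n, buf[g][(g + 1 - step) % n])
                 for g in range(n)]
        for dst, c, val in sends:
            buf[dst][c] = val
    return buf

out = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
assert all(b == [12, 15, 18] for b in out)   # column sums everywhere
```

Counting transfers: each GPU sends (N-1) chunks of size data_size/N in each phase, which is exactly the 2 x (N-1)/N x data_size figure above.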
Q3: What is NCCL and why is it important for distributed training?
Answer: NCCL (NVIDIA Collective Communications Library, pronounced "nickel") is NVIDIA's optimized library for multi-GPU and multi-node collective communication. It provides AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter operations.
Why NCCL matters:
- Topology-aware: NCCL detects the GPU interconnect topology (NVLink, NVSwitch, PCIe, InfiniBand) and automatically selects the optimal communication algorithm. It knows that intra-node NVLink is faster than inter-node InfiniBand and routes accordingly.
- Hierarchical AllReduce: For multi-node training, NCCL first does intra-node AllReduce over NVLink, then inter-node AllReduce over InfiniBand, then broadcasts results within nodes. This minimizes expensive cross-network transfers.
- Overlap with computation: NCCL operations run as CUDA kernels on their own streams, which lets frameworks such as PyTorch DDP overlap gradient AllReduce with backward-pass computation instead of serializing the two.
- GDR (GPUDirect RDMA): Enables direct GPU-to-GPU data transfer across nodes without going through CPU memory. Reduces latency and increases bandwidth for inter-node communication.
Common NCCL debugging:
- NCCL_DEBUG=INFO: Enable verbose logging to see topology detection, algorithm selection, and ring/tree construction.
- NCCL_IB_DISABLE=1: Force TCP instead of InfiniBand for debugging network issues.
- NCCL_P2P_DISABLE=1: Disable peer-to-peer (NVLink) transfers for debugging inter-GPU communication.
- NCCL timeout errors usually indicate a straggler GPU (hardware fault, thermal throttling) or a network issue, not an NCCL bug.
Q4: Explain model parallelism: tensor parallelism vs pipeline parallelism.
Answer: Model parallelism splits the model across GPUs instead of replicating it. Two main approaches:
Tensor Parallelism (TP): Splits individual layers across GPUs. For a transformer, the attention and FFN weight matrices are partitioned column-wise or row-wise across GPUs. Each GPU computes a portion of the layer output, then GPUs exchange results via AllReduce/AllGather.
- Pro: Low memory per GPU (weights split evenly). Simple to implement with Megatron-LM patterns.
- Con: Requires high-bandwidth interconnect (NVLink) because every layer requires communication. Not practical across nodes with InfiniBand — too much latency.
- Typical usage: TP=2, 4, or 8 within a single NVLink-connected node.
Pipeline Parallelism (PP): Splits the model by layers across GPUs. GPU 0 gets layers 1–10, GPU 1 gets layers 11–20, etc. Micro-batches are pipelined through stages to minimize bubble overhead.
- Pro: Only sends activations between stages (smaller than AllReduce). Works across nodes with InfiniBand because communication is point-to-point and less frequent.
- Con: Pipeline bubble — GPUs are idle while waiting for their stage's input. Bubble ratio = (P-1)/(M+P-1) where P = pipeline stages, M = micro-batches. Mitigated by using many micro-batches.
- Typical usage: PP=2, 4, 8 across nodes.
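The bubble formula above is easy to check numerically. A small illustrative helper (not part of any framework API):

```python
# Pipeline bubble ratio (P-1)/(M+P-1): the fraction of each iteration
# the pipeline spends idle, for P stages and M micro-batches.

def bubble_ratio(stages, micro_batches):
    return (stages - 1) / (micro_batches + stages - 1)

# More micro-batches shrink the bubble:
print(round(bubble_ratio(4, 4), 3))    # 3/7  -> 0.429
print(round(bubble_ratio(4, 32), 3))   # 3/35 -> 0.086
```

This is why pipeline parallelism wants M >> P: with M = 8P the bubble is already under 12 percent.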
3D Parallelism (DP + TP + PP): Modern large model training uses all three. Example: training a 175B model on 512 GPUs with DP=64, TP=4, PP=2. TP within NVLink nodes, PP across nodes, DP across all groups.
Q5: What is DeepSpeed and how do its ZeRO optimization stages work?
Answer: DeepSpeed is Microsoft's open-source library for efficient distributed training. Its flagship feature is ZeRO (Zero Redundancy Optimizer), which eliminates memory redundancy in data parallelism.
The problem ZeRO solves: In standard data parallelism, every GPU stores a full copy of model parameters, gradients, and optimizer states. For a 7B model in mixed precision with Adam: 14 GB FP16 params + 14 GB FP16 gradients + 84 GB optimizer states (FP32 master weights, momentum, and variance, 12 bytes per parameter) = 112 GB per GPU. With 8 GPUs, total memory used = 896 GB, but unique data = 112 GB. That is 8x redundancy.
ZeRO stages:
- ZeRO-1 (Optimizer State Partitioning): Shards optimizer states across GPUs. Each GPU stores 1/N of optimizer states. Memory reduction: ~4x for Adam. Communication: same as standard data parallelism.
- ZeRO-2 (+ Gradient Partitioning): Also shards gradients. Each GPU only stores gradients for its optimizer state partition. Memory reduction: ~8x. Communication: slightly more than ZeRO-1 (reduce-scatter instead of AllReduce for gradients).
- ZeRO-3 (+ Parameter Partitioning): Also shards model parameters. GPUs gather parameters on-demand during forward and backward passes, then discard them. Memory reduction: linear with GPU count. Communication: significantly more — AllGather for params during forward and backward.
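The stage-by-stage savings can be estimated with a small helper. This sketch assumes the ZeRO paper's mixed-precision Adam accounting (2 bytes FP16 weights + 2 bytes FP16 gradients + 12 bytes FP32 optimizer states per parameter); the function name is illustrative, not a DeepSpeed API:

```python
# Per-GPU memory (GB) for a model of `params` parameters under each
# ZeRO stage. Each stage shards one more component across `n_gpus`:
# stage 1 shards optimizer states, stage 2 adds gradients, stage 3
# adds parameters.

def zero_memory_gb(params, n_gpus, stage=0):
    p, g, o = 2 * params, 2 * params, 12 * params   # bytes per component
    if stage >= 1:
        o /= n_gpus
    if stage >= 2:
        g /= n_gpus
    if stage >= 3:
        p /= n_gpus
    return (p + g + o) / 1e9

for s in range(4):
    print(f"ZeRO-{s}: {zero_memory_gb(7e9, 8, s):.2f} GB/GPU")
# Stage 0 needs 112 GB/GPU for a 7B model; stage 3 drops to 14 GB/GPU.
```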
ZeRO-Offload/Infinity: Extends ZeRO-3 by offloading optimizer states, gradients, or parameters to CPU memory or NVMe SSDs. Enables training a 1T parameter model on a single DGX node at the cost of slower training speed.
When to use each stage: ZeRO-1 for most cases (free memory savings, no speed penalty). ZeRO-2 when ZeRO-1 is insufficient. ZeRO-3 only when the model cannot fit any other way — the communication overhead is significant.
Q6: What is FSDP and how does it compare to DeepSpeed ZeRO?
Answer: FSDP (Fully Sharded Data Parallelism) is PyTorch's native implementation of ZeRO-3-style training. It shards model parameters, gradients, and optimizer states across GPUs and gathers them on-demand.
How FSDP works:
- Model is wrapped in FSDP, which flattens and shards parameters across GPUs
- During forward pass: AllGather to reconstruct full parameters for each FSDP unit, compute, then discard non-local parameters
- During backward pass: AllGather parameters again, compute gradients, ReduceScatter to shard gradients
- Each GPU updates only its shard of optimizer states and parameters
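The gather-compute-discard cycle can be mimicked with plain lists (a toy sketch; real FSDP shards flattened torch tensors, and all names here are illustrative):

```python
# Toy simulation of FSDP's shard -> AllGather -> discard cycle for one
# FSDP unit. Each "GPU" persistently stores only its shard; the full
# parameter buffer exists only transiently during compute.

def shard(flat_params, n_gpus):
    # Pad so the flat buffer divides evenly, then split into n shards.
    per = -(-len(flat_params) // n_gpus)               # ceil division
    padded = flat_params + [0.0] * (per * n_gpus - len(flat_params))
    return [padded[i * per:(i + 1) * per] for i in range(n_gpus)]

def forward_with_allgather(shards, n_real):
    # AllGather: every GPU reconstructs the full parameter buffer...
    full = [p for s in shards for p in s][:n_real]
    out = sum(full)        # stand-in for the unit's actual computation
    # ...then the non-local parameters are discarded again (here `full`
    # simply goes out of scope); only `shards` persist between calls.
    return out

params = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = shard(params, 4)            # each GPU keeps ceil(5/4) = 2 values
assert [len(s) for s in shards] == [2, 2, 2, 2]
```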
FSDP vs DeepSpeed ZeRO:
| Feature | PyTorch FSDP | DeepSpeed ZeRO |
|---|---|---|
| Integration | Native PyTorch, minimal code changes | Requires DeepSpeed launcher and config |
| Flexibility | ZeRO-3 equivalent only | ZeRO-1, 2, 3, Offload, Infinity |
| Sharding strategy | FULL_SHARD, SHARD_GRAD_OP, NO_SHARD | Stage 1, 2, 3 with fine-grained control |
| CPU offloading | Supported but less mature | Well-tested ZeRO-Offload and ZeRO-Infinity |
| Ecosystem | Works with any PyTorch code, HuggingFace native | Broader feature set: sparse attention, compression |
Recommendation: Use FSDP for pure PyTorch projects where simplicity matters. Use DeepSpeed when you need ZeRO-1/2 (less overhead than full sharding), CPU/NVMe offloading, or advanced features like sparse attention and communication compression.
Q7: How do you handle fault tolerance in large-scale distributed training?
Answer: At scale (hundreds to thousands of GPUs), hardware failures are expected, not exceptional. A 1,000-GPU cluster with 0.1% daily failure rate per GPU will see ~1 GPU failure per day. Fault tolerance must be designed into the system.
Checkpointing strategies:
- Periodic checkpointing: Save model state, optimizer state, and RNG state every N steps. Trade-off: frequent checkpointing (every 100 steps) wastes I/O bandwidth; infrequent (every 10,000 steps) wastes compute on recovery. Typical: every 500–1,000 steps or 15–30 minutes.
- Asynchronous checkpointing: Snapshot state to CPU memory in the background while training continues. Write to persistent storage asynchronously. Libraries: PyTorch DCP (Distributed Checkpoint), Nebula (Azure).
- Sharded checkpointing: Each GPU saves its shard independently. Avoids bottleneck of gathering full state on one GPU. Enables resume with different GPU count.
Elastic training:
- PyTorch Elastic (TorchElastic): Automatically detects failed workers, restarts them, and resumes training from the latest checkpoint. Supports scaling up/down without restarting the entire job.
- In-memory redundancy: Replicate optimizer state across multiple GPUs so that a single GPU failure does not require loading from disk. Faster recovery at the cost of memory.
- Heartbeat monitoring: Each worker sends periodic heartbeats. If a heartbeat is missed, the coordinator marks the worker as failed and triggers recovery.
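At its core, the heartbeat check reduces to a timestamp comparison. A minimal illustrative checker (names are made up, not a TorchElastic API):

```python
# Coordinator-side heartbeat check: workers record their last-heartbeat
# timestamp; anything older than `timeout` seconds is declared failed
# and would trigger recovery (restart worker, resume from checkpoint).

def failed_workers(last_heartbeat, now, timeout=30.0):
    return sorted(w for w, t in last_heartbeat.items() if now - t > timeout)

beats = {"worker-0": 100.0, "worker-1": 95.0, "worker-2": 60.0}
print(failed_workers(beats, now=100.0))   # worker-2 missed its window
```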
Infrastructure-level fault tolerance:
- Health checks before launching training: verify GPU ECC errors, NVLink status, InfiniBand link health, NCCL connectivity tests
- Automatic GPU replacement: detect faulty GPUs via DCGM, drain the node, replace or disable the GPU, re-add to scheduler
- Network redundancy: dual-rail InfiniBand (two independent networks) so a single switch failure does not halt training
Q8: What is gradient accumulation and when should you use it?
Answer: Gradient accumulation simulates a larger batch size by accumulating gradients over multiple forward-backward passes before performing a weight update. Instead of updating weights every mini-batch, you update every K mini-batches.
Why it matters:
- Memory constraint: When the desired batch size does not fit in GPU memory, gradient accumulation lets you achieve the same effective batch size with a smaller per-GPU batch size.
- Communication reduction: In data parallelism, you only need to AllReduce gradients once per weight update, not once per micro-batch. With K accumulation steps, communication is reduced by K.
- Example: You want effective batch size 1024, have 8 GPUs, and each GPU can fit batch size 16. Per-GPU effective: 1024/8 = 128. Accumulation steps: 128/16 = 8. Each GPU does 8 forward-backward passes, accumulates gradients, then does one AllReduce and weight update.
Important details:
- Loss must be scaled by 1/K (or equivalently, use mean reduction and do not scale)
- Batch normalization statistics are computed per micro-batch, not per effective batch — this can affect convergence. Use Group Normalization or Layer Normalization for large models.
- Learning rate should be scaled relative to effective batch size, not micro-batch size
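The accumulate-then-step pattern, including the 1/K scaling, in toy pure-Python form (the loss and all names are illustrative):

```python
# Gradient accumulation: K micro-batch gradients are summed, each
# scaled by 1/K (equivalent to taking the mean), and the optimizer
# step happens once per K micro-batches instead of once per batch.

def accumulate_and_step(w, micro_batches, lr=0.1):
    k = len(micro_batches)
    acc = 0.0
    for x in micro_batches:
        grad = 2.0 * (w - x)   # gradient of an illustrative (w - x)^2 loss
        acc += grad / k        # scale by 1/K == mean over micro-batches
    return w - lr * acc        # one weight update per K micro-batches

# Matches a single big-batch step on the concatenated data:
print(accumulate_and_step(0.0, [1.0, 2.0, 3.0, 2.0]))
```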
Q9: How do you scale distributed training from 8 GPUs to 1,000+ GPUs? What changes?
Answer: Scaling from 8 to 1,000+ GPUs introduces qualitative changes, not just quantitative ones:
Communication becomes the bottleneck:
- At 8 GPUs (single node), NVLink provides 900 GB/s. Communication is nearly free.
- At 1,000 GPUs (125 nodes), inter-node communication crosses InfiniBand at 400 Gbps (50 GB/s) per node. AllReduce time grows with the number of nodes.
- Solution: Use 3D parallelism. TP within nodes (NVLink-fast), PP across a few nodes (point-to-point, moderate bandwidth needed), DP across all node groups (AllReduce, but only within DP groups, not globally).
Gradient compression and overlap:
- Overlap AllReduce with backward computation: start AllReduce for layer N's gradients while computing layer N-1's gradients
- Gradient compression (1-bit Adam, PowerSGD) reduces communication volume by 10–100x at the cost of slight convergence differences
Batch size and learning rate:
- 1,000 GPUs with batch 32 per GPU = effective batch 32,000. Very large batches can destabilize training.
- Use linear scaling rule (scale LR proportionally to batch size) with warmup period
- LARS/LAMB optimizers designed specifically for large-batch training
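The linear scaling rule plus warmup can be written as a tiny schedule function (a sketch assuming a reference batch size of 256; not any framework's scheduler API):

```python
# Linear scaling rule with warmup: the target LR scales with the
# effective batch size relative to a reference batch, and ramps up
# linearly over `warmup_steps` to avoid early-training instability
# at very large batch sizes.

def lr_at_step(step, base_lr, batch_size, ref_batch=256, warmup_steps=1000):
    target = base_lr * batch_size / ref_batch       # linear scaling rule
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps   # linear warmup
    return target

print(lr_at_step(0, 0.1, 32768))      # tiny LR at the first step
print(lr_at_step(5000, 0.1, 32768))   # full scaled LR: 0.1 * 128 = 12.8
```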
Reliability and operations:
- At 1,000 GPUs, expect 1–3 hardware failures per day. Elastic training and checkpointing are mandatory.
- Need a job scheduler (Slurm, Kubernetes + Volcano) to manage GPU allocation across teams
- Pre-flight health checks on all GPUs before launching multi-day training runs
Q10: What is the difference between synchronous and asynchronous distributed training?
Answer:
Synchronous training (standard approach): All GPUs compute gradients, then wait for AllReduce to complete before updating weights. Every GPU has identical model weights at all times.
- Pro: Mathematically equivalent to single-GPU training with larger batch size. Deterministic. No staleness.
- Con: Slowest GPU determines iteration time (straggler problem). All GPUs blocked during AllReduce.
Asynchronous training: Each GPU computes gradients and updates a shared parameter server without waiting for other GPUs. GPUs may read stale weights.
- Pro: No straggler problem. Higher GPU utilization. Scales to very large clusters.
- Con: Gradient staleness degrades convergence. Model weights on different GPUs diverge. Harder to debug and reproduce.
In practice: Synchronous training dominates modern large-scale training (GPT-4, Llama, etc.) because:
- NVLink and InfiniBand have made communication fast enough that AllReduce overhead is manageable
- Convergence guarantees are critical when a training run costs millions of dollars
- Straggler mitigation (backup workers, health checks, elastic training) addresses synchronous training's weakness
Asynchronous SGD is still used for very large-scale recommendation models (e.g., Meta's DLRM) where the embedding tables are too large for all-reduce and parameter servers are used for sparse updates.
Lilly Tech Systems