Introduction to HPC Networking for AI (Beginner)
Networking is often the bottleneck in distributed AI training. When hundreds of GPUs need to synchronize gradients every few hundred milliseconds, the network must deliver terabits per second of aggregate bandwidth with microsecond-scale latency. This lesson introduces why networking is critical for AI and the technology stack that makes large-scale training possible.
Why Networking Matters for AI
In distributed data-parallel training, every GPU computes gradients on its batch of data, then all GPUs must communicate their gradients via an all-reduce operation. The time spent communicating determines the scaling efficiency:
- Perfect scaling — 8 GPUs train 8x faster than 1 GPU (communication time is zero)
- Real-world scaling — 8 GPUs typically train 6-7x faster due to communication overhead
- Poor networking — 8 GPUs may train only 3-4x faster, wasting 50%+ of GPU capacity on waiting
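The effect of communication overhead on scaling can be sketched with a simple model: each training step is per-GPU compute time plus gradient-sync time, while a single GPU would need N compute steps to process the same data. The numbers below are illustrative assumptions, not measurements.

```python
def speedup(num_gpus: int, compute_time_s: float, comm_time_s: float) -> float:
    """Speedup over one GPU when each parallel step is compute + gradient sync."""
    single_gpu_time = compute_time_s * num_gpus   # one GPU processes the whole batch serially
    parallel_step_time = compute_time_s + comm_time_s
    return single_gpu_time / parallel_step_time

# Assume 100 ms of per-GPU compute per step:
print(speedup(8, 0.100, 0.000))  # no communication   -> 8.0 (perfect scaling)
print(speedup(8, 0.100, 0.025))  # 25 ms all-reduce   -> 6.4 (typical real-world)
print(speedup(8, 0.100, 0.100))  # 100 ms all-reduce  -> 4.0 (poor networking)
```

This is why faster interconnects pay off directly: every millisecond shaved off the all-reduce moves the cluster closer to perfect scaling.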
The AI Networking Stack
| Layer | Technology | Bandwidth |
|---|---|---|
| Intra-GPU | HBM Memory Bus | 2-3 TB/s |
| Intra-Node GPU-GPU | NVLink / NVSwitch | 600-900 GB/s |
| GPU-CPU | PCIe Gen5 | 64 GB/s per direction |
| Inter-Node | InfiniBand NDR / RoCE | 400 Gbps (50 GB/s) |
| Data Center | Ethernet / InfiniBand Fabric | Variable, multi-path |
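The bandwidth gap between layers in the table has direct consequences for gradient synchronization. As a rough sketch, a ring all-reduce moves about 2(N-1)/N times the gradient buffer through each GPU's link; the model size and link speeds below are assumed for illustration.

```python
def allreduce_time_s(grad_bytes: float, num_gpus: int,
                     link_bytes_per_s: float) -> float:
    """Approximate ring all-reduce time: each GPU transfers ~2*(N-1)/N of the buffer."""
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return traffic_per_gpu / link_bytes_per_s

grads = 14e9  # fp16 gradients of a 7B-parameter model (2 bytes/param), illustrative

# Same operation, different layers of the stack (8 GPUs):
print(allreduce_time_s(grads, 8, 900e9))  # NVLink at 900 GB/s     -> ~0.027 s
print(allreduce_time_s(grads, 8, 50e9))   # 400 Gbps IB at 50 GB/s -> ~0.49 s
```

This back-of-the-envelope model ignores latency and protocol overhead, but it shows why intra-node NVLink traffic is roughly an order of magnitude cheaper than crossing the inter-node fabric.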
Bandwidth vs Latency
- Bandwidth — How much data can be transferred per second. Critical for large all-reduce operations in data-parallel training with large models
- Latency — How quickly a small message can be delivered. Critical for pipeline-parallel training where stages exchange activations frequently
- Both matter — Modern AI training requires both high bandwidth (for gradient sync) and low latency (for control messages and small tensor exchanges)
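A standard way to see why both quantities matter is the simple cost model: transfer time = latency + message size / bandwidth. With assumed figures (~2 µs InfiniBand latency, 50 GB/s effective bandwidth), small messages are latency-dominated and large ones bandwidth-dominated:

```python
def transfer_time_s(message_bytes: float, latency_s: float,
                    bandwidth_bytes_per_s: float) -> float:
    """First-order cost model: fixed latency plus serialization time."""
    return latency_s + message_bytes / bandwidth_bytes_per_s

# Assumed link: ~2 us latency, 50 GB/s effective bandwidth.
small = transfer_time_s(4_096, 2e-6, 50e9)  # 4 KB control message
large = transfer_time_s(10e9, 2e-6, 50e9)   # 10 GB gradient buffer

print(small)  # ~2.1e-6 s: almost entirely latency
print(large)  # ~0.2 s:   almost entirely bandwidth
```

Cutting latency in half barely changes the large transfer, and doubling bandwidth barely changes the small one, which is why training needs both.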
Key Insight: A 10x improvement in inter-node bandwidth (e.g., from 100 Gbps Ethernet to 400 Gbps InfiniBand plus RDMA) can improve distributed training throughput by 30-50% for large models, because it reduces the communication bottleneck in all-reduce operations.
Ready to Learn InfiniBand?
The next lesson covers InfiniBand architecture, the dominant networking technology for AI clusters.
Next: InfiniBand →
Lilly Tech Systems