Introduction to HPC Networking for AI (Beginner)

Networking is often the bottleneck in distributed AI training. When hundreds of GPUs need to synchronize gradients every few hundred milliseconds, the network must deliver terabytes per second of aggregate bandwidth with microsecond latency. This lesson introduces why networking is critical for AI and the technology stack that makes large-scale training possible.

Why Networking Matters for AI

In distributed data-parallel training, every GPU computes gradients on its batch of data, then all GPUs must communicate their gradients via an all-reduce operation. The time spent communicating determines the scaling efficiency:

  • Perfect scaling — 8 GPUs train 8x faster than 1 GPU (communication time is zero)
  • Real-world scaling — 8 GPUs typically train 6-7x faster due to communication overhead
  • Poor networking — 8 GPUs may train only 3-4x faster, wasting 50%+ of GPU capacity on waiting
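The scaling figures above follow from a simple cost model: per-step time is compute time plus non-overlapped communication time. A minimal sketch (the timings below are illustrative assumptions, not measurements):

```python
# Sketch of data-parallel scaling efficiency, assuming per-step time =
# compute time + non-overlapped communication time. Numbers are illustrative.

def scaling_efficiency(compute_s: float, comm_s: float, n_gpus: int) -> float:
    """Speedup over 1 GPU divided by the GPU count (1.0 = perfect scaling)."""
    step_time = compute_s + comm_s       # actual time per step on n_gpus GPUs
    speedup = n_gpus * compute_s / step_time
    return speedup / n_gpus

# 8 GPUs, 100 ms of compute per step:
print(scaling_efficiency(0.100, 0.015, 8))  # fast network -> ~0.87 (about 7x speedup)
print(scaling_efficiency(0.100, 0.100, 8))  # slow network -> 0.5 (only 4x speedup)
```

With communication as long as compute, half the GPU capacity is spent waiting, matching the "poor networking" case above.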

The AI Networking Stack

| Layer              | Technology                   | Bandwidth            |
|--------------------|------------------------------|----------------------|
| Intra-GPU          | HBM Memory Bus               | 2-3 TB/s             |
| Intra-Node GPU-GPU | NVLink / NVSwitch            | 600-900 GB/s         |
| GPU-CPU            | PCIe Gen5                    | 64 GB/s per direction|
| Inter-Node         | InfiniBand NDR / RoCE        | 400 Gbps (50 GB/s)   |
| Data Center        | Ethernet / InfiniBand Fabric | Variable, multi-path |
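To make the bandwidth gaps between layers concrete, here is a rough calculation of how long moving a 1 GiB tensor takes at each layer's headline rate (peak numbers from the table; real transfers see lower effective throughput):

```python
# Rough time to move a 1 GiB tensor across each layer of the stack,
# using the headline bandwidths from the table (illustrative peak rates).

GIB = 1 << 30  # bytes in one GiB

layers_gb_per_s = {
    "HBM (intra-GPU)": 2500,            # ~2.5 TB/s
    "NVLink (intra-node)": 900,
    "PCIe Gen5 (GPU-CPU)": 64,
    "InfiniBand NDR (inter-node)": 50,  # 400 Gbps
}

for name, bw in layers_gb_per_s.items():
    seconds = (GIB / 1e9) / bw          # bytes in GB divided by GB/s
    print(f"{name:30s} ~{seconds * 1e3:.2f} ms")
```

The same tensor that moves in under half a millisecond within a GPU takes tens of milliseconds between nodes, which is why inter-node links dominate communication cost.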

Bandwidth vs Latency

  • Bandwidth — How much data can be transferred per second. Critical for large all-reduce operations in data-parallel training with large models
  • Latency — How quickly a small message can be delivered. Critical for pipeline-parallel training where stages exchange activations frequently
  • Both matter — Modern AI training requires both high bandwidth (for gradient sync) and low latency (for control messages and small tensor exchanges)

Key Insight: A 10x improvement in inter-node bandwidth (e.g., from 100 Gbps Ethernet to 400 Gbps InfiniBand plus RDMA) can improve distributed training throughput by 30-50% for large models, because it reduces the communication bottleneck in all-reduce operations.
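To see how bandwidth drives the gradient-sync bottleneck, here is a sketch using the standard bandwidth-bound cost model for ring all-reduce, t ≈ 2(N-1)/N · S/B, where S is gradient bytes and B is per-link bandwidth. The model size and link rates are illustrative assumptions:

```python
# Bandwidth-bound ring all-reduce time: t = 2*(N-1)/N * S / B.
# Model size and link bandwidths below are illustrative assumptions.

def ring_allreduce_s(grad_bytes: float, link_gb_per_s: float, n_gpus: int) -> float:
    """Approximate all-reduce time in seconds, ignoring latency terms."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / (link_gb_per_s * 1e9)

GRAD_BYTES = 7e9 * 2  # hypothetical 7B-parameter model, fp16 gradients -> 14 GB

print(ring_allreduce_s(GRAD_BYTES, 12.5, 64))  # 100 Gbps Ethernet   -> ~2.2 s
print(ring_allreduce_s(GRAD_BYTES, 50.0, 64))  # 400 Gbps InfiniBand -> ~0.55 s
```

The 4x link speedup cuts this communication term 4x, but since compute time is unchanged and communication partly overlaps with it, the end-to-end throughput gain is smaller, consistent with the 30-50% figure above.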

Ready to Learn InfiniBand?

The next lesson covers InfiniBand architecture, the dominant networking technology for AI clusters.

Next: InfiniBand →