Introduction to HPC Networking for AI (Beginner)

Networking is often the bottleneck in distributed AI training. When hundreds of GPUs need to synchronize gradients every few hundred milliseconds, the network must deliver terabytes per second of aggregate bandwidth with microsecond latency. This lesson introduces why networking is critical for AI and the technology stack that makes large-scale training possible.

Why Networking Matters for AI

In distributed data-parallel training, every GPU computes gradients on its batch of data, then all GPUs must communicate their gradients via an all-reduce operation. The time spent communicating determines the scaling efficiency:

  • Perfect scaling — 8 GPUs train 8x faster than 1 GPU (communication time is zero)
  • Real-world scaling — 8 GPUs typically train 6-7x faster due to communication overhead
  • Poor networking — 8 GPUs may train only 3-4x faster, wasting 50%+ of GPU capacity on waiting
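The scaling figures above follow from a simple cost model: per-step time is compute time plus non-overlapped communication time. A minimal sketch (the timings below are illustrative assumptions, not measurements):

```python
# Sketch of data-parallel scaling efficiency, assuming per-step time =
# compute time + non-overlapped communication time. Numbers are illustrative.

def scaling_efficiency(compute_s: float, comm_s: float, n_gpus: int) -> float:
    """Speedup over 1 GPU divided by the GPU count (1.0 = perfect scaling)."""
    step_time = compute_s + comm_s       # actual time per step on n_gpus GPUs
    speedup = n_gpus * compute_s / step_time
    return speedup / n_gpus

# 8 GPUs, 100 ms of compute per step:
print(scaling_efficiency(0.100, 0.015, 8))  # fast network -> ~0.87 (about 7x speedup)
print(scaling_efficiency(0.100, 0.100, 8))  # slow network -> 0.5 (only 4x speedup)
```

With communication as long as compute, half the GPU capacity is spent waiting, matching the "poor networking" case above.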

The AI Networking Stack

| Layer              | Technology                   | Bandwidth            |
|--------------------|------------------------------|----------------------|
| Intra-GPU          | HBM Memory Bus               | 2-3 TB/s             |
| Intra-Node GPU-GPU | NVLink / NVSwitch            | 600-900 GB/s         |
| GPU-CPU            | PCIe Gen5                    | 64 GB/s per direction|
| Inter-Node         | InfiniBand NDR / RoCE        | 400 Gbps (50 GB/s)   |
| Data Center        | Ethernet / InfiniBand Fabric | Variable, multi-path |
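To make the bandwidth gaps between layers concrete, here is a rough calculation of how long moving a 1 GiB tensor takes at each layer's headline rate (peak numbers from the table; real transfers see lower effective throughput):

```python
# Rough time to move a 1 GiB tensor across each layer of the stack,
# using the headline bandwidths from the table (illustrative peak rates).

GIB = 1 << 30  # bytes in one GiB

layers_gb_per_s = {
    "HBM (intra-GPU)": 2500,            # ~2.5 TB/s
    "NVLink (intra-node)": 900,
    "PCIe Gen5 (GPU-CPU)": 64,
    "InfiniBand NDR (inter-node)": 50,  # 400 Gbps
}

for name, bw in layers_gb_per_s.items():
    seconds = (GIB / 1e9) / bw          # bytes in GB divided by GB/s
    print(f"{name:30s} ~{seconds * 1e3:.2f} ms")
```

The same tensor that moves in under half a millisecond within a GPU takes tens of milliseconds between nodes, which is why inter-node links dominate communication cost.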

Bandwidth vs Latency

  • Bandwidth — How much data can be transferred per second. Critical for large all-reduce operations in data-parallel training with large models
  • Latency — How quickly a small message can be delivered. Critical for pipeline-parallel training where stages exchange activations frequently
  • Both matter — Modern AI training requires both high bandwidth (for gradient sync) and low latency (for control messages and small tensor exchanges)

Key Insight: A 10x improvement in inter-node bandwidth (e.g., from 100 Gbps Ethernet to 400 Gbps InfiniBand plus RDMA) can improve distributed training throughput by 30-50% for large models, because it reduces the communication bottleneck in all-reduce operations.
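To see how bandwidth drives the gradient-sync bottleneck, here is a sketch using the standard bandwidth-bound cost model for ring all-reduce, t ≈ 2(N-1)/N · S/B, where S is gradient bytes and B is per-link bandwidth. The model size and link rates are illustrative assumptions:

```python
# Bandwidth-bound ring all-reduce time: t = 2*(N-1)/N * S / B.
# Model size and link bandwidths below are illustrative assumptions.

def ring_allreduce_s(grad_bytes: float, link_gb_per_s: float, n_gpus: int) -> float:
    """Approximate all-reduce time in seconds, ignoring latency terms."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / (link_gb_per_s * 1e9)

GRAD_BYTES = 7e9 * 2  # hypothetical 7B-parameter model, fp16 gradients -> 14 GB

print(ring_allreduce_s(GRAD_BYTES, 12.5, 64))  # 100 Gbps Ethernet   -> ~2.2 s
print(ring_allreduce_s(GRAD_BYTES, 50.0, 64))  # 400 Gbps InfiniBand -> ~0.55 s
```

The 4x link speedup cuts this communication term 4x, but since compute time is unchanged and communication partly overlaps with it, the end-to-end throughput gain is smaller, consistent with the 30-50% figure above.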

Ready to Learn InfiniBand?

The next lesson covers InfiniBand architecture, the dominant networking technology for AI clusters.

Next: InfiniBand →