NVLink/NVSwitch for AI

NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect, and NVSwitch is the fabric chip that connects all GPUs within a node into a fully connected topology. Together they provide roughly an order of magnitude more bandwidth than PCIe, keeping the most communication-intensive multi-GPU traffic inside the node and off the slower inter-node network.

NVLink Generations

Generation | GPU  | Bandwidth per Link | Total per GPU
-----------|------|--------------------|---------------------
NVLink 3.0 | A100 | 50 GB/s            | 600 GB/s (12 links)
NVLink 4.0 | H100 | 50 GB/s            | 900 GB/s (18 links)
NVLink 5.0 | B200 | 100 GB/s           | 1800 GB/s (18 links)
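The per-GPU totals follow directly from the link count times the per-link bandwidth. A quick sanity check of the figures (link counts and per-link rates as published by NVIDIA for each generation):

```python
# Per-GPU NVLink bandwidth = number of links x bandwidth per link (GB/s).
generations = {
    "NVLink 3.0 (A100)": (12, 50),    # (links, GB/s per link)
    "NVLink 4.0 (H100)": (18, 50),
    "NVLink 5.0 (B200)": (18, 100),   # NVIDIA specs: 18 links at 100 GB/s
}

for name, (links, per_link) in generations.items():
    total = links * per_link
    print(f"{name}: {links} x {per_link} GB/s = {total} GB/s")
```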

NVSwitch Architecture

NVSwitch provides all-to-all GPU communication within a node:

  • DGX A100 — 6 NVSwitches connecting 8 A100 GPUs with 600 GB/s of NVLink bandwidth per GPU
  • DGX H100 — 4 NVSwitches connecting 8 H100 GPUs with 900 GB/s per GPU
  • Full mesh — Every GPU can communicate with every other GPU at full bandwidth simultaneously
  • No hop penalty — Unlike PCIe tree topology, NVSwitch provides uniform bandwidth between any GPU pair
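To see why uniform full-bandwidth connectivity matters, consider a back-of-envelope estimate of a gradient all-reduce inside one node. A minimal sketch using the standard bandwidth-optimal ring all-reduce cost model (the 900 GB/s figure is the DGX H100 per-GPU spec above; the 10 GB gradient size is illustrative):

```python
def allreduce_time_s(size_gb: float, n_gpus: int, bw_gb_s: float) -> float:
    """Bandwidth-optimal ring all-reduce moves 2*(N-1)/N of the data
    through each GPU's links; latency terms are ignored."""
    return 2 * (n_gpus - 1) / n_gpus * size_gb / bw_gb_s

# 10 GB of gradients across 8 H100s at 900 GB/s per GPU
t = allreduce_time_s(10, 8, 900)
print(f"{t * 1e3:.2f} ms")  # → 19.44 ms
```

The same reduction over a 64 GB/s PCIe path would take roughly 14x longer, which is why all-reduce-heavy workloads are kept inside the NVSwitch domain.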

NVLink vs PCIe for AI Training

  • Bandwidth — NVLink 4.0 provides 900 GB/s of total bidirectional bandwidth vs PCIe Gen5 x16 at 128 GB/s bidirectional (64 GB/s per direction), roughly 7x more
  • Latency — NVLink has lower latency than PCIe for GPU-to-GPU transfers
  • Tensor parallelism — Only practical over NVLink due to fine-grained, frequent communication between GPUs
  • Data parallelism — Benefits from NVLink for intra-node all-reduce, uses InfiniBand for inter-node
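The bandwidth gap translates directly into transfer time, for example when exchanging activations under tensor parallelism. A rough comparison (latency ignored; the tensor size is illustrative, and on NVSwitch systems a single GPU pair can use the full per-GPU aggregate):

```python
# Time to move a 2 GB activation tensor between two GPUs.
size_gb = 2.0
nvlink_bw = 900.0   # GB/s, NVLink 4.0 per-GPU aggregate
pcie_bw = 64.0      # GB/s, PCIe Gen5 x16, per direction

print(f"NVLink: {size_gb / nvlink_bw * 1e3:.2f} ms")  # → 2.22 ms
print(f"PCIe:   {size_gb / pcie_bw * 1e3:.2f} ms")    # → 31.25 ms
```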

Monitoring NVLink Health

Bash
# Check NVLink status and throughput
nvidia-smi nvlink -s -i 0

# Check NVLink error counters
nvidia-smi nvlink -e -i 0

# Run the p2pBandwidthLatencyTest CUDA sample to measure NVLink throughput
# (recent CUDA toolkits no longer bundle prebuilt samples; build it from
# the cuda-samples repository and adjust the path accordingly)
/usr/local/cuda/samples/bin/p2pBandwidthLatencyTest

Architecture Tip: When designing parallelism strategies, use tensor parallelism within NVLink-connected GPU groups and data/pipeline parallelism across nodes. This maximizes the use of high-bandwidth NVLink for the most communication-intensive operations.
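The tip above can be sketched as a simple rank-grouping scheme. Assuming a hypothetical 2-node, 16-GPU job with ranks assigned node by node (the layout is an illustration, not tied to any specific framework), tensor-parallel groups stay inside each 8-GPU NVLink domain while data-parallel groups span nodes:

```python
def build_groups(world_size: int, gpus_per_node: int):
    """Tensor-parallel groups = GPUs sharing one NVLink/NVSwitch domain;
    data-parallel groups = same local rank across nodes (network traffic)."""
    tp_groups = [list(range(n, n + gpus_per_node))
                 for n in range(0, world_size, gpus_per_node)]
    dp_groups = [list(range(r, world_size, gpus_per_node))
                 for r in range(gpus_per_node)]
    return tp_groups, dp_groups

tp, dp = build_groups(world_size=16, gpus_per_node=8)
print(tp)  # [[0..7], [8..15]] — each group is fully NVLink-connected
print(dp)  # [[0, 8], [1, 9], ...] — each pair crosses the inter-node network
```

The frequent, fine-grained tensor-parallel collectives then ride NVLink, while the once-per-step data-parallel all-reduce is the only traffic that crosses the slower network.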

Ready to Learn Network Topology?

The next lesson covers network topology design for AI clusters.