NVLink/NVSwitch for AI

NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect, and NVSwitch is the fabric chip that connects all GPUs within a node into a fully connected topology. Together they provide roughly an order of magnitude more bandwidth than PCIe, keeping the most communication-intensive multi-GPU traffic inside the node and off the slower inter-node network.

NVLink Generations

Generation | GPU  | Bandwidth per Link | Total per GPU
-----------|------|--------------------|---------------------
NVLink 3.0 | A100 | 50 GB/s            | 600 GB/s (12 links)
NVLink 4.0 | H100 | 50 GB/s            | 900 GB/s (18 links)
NVLink 5.0 | B200 | 100 GB/s           | 1800 GB/s (18 links)
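The per-GPU totals follow directly from the link count times the per-link bandwidth. A quick sanity check of the figures (link counts and per-link rates as published by NVIDIA for each generation):

```python
# Per-GPU NVLink bandwidth = number of links x bandwidth per link (GB/s).
generations = {
    "NVLink 3.0 (A100)": (12, 50),    # (links, GB/s per link)
    "NVLink 4.0 (H100)": (18, 50),
    "NVLink 5.0 (B200)": (18, 100),   # NVIDIA specs: 18 links at 100 GB/s
}

for name, (links, per_link) in generations.items():
    total = links * per_link
    print(f"{name}: {links} x {per_link} GB/s = {total} GB/s")
```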

NVSwitch Architecture

NVSwitch provides all-to-all GPU communication within a node:

  • DGX A100 — 6 NVSwitches connecting 8 A100 GPUs with 600 GB/s of NVLink bandwidth per GPU
  • DGX H100 — 4 NVSwitches connecting 8 H100 GPUs with 900 GB/s per GPU
  • Full mesh — Every GPU can communicate with every other GPU at full bandwidth simultaneously
  • No hop penalty — Unlike PCIe tree topology, NVSwitch provides uniform bandwidth between any GPU pair
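To see why uniform full-bandwidth connectivity matters, consider a back-of-envelope estimate of a gradient all-reduce inside one node. A minimal sketch using the standard bandwidth-optimal ring all-reduce cost model (the 900 GB/s figure is the DGX H100 per-GPU spec above; the 10 GB gradient size is illustrative):

```python
def allreduce_time_s(size_gb: float, n_gpus: int, bw_gb_s: float) -> float:
    """Bandwidth-optimal ring all-reduce moves 2*(N-1)/N of the data
    through each GPU's links; latency terms are ignored."""
    return 2 * (n_gpus - 1) / n_gpus * size_gb / bw_gb_s

# 10 GB of gradients across 8 H100s at 900 GB/s per GPU
t = allreduce_time_s(10, 8, 900)
print(f"{t * 1e3:.2f} ms")  # → 19.44 ms
```

The same reduction over a 64 GB/s PCIe path would take roughly 14x longer, which is why all-reduce-heavy workloads are kept inside the NVSwitch domain.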

NVLink vs PCIe for AI Training

  • Bandwidth — NVLink 4.0 provides 900 GB/s of total bidirectional bandwidth vs PCIe Gen5 x16 at 128 GB/s bidirectional (64 GB/s per direction), roughly 7x more
  • Latency — NVLink has lower latency than PCIe for GPU-to-GPU transfers
  • Tensor parallelism — Only practical over NVLink due to fine-grained, frequent communication between GPUs
  • Data parallelism — Benefits from NVLink for intra-node all-reduce, uses InfiniBand for inter-node
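The bandwidth gap translates directly into transfer time, for example when exchanging activations under tensor parallelism. A rough comparison (latency ignored; the tensor size is illustrative, and on NVSwitch systems a single GPU pair can use the full per-GPU aggregate):

```python
# Time to move a 2 GB activation tensor between two GPUs.
size_gb = 2.0
nvlink_bw = 900.0   # GB/s, NVLink 4.0 per-GPU aggregate
pcie_bw = 64.0      # GB/s, PCIe Gen5 x16, per direction

print(f"NVLink: {size_gb / nvlink_bw * 1e3:.2f} ms")  # → 2.22 ms
print(f"PCIe:   {size_gb / pcie_bw * 1e3:.2f} ms")    # → 31.25 ms
```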

Monitoring NVLink Health

Bash
# Check NVLink status and throughput
nvidia-smi nvlink -s -i 0

# Check NVLink error counters
nvidia-smi nvlink -e -i 0

# Run the p2pBandwidthLatencyTest CUDA sample to measure NVLink throughput
# (recent CUDA toolkits no longer bundle prebuilt samples; build it from
# the cuda-samples repository and adjust the path accordingly)
/usr/local/cuda/samples/bin/p2pBandwidthLatencyTest

Architecture Tip: When designing parallelism strategies, use tensor parallelism within NVLink-connected GPU groups and data/pipeline parallelism across nodes. This maximizes the use of high-bandwidth NVLink for the most communication-intensive operations.
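The tip above can be sketched as a simple rank-grouping scheme. Assuming a hypothetical 2-node, 16-GPU job with ranks assigned node by node (the layout is an illustration, not tied to any specific framework), tensor-parallel groups stay inside each 8-GPU NVLink domain while data-parallel groups span nodes:

```python
def build_groups(world_size: int, gpus_per_node: int):
    """Tensor-parallel groups = GPUs sharing one NVLink/NVSwitch domain;
    data-parallel groups = same local rank across nodes (network traffic)."""
    tp_groups = [list(range(n, n + gpus_per_node))
                 for n in range(0, world_size, gpus_per_node)]
    dp_groups = [list(range(r, world_size, gpus_per_node))
                 for r in range(gpus_per_node)]
    return tp_groups, dp_groups

tp, dp = build_groups(world_size=16, gpus_per_node=8)
print(tp)  # [[0..7], [8..15]] — each group is fully NVLink-connected
print(dp)  # [[0, 8], [1, 9], ...] — each pair crosses the inter-node network
```

The frequent, fine-grained tensor-parallel collectives then ride NVLink, while the once-per-step data-parallel all-reduce is the only traffic that crosses the slower network.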

Ready to Learn Network Topology?

The next lesson covers network topology design for AI clusters.