RDMA for AI (Intermediate)

Remote Direct Memory Access (RDMA) enables data transfer between machines without involving the operating system kernel, achieving near-hardware-speed communication. For AI training, RDMA combined with GPUDirect allows gradient tensors to flow directly from one GPU's memory to another GPU across the network, bypassing both CPUs entirely.

How RDMA Works

  • Zero-copy — Data moves directly between application memory buffers without kernel buffering
  • Kernel bypass — The network adapter reads/writes memory without CPU intervention
  • Hardware offload — Protocol processing is handled by the network adapter, freeing CPU cycles
  • Result — Latency drops from ~10 microseconds (TCP) to ~1 microsecond (RDMA), with near-line-rate throughput
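On a host with an RDMA-capable NIC, the kernel-bypass path above can be inspected and benchmarked from the command line. A minimal sketch, assuming the rdma-core and perftest packages are installed; device names and the server hostname are placeholders that vary per system:

```shell
# List RDMA-capable devices, their port state, and active MTU (rdma-core)
ibv_devinfo | grep -E "hca_id|state|active_mtu"

# Measure RDMA send latency between two hosts (perftest).
# On the server, run:   ib_send_lat
# On the client, point at the server (placeholder hostname):
ib_send_lat rdma-server-host
```

If the reported latency is in the low single-digit microseconds, the zero-copy path is working; TCP-level numbers suggest traffic is falling back to the kernel stack.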

GPUDirect Technologies

| Technology | Description | Benefit |
| --- | --- | --- |
| GPUDirect P2P | Direct GPU-to-GPU transfer over PCIe | Bypasses host memory for intra-node GPU communication |
| GPUDirect RDMA | Direct GPU-to-NIC transfer | Network adapter reads GPU memory directly; GPU-to-GPU across nodes without CPU |
| GPUDirect Storage | Direct GPU-to-storage transfer | Load datasets directly into GPU memory from NVMe drives |
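Whether these paths are available on a given host can be checked before training starts. A rough sketch, assuming an NVIDIA data-center GPU and driver (the GPUDirect RDMA module is named nv_peer_mem in older Mellanox OFED stacks and nvidia_peermem in newer drivers):

```shell
# Show the PCIe/NVLink topology between GPUs and NICs;
# PIX/PXB entries indicate P2P-friendly paths, SYS means
# traffic crosses the inter-socket link
nvidia-smi topo -m

# Check that a GPUDirect RDMA kernel module is loaded
lsmod | grep -i -E "nv_peer_mem|nvidia_peermem"
```

If the module is missing, GPUDirect RDMA silently falls back to staging through host memory.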

NCCL: The Communication Library

NVIDIA Collective Communications Library (NCCL) is the standard library for multi-GPU and multi-node communication in AI training. It automatically selects the best communication path:

  • Intra-node — Uses NVLink/NVSwitch for GPU-to-GPU communication
  • Inter-node — Uses GPUDirect RDMA over InfiniBand or RoCE
  • Algorithms — Ring all-reduce, tree all-reduce, and hybrid algorithms optimized for different message sizes
```bash
# Key NCCL environment variables for optimal performance
export NCCL_IB_DISABLE=0          # Enable the InfiniBand transport
export NCCL_NET_GDR_LEVEL=5       # Allow GPUDirect RDMA up to system-level distance (5 = SYS)
export NCCL_IB_HCA=mlx5           # Select Mellanox InfiniBand HCAs
export NCCL_ALGO=Ring             # Or Tree, depending on topology and message size
```

Performance Tip: Always verify that GPUDirect RDMA is active by setting NCCL_DEBUG=INFO and checking the NCCL logs for "NET/IB" and "GDRDMA" messages. Falling back to host-memory staging can halve your inter-node bandwidth and significantly reduce training throughput.
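One way to confirm the whole path end to end is to run the all_reduce_perf benchmark from NVIDIA's nccl-tests repository with debug logging enabled. A sketch, assuming nccl-tests is built locally and two 8-GPU nodes are reachable via MPI; the hostnames and GPU counts are placeholders:

```shell
# Enable NCCL debug output so transport selection is visible
export NCCL_DEBUG=INFO

# All-reduce benchmark across 2 nodes x 8 GPUs.
# -b/-e set the message-size sweep, -f the growth factor,
# -g the number of GPUs per process
mpirun -np 16 -H node1:8,node2:8 \
    ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1

# In the log output, look for "NET/IB" (InfiniBand transport) and
# "GDRDMA" (GPUDirect RDMA) rather than host-memory staging
```

The reported bus bandwidth at large message sizes should approach the NIC line rate when GPUDirect RDMA is engaged.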

Ready to Learn NVLink/NVSwitch?

The next lesson covers NVIDIA's GPU-to-GPU interconnects for intra-node communication.
