RDMA for AI (Intermediate)

Remote Direct Memory Access (RDMA) enables data transfer between machines without involving the operating system kernel, achieving near-hardware-speed communication. For AI training, RDMA combined with GPUDirect allows gradient tensors to flow directly from one GPU's memory to another GPU across the network, bypassing both CPUs entirely.

How RDMA Works

  • Zero-copy — Data moves directly between application memory buffers without kernel buffering
  • Kernel bypass — The network adapter reads/writes memory without CPU intervention
  • Hardware offload — Protocol processing is handled by the network adapter, freeing CPU cycles
  • Result — Latency drops from ~10 microseconds (TCP) to ~1 microsecond (RDMA), with near-line-rate throughput
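On a host with an RDMA-capable NIC, the kernel-bypass path above can be inspected and benchmarked from the command line. A minimal sketch, assuming the rdma-core and perftest packages are installed; device names and the server hostname are placeholders that vary per system:

```shell
# List RDMA-capable devices, their port state, and active MTU (rdma-core)
ibv_devinfo | grep -E "hca_id|state|active_mtu"

# Measure RDMA send latency between two hosts (perftest).
# On the server, run:   ib_send_lat
# On the client, point at the server (placeholder hostname):
ib_send_lat rdma-server-host
```

If the reported latency is in the low single-digit microseconds, the zero-copy path is working; TCP-level numbers suggest traffic is falling back to the kernel stack.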

GPUDirect Technologies

| Technology | Description | Benefit |
| --- | --- | --- |
| GPUDirect P2P | Direct GPU-to-GPU transfer over PCIe | Bypasses host memory for intra-node GPU communication |
| GPUDirect RDMA | Direct GPU-to-NIC transfer | Network adapter reads GPU memory directly; GPU-to-GPU across nodes without CPU |
| GPUDirect Storage | Direct GPU-to-storage transfer | Load datasets directly into GPU memory from NVMe drives |
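Whether these paths are available on a given host can be checked before training starts. A rough sketch, assuming an NVIDIA data-center GPU and driver (the GPUDirect RDMA module is named nv_peer_mem in older Mellanox OFED stacks and nvidia_peermem in newer drivers):

```shell
# Show the PCIe/NVLink topology between GPUs and NICs;
# PIX/PXB entries indicate P2P-friendly paths, SYS means
# traffic crosses the inter-socket link
nvidia-smi topo -m

# Check that a GPUDirect RDMA kernel module is loaded
lsmod | grep -i -E "nv_peer_mem|nvidia_peermem"
```

If the module is missing, GPUDirect RDMA silently falls back to staging through host memory.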

NCCL: The Communication Library

NVIDIA Collective Communications Library (NCCL) is the standard library for multi-GPU and multi-node communication in AI training. It automatically selects the best communication path:

  • Intra-node — Uses NVLink/NVSwitch for GPU-to-GPU communication
  • Inter-node — Uses GPUDirect RDMA over InfiniBand or RoCE
  • Algorithms — Ring all-reduce, tree all-reduce, and hybrid algorithms optimized for different message sizes
```bash
# Key NCCL environment variables for optimal performance
export NCCL_IB_DISABLE=0          # Enable the InfiniBand transport
export NCCL_NET_GDR_LEVEL=5       # Allow GPUDirect RDMA up to system-level distance (5 = SYS)
export NCCL_IB_HCA=mlx5           # Select Mellanox InfiniBand HCAs
export NCCL_ALGO=Ring             # Or Tree, depending on topology and message size
```

Performance Tip: Always verify that GPUDirect RDMA is active by setting NCCL_DEBUG=INFO and checking the NCCL logs for "NET/IB" and "GDRDMA" messages. Falling back to host-memory staging can halve your inter-node bandwidth and significantly reduce training throughput.
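One way to confirm the whole path end to end is to run the all_reduce_perf benchmark from NVIDIA's nccl-tests repository with debug logging enabled. A sketch, assuming nccl-tests is built locally and two 8-GPU nodes are reachable via MPI; the hostnames and GPU counts are placeholders:

```shell
# Enable NCCL debug output so transport selection is visible
export NCCL_DEBUG=INFO

# All-reduce benchmark across 2 nodes x 8 GPUs.
# -b/-e set the message-size sweep, -f the growth factor,
# -g the number of GPUs per process
mpirun -np 16 -H node1:8,node2:8 \
    ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1

# In the log output, look for "NET/IB" (InfiniBand transport) and
# "GDRDMA" (GPUDirect RDMA) rather than host-memory staging
```

The reported bus bandwidth at large message sizes should approach the NIC line rate when GPUDirect RDMA is engaged.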

Ready to Learn NVLink/NVSwitch?

The next lesson covers NVIDIA's GPU-to-GPU interconnects for intra-node communication.
