RDMA for AI (Intermediate)
Remote Direct Memory Access (RDMA) enables data transfer between machines without involving the operating system kernel, achieving near-hardware-speed communication. For AI training, RDMA combined with GPUDirect allows gradient tensors to flow directly from one GPU's memory to another GPU across the network, bypassing both CPUs entirely.
How RDMA Works
- Zero-copy — Data moves directly between application memory buffers without kernel buffering
- Kernel bypass — The network adapter reads/writes memory without CPU intervention
- Hardware offload — Protocol processing is handled by the network adapter, freeing CPU cycles
- Result — Latency drops from ~10 microseconds (TCP) to ~1 microsecond (RDMA), with near-line-rate throughput
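On a Linux host with the rdma-core userspace tools installed, you can confirm that RDMA-capable adapters are present before tuning anything higher in the stack. This is a diagnostic fragment, not a benchmark; device names like `mlx5_0` will vary by system.

```shell
# List RDMA devices and their capabilities (ports, link layer, firmware)
ibv_devinfo

# Show RDMA link state per port (ACTIVE means the fabric is up)
rdma link show
```

If `ibv_devinfo` reports no devices, the NIC driver or rdma-core stack is not loaded, and everything below (GPUDirect RDMA, NCCL over IB) will silently fall back to TCP.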
GPUDirect Technologies
| Technology | Description | Benefit |
|---|---|---|
| GPUDirect P2P | Direct GPU-to-GPU transfer over PCIe | Bypasses host memory for intra-node GPU communication |
| GPUDirect RDMA | Direct GPU-to-NIC transfer | Network adapter reads GPU memory directly; GPU-to-GPU across nodes without CPU |
| GPUDirect Storage | Direct GPU-to-storage transfer | Load datasets directly into GPU memory from NVMe drives |
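Before relying on GPUDirect RDMA, it helps to verify the prerequisites on each node. A minimal diagnostic sketch (module and device names are the common defaults and may differ on your system):

```shell
# GPUDirect RDMA requires the nvidia-peermem kernel module
# (successor to the older nv_peer_mem module)
lsmod | grep nvidia_peermem

# Show the PCIe/NVLink topology between GPUs and NICs;
# GPU-NIC pairs connected via PIX/PXB get the best GPUDirect RDMA paths
nvidia-smi topo -m
```

If `nvidia-peermem` is missing, the NIC cannot map GPU memory and transfers will be staged through host buffers instead.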
NCCL: The Communication Library
NVIDIA Collective Communications Library (NCCL) is the standard library for multi-GPU and multi-node communication in AI training. It automatically selects the best communication path:
- Intra-node — Uses NVLink/NVSwitch for GPU-to-GPU communication
- Inter-node — Uses GPUDirect RDMA over InfiniBand or RoCE
- Algorithms — Ring all-reduce, tree all-reduce, and hybrid algorithms optimized for different message sizes
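As a back-of-envelope check on the ring algorithm: for an all-reduce over n GPUs, each byte of data crosses the busiest link 2(n-1)/n times, which is why the nccl-tests benchmarks report a "bus bandwidth" of busbw = algbw × 2(n-1)/n. A quick sketch with assumed example numbers:

```shell
# Ring all-reduce moves 2*(n-1)/n bytes over the busiest link per byte reduced,
# so busbw = algbw * 2*(n-1)/n (n and algbw below are assumed example values)
awk 'BEGIN {
  n = 8           # GPUs in the ring (assumed)
  algbw = 50.0    # measured algorithm bandwidth, GB/s (assumed)
  busbw = algbw * 2 * (n - 1) / n
  printf "busbw: %.1f GB/s\n", busbw
}'
```

As n grows, the factor approaches 2, so ring all-reduce bandwidth is nearly independent of GPU count for large messages; for small messages, latency dominates and tree algorithms win.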
```bash
# Key NCCL environment variables for optimal performance
export NCCL_IB_DISABLE=0      # Enable InfiniBand
export NCCL_NET_GDR_LEVEL=5   # Enable GPUDirect RDMA
export NCCL_IB_HCA=mlx5       # Specify InfiniBand HCA
export NCCL_ALGO=Ring         # Or Tree, depending on topology
```
Performance Tip: Always verify GPUDirect RDMA is working by checking NCCL logs for "NET/IB" and "GDR" messages. Falling back to host memory staging can halve your inter-node bandwidth and significantly impact training throughput.
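To make the log check concrete, here is a sketch that greps sample lines patterned after `NCCL_DEBUG=INFO` output (the hostname, PIDs, and ranks are hypothetical). A transport path ending in `GDRDMA` indicates GPUDirect RDMA is active; plain `NET/IB` without it means traffic is being staged through host memory.

```shell
# Sample lines patterned after NCCL_DEBUG=INFO output (hypothetical host/PIDs)
cat > /tmp/nccl_sample.log <<'EOF'
node0:1234:1250 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB eth0
node0:1234:1250 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
EOF

# Count transport lines that confirm GPUDirect RDMA is in use
grep -c 'GDRDMA' /tmp/nccl_sample.log
```

In a real run, set `NCCL_DEBUG=INFO` on every rank and apply the same grep to the job's stdout.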
Ready to Learn NVLink/NVSwitch?
The next lesson covers NVIDIA's GPU-to-GPU interconnects for intra-node communication.
Lilly Tech Systems