InfiniBand for AI (Intermediate)
InfiniBand is the dominant networking technology for AI training clusters, offering lower latency and lossless, credit-based flow control that Ethernet fabrics must approximate with careful tuning. This lesson covers InfiniBand architecture, speed generations (HDR, NDR, XDR), subnet management, and practical deployment considerations for AI infrastructure.
InfiniBand Speed Generations
| Generation | Per-Lane Speed | 4x Port Speed | Typical AI Use |
|---|---|---|---|
| HDR | 50 Gbps | 200 Gbps | A100 clusters |
| NDR | 100 Gbps | 400 Gbps | H100 clusters |
| XDR | 200 Gbps | 800 Gbps | Next-gen clusters |
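The 4x port speeds in the table follow directly from the per-lane rates. A small sketch of that arithmetic (the generation names and rates come from the table above; the helper function is illustrative, not part of any library):

```python
# Per-lane signaling rates in Gbps for recent InfiniBand generations.
LANE_GBPS = {"HDR": 50, "NDR": 100, "XDR": 200}
LANES_PER_PORT = 4  # the common 4x port width used in the table

def port_speed_gbps(generation: str, lanes: int = LANES_PER_PORT) -> int:
    """Aggregate port speed = per-lane rate x number of lanes."""
    return LANE_GBPS[generation] * lanes

for gen in LANE_GBPS:
    print(f"{gen} 4x port: {port_speed_gbps(gen)} Gbps")
# HDR 4x port: 200 Gbps
# NDR 4x port: 400 Gbps
# XDR 4x port: 800 Gbps
```

The same per-lane rates also scale to wider ports: switch-to-switch links are sometimes run as 2x splits or aggregated differently, but GPU servers typically attach at the 4x width shown here.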
InfiniBand Architecture
- Host Channel Adapter (HCA) — Network interface card installed in each GPU server (e.g., ConnectX-7)
- InfiniBand switches — High-radix switches (Quantum-2 with 64 NDR ports) forming the fabric
- Subnet Manager (SM) — Software that manages routing, discovers topology, and handles failover
- Cables — Copper (up to 2m) or active optical cables (up to 100m) connecting nodes to switches
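On a Linux host, installed HCAs are usually inspected with `ibstat` or `ibv_devinfo` from the rdma-core/infiniband-diags packages. As a minimal sketch, the same device list can be read from the standard sysfs location; the function below is hypothetical helper code, and returns an empty list on machines without InfiniBand hardware:

```python
from pathlib import Path

def list_hcas(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Return HCA device names (e.g. 'mlx5_0') exposed by the kernel.

    /sys/class/infiniband is the standard Linux sysfs location for
    RDMA devices; if it is absent, no HCAs (or drivers) are present.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir())

print(list_hcas())  # e.g. ['mlx5_0', 'mlx5_1'] on a dual-rail node
```

In practice you would follow this up with `ibstat <device>` to confirm link state, rate, and the active subnet manager.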
InfiniBand vs Ethernet for AI
| Aspect | InfiniBand NDR | RoCE (RDMA over Ethernet) |
|---|---|---|
| Bandwidth | 400 Gbps | 400 Gbps (800 GbE emerging) |
| Latency | ~0.5 microseconds | ~1-2 microseconds |
| Congestion control | Credit-based (lossless) | PFC/ECN (complex to tune) |
| Cost | Higher per port | Lower, uses existing Ethernet infrastructure |
| Ecosystem | HPC-focused | Broader enterprise compatibility |
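The latency gap in the table matters most for small messages, where base latency dominates; for large transfers, serialization time at 400 Gbps dominates and the two fabrics converge. A first-order model makes this concrete (the formula is the standard latency-plus-serialization estimate; the latency and bandwidth figures plugged in are the ones from the table, and real performance varies with congestion and message pipelining):

```python
def transfer_time_us(size_bytes: float, latency_us: float, bandwidth_gbps: float) -> float:
    """First-order one-way transfer time: base latency + serialization.

    bandwidth_gbps * 1e3 converts Gbps to bits per microsecond.
    """
    return latency_us + (size_bytes * 8) / (bandwidth_gbps * 1e3)

# 4 KB gradient chunk: latency dominates, IB's ~0.5 us edge shows up directly.
print(transfer_time_us(4096, 0.5, 400))   # InfiniBand NDR: ~0.58 us
print(transfer_time_us(4096, 1.5, 400))   # RoCE:           ~1.58 us

# 1 GB tensor: serialization (~20 ms) dwarfs either latency figure.
print(transfer_time_us(1e9, 0.5, 400))
print(transfer_time_us(1e9, 1.5, 400))
```

This is why latency-sensitive collectives over many small messages benefit most from InfiniBand, while bulk parameter transfers are largely bandwidth-bound on either fabric.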
Deployment Considerations
- Dual-rail networking — Use two InfiniBand HCAs per node for redundancy and doubled bandwidth
- NUMA awareness — Connect each HCA to the same NUMA node as its associated GPUs for optimal DMA performance
- Subnet Manager placement — Run standby SMs on multiple switches for high availability
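For the NUMA-awareness point above, the kernel exposes each HCA's NUMA attachment through sysfs, so a pinning script can verify that a GPU and its HCA share a node before launching training. A minimal sketch, assuming the standard Linux sysfs layout (`nvidia-smi topo -m` is the usual interactive way to check the same thing on NVIDIA systems):

```python
from pathlib import Path

def hca_numa_node(device: str, sysfs_root: str = "/sys/class/infiniband") -> int:
    """Return the NUMA node an HCA's PCIe device is attached to.

    Reads <sysfs_root>/<device>/device/numa_node, the standard kernel
    attribute; returns -1 if the file is missing or NUMA is not exposed.
    """
    node_file = Path(sysfs_root) / device / "device" / "numa_node"
    if not node_file.exists():
        return -1
    return int(node_file.read_text().strip())

# e.g. hca_numa_node("mlx5_0") == 0 means mlx5_0 hangs off NUMA node 0,
# so GPUs on node 0 should route their traffic through it.
```

A launcher can then bind each training rank (via `numactl` or the process's CPU affinity) to the node reported here, keeping DMA traffic off the inter-socket link.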
Pro Tip: When building a new AI cluster, always choose the latest InfiniBand generation your budget allows. The bandwidth improvement directly translates to better distributed training scaling, and the cost difference is small relative to the GPU investment.
Ready to Learn RDMA?
The next lesson covers RDMA technology that enables zero-copy data transfers over InfiniBand and Ethernet.
Next: RDMA →
Lilly Tech Systems