HPC Networking Best Practices (Advanced)
This final lesson covers the operational best practices for running production AI cluster networks, including congestion control, network monitoring, troubleshooting common issues, and performance tuning to achieve maximum distributed training throughput.
Congestion Control
- InfiniBand — Uses credit-based flow control that is inherently lossless; configure adaptive routing for multi-path load balancing
- RoCE — Requires Priority Flow Control (PFC) and ECN to prevent packet loss; careful DCBX configuration is essential
- NCCL tuning — Set `NCCL_MIN_NCHANNELS` and `NCCL_MAX_NCHANNELS` to match your network bandwidth
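As a rough illustration of the NCCL tuning item, the sketch below builds a process environment with the two channel variables set. The helper name `nccl_channel_env` and the values 4 and 16 are assumptions made for the example, not recommendations; suitable bounds depend on your NIC count and link speed and should be confirmed with a benchmark.

```python
import os


def nccl_channel_env(min_channels, max_channels, base_env=None):
    """Return a copy of an environment with NCCL channel bounds set.

    Hypothetical helper for illustration; only the two variable names
    come from the lesson text.
    """
    if min_channels < 1 or max_channels < min_channels:
        raise ValueError("need 1 <= min_channels <= max_channels")
    # Copy so the caller's environment is never modified in place.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_MIN_NCHANNELS"] = str(min_channels)
    env["NCCL_MAX_NCHANNELS"] = str(max_channels)
    return env


# Illustrative values only (assumption), starting from an empty environment.
env = nccl_channel_env(4, 16, base_env={})
print(env)
```

The returned dictionary can be handed to a launcher, for example through the `env` argument of `subprocess.run`, so the setting applies to one job rather than the whole shell.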
Network Monitoring
| Metric | Tool | Alert Threshold |
|---|---|---|
| Port errors | ibdiagnet, perfquery | Any uncorrectable errors |
| Link throughput | ibstat (negotiated link rate), Prometheus IB exporter (traffic counters) | <50% of expected bandwidth |
| Congestion events | Switch telemetry | Sustained congestion >1 minute |
| NCCL performance | nccl-tests, application logs | All-reduce time regression >20% |
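The thresholds in the table can be expressed as a small check, sketched below. The `Sample` fields and alert names are hypothetical; collecting the numbers from ibdiagnet, switch telemetry, or nccl-tests is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring snapshot; field names are illustrative assumptions."""
    uncorrectable_errors: int
    throughput_gbps: float
    expected_gbps: float
    congestion_seconds: float
    allreduce_ms: float
    baseline_allreduce_ms: float


def check_alerts(s):
    """Return the names of the table thresholds that this sample violates."""
    alerts = []
    if s.uncorrectable_errors > 0:                      # any uncorrectable error
        alerts.append("port_errors")
    if s.throughput_gbps < 0.5 * s.expected_gbps:       # below 50% of expected
        alerts.append("low_throughput")
    if s.congestion_seconds > 60:                       # sustained over 1 minute
        alerts.append("sustained_congestion")
    if s.allreduce_ms > 1.2 * s.baseline_allreduce_ms:  # more than 20% slower
        alerts.append("allreduce_regression")
    return alerts


healthy = Sample(0, 180.0, 200.0, 5.0, 10.0, 10.0)
degraded = Sample(3, 80.0, 200.0, 90.0, 13.0, 10.0)
print(check_alerts(healthy))
print(check_alerts(degraded))
```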
Troubleshooting Common Issues
- Slow all-reduce — Check for degraded links (ibdiagnet), verify GPUDirect RDMA is active, check NCCL topology detection
- Training hangs — Often caused by a single slow node; check per-node NCCL bandwidth with nccl-tests
- Packet drops on RoCE — Verify PFC is configured correctly on all switches in the path; check ECN marking thresholds
- Asymmetric performance — Check NUMA affinity between GPUs and NICs; verify all links are at expected speed
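For the single-slow-node case, one possible approach is to compare each node's measured bandwidth against the cluster median. This is a minimal sketch: the function name, the 0.8 tolerance, and the sample figures are assumptions, and the per-node numbers are assumed to come from your own nccl-tests runs.

```python
from statistics import median


def find_slow_nodes(busbw_by_node, tolerance=0.8):
    """Return nodes whose bandwidth is below tolerance * median, sorted by name.

    busbw_by_node maps a node name to its measured bandwidth in GB/s.
    The 0.8 default is an illustrative assumption, not a standard value.
    """
    if not busbw_by_node:
        return []
    typical = median(busbw_by_node.values())
    return sorted(
        node for node, bw in busbw_by_node.items() if bw < tolerance * typical
    )


# Made-up measurements for illustration only.
measurements = {"node01": 185.0, "node02": 190.0, "node03": 92.0, "node04": 188.0}
print(find_slow_nodes(measurements))
```

Using the median rather than the mean keeps one badly degraded node from dragging down the reference value it is compared against.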
Performance Tuning Checklist
- Verify NVLink/NVSwitch health with `nvidia-smi nvlink -s`
- Confirm GPUDirect RDMA in NCCL logs (`NCCL_DEBUG=INFO`)
- Match NCCL algorithm to topology (ring for rail, tree for fat-tree)
- Disable PCIe ACS (Access Control Services) for GPUDirect P2P; with ACS enabled, peer-to-peer traffic is redirected through the root complex
- Pin IRQs to the correct NUMA node for each NIC
- Run nccl-tests all-reduce benchmark to establish baseline performance
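When recording a baseline, it helps to know how the all-reduce numbers relate to each other. The sketch below follows the convention used by nccl-tests, where algorithm bandwidth is message size divided by time and bus bandwidth scales it by 2(n-1)/n for n ranks. The function name and the example figures are assumptions for illustration.

```python
def allreduce_bandwidth(size_bytes, seconds, ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    algbw = size / time; busbw = algbw * 2 * (ranks - 1) / ranks.
    GB here means 1e9 bytes.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if ranks < 2:
        raise ValueError("all-reduce needs at least 2 ranks")
    algbw = size_bytes / seconds / 1e9
    busbw = algbw * 2 * (ranks - 1) / ranks
    return algbw, busbw


# Illustrative figures (assumption): 8 GB reduced in 0.5 s across 8 ranks.
algbw, busbw = allreduce_bandwidth(8_000_000_000, 0.5, 8)
print(f"algbw={algbw:.1f} GB/s busbw={busbw:.1f} GB/s")
```

Bus bandwidth is the more useful figure to track over time because it is comparable across different rank counts.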
Course Complete: You now understand the networking technologies that enable large-scale AI training, from InfiniBand and RDMA to NVLink/NVSwitch and network topology design. Apply this knowledge to build, operate, and troubleshoot high-performance AI cluster networks.
Lilly Tech Systems