HPC Networking: Advanced Best Practices

This final lesson covers the operational best practices for running production AI cluster networks, including congestion control, network monitoring, troubleshooting common issues, and performance tuning to achieve maximum distributed training throughput.

Congestion Control

  • InfiniBand — Uses credit-based flow control that is inherently lossless; configure adaptive routing for multi-path load balancing
  • RoCE — Requires Priority Flow Control (PFC) and ECN to prevent packet loss; careful DCBX configuration is essential
  • NCCL tuning — Set NCCL_MIN_NCHANNELS and NCCL_MAX_NCHANNELS to match your network bandwidth
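The NCCL tuning above is typically applied as environment variables in the job launch script. A minimal sketch follows; the channel counts are illustrative placeholders, not recommendations for any specific fabric, and should be validated against measured bandwidth from nccl-tests:

```shell
# Sketch: NCCL channel tuning in a training launch script.
# The values below are illustrative assumptions; tune them against
# measured all-reduce bandwidth on your own fabric.

export NCCL_MIN_NCHANNELS=8    # lower bound on parallel channels
export NCCL_MAX_NCHANNELS=16   # upper bound; more channels can help
                               # saturate high-bandwidth NICs
export NCCL_DEBUG=INFO         # log transport and algorithm selection
```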

Network Monitoring

| Metric            | Tool                          | Alert Threshold               |
|-------------------|-------------------------------|-------------------------------|
| Port errors       | ibdiagnet, perfquery          | Any uncorrectable errors      |
| Link throughput   | ibstat, Prometheus IB exporter| <50% expected bandwidth       |
| Congestion events | Switch telemetry              | Sustained congestion >1 minute|
| NCCL performance  | nccl-tests, application logs  | All-reduce time regression >20%|
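The port-error check in the table can be scripted with perfquery from infiniband-diags. This is a sketch; the exact counter names in perfquery's output (e.g. SymbolErrorCounter, LinkDownedCounter) vary by HCA, so verify the grep pattern against your hardware:

```shell
# Sketch: surface error-related counters on the local HCA port.
# Assumes infiniband-diags is installed; counter names are
# hardware-dependent, so check raw `perfquery` output first.

perfquery | grep -Ei 'error|downed|discard'
# Any counter that is nonzero and increasing warrants investigation.
```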

Troubleshooting Common Issues

  • Slow all-reduce — Check for degraded links (ibdiagnet), verify GPUDirect RDMA is active, check NCCL topology detection
  • Training hangs — Often caused by a single slow node; check per-node NCCL bandwidth with nccl-tests
  • Packet drops on RoCE — Verify PFC is configured correctly on all switches in the path; check ECN marking thresholds
  • Asymmetric performance — Check NUMA affinity between GPUs and NICs; verify all links are at expected speed
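To isolate a single slow node, a common pattern is to run the nccl-tests all-reduce benchmark one node at a time and compare bus bandwidth. The sketch below assumes nccl-tests is built locally and jobs are launched with Open MPI; the hostnames and GPU count are placeholders:

```shell
# Sketch: per-node NCCL bandwidth check with nccl-tests.
# Assumes ./nccl-tests is built, Open MPI is available, and each
# node has 8 GPUs; hostnames are placeholders for your cluster.

for host in node01 node02 node03; do
  echo "== $host =="
  mpirun -np 8 -H "$host:8" \
    ./nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
done
# A node whose busbw is well below its peers is the likely straggler.
```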

Performance Tuning Checklist

  • Verify NVLink/NVSwitch health with nvidia-smi nvlink -s
  • Confirm GPUDirect RDMA in NCCL logs (NCCL_DEBUG=INFO)
  • Match NCCL algorithm to topology (ring for rail, tree for fat-tree)
  • Set PCIe ACS (Access Control Services) to disabled for GPUDirect P2P
  • Pin IRQs to the correct NUMA node for each NIC
  • Run nccl-tests all-reduce benchmark to establish baseline performance
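The IRQ-pinning step in the checklist can be sketched as follows. The interrupt name pattern (mlx5) and CPU list are assumptions about your hardware; find the NIC's NUMA node in /sys/class/net/<if>/device/numa_node and its CPUs with lscpu:

```shell
# Sketch: pin a NIC's IRQs to the CPUs of its local NUMA node.
# Assumes a Mellanox/NVIDIA NIC whose interrupts contain "mlx5";
# adjust the pattern and CPU list for your hardware.

CPULIST="0-15"   # CPUs on the NIC's NUMA node (assumption)
for irq in $(grep mlx5 /proc/interrupts | awk -F: '{print $1}'); do
  echo "$CPULIST" > "/proc/irq/${irq}/smp_affinity_list"
done
```

Run this after disabling irqbalance (or excluding these IRQs from it), since irqbalance will otherwise overwrite the affinity settings.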

Course Complete: You now understand the networking technologies that enable large-scale AI training, from InfiniBand and RDMA to NVLink/NVSwitch and network topology design. Apply this knowledge to build, operate, and troubleshoot high-performance AI cluster networks.

Continue Learning

Explore AI data storage architecture to understand how to feed data to your GPU clusters efficiently.

AI Data Storage →