HPC Networking: Advanced Best Practices

This final lesson covers the operational best practices for running production AI cluster networks, including congestion control, network monitoring, troubleshooting common issues, and performance tuning to achieve maximum distributed training throughput.

Congestion Control

  • InfiniBand — Uses credit-based flow control that is inherently lossless; configure adaptive routing for multi-path load balancing
  • RoCE — Requires Priority Flow Control (PFC) and ECN to prevent packet loss; careful DCBX configuration is essential
  • NCCL tuning — Set NCCL_MIN_NCHANNELS and NCCL_MAX_NCHANNELS to match your network bandwidth
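The NCCL tuning above is typically applied as environment variables in the job launch script. A minimal sketch follows; the channel counts are illustrative placeholders, not recommendations for any specific fabric, and should be validated against measured bandwidth from nccl-tests:

```shell
# Sketch: NCCL channel tuning in a training launch script.
# The values below are illustrative assumptions; tune them against
# measured all-reduce bandwidth on your own fabric.

export NCCL_MIN_NCHANNELS=8    # lower bound on parallel channels
export NCCL_MAX_NCHANNELS=16   # upper bound; more channels can help
                               # saturate high-bandwidth NICs
export NCCL_DEBUG=INFO         # log transport and algorithm selection
```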

Network Monitoring

| Metric            | Tool                          | Alert Threshold               |
|-------------------|-------------------------------|-------------------------------|
| Port errors       | ibdiagnet, perfquery          | Any uncorrectable errors      |
| Link throughput   | ibstat, Prometheus IB exporter| <50% expected bandwidth       |
| Congestion events | Switch telemetry              | Sustained congestion >1 minute|
| NCCL performance  | nccl-tests, application logs  | All-reduce time regression >20%|
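The port-error check in the table can be scripted with perfquery from infiniband-diags. This is a sketch; the exact counter names in perfquery's output (e.g. SymbolErrorCounter, LinkDownedCounter) vary by HCA, so verify the grep pattern against your hardware:

```shell
# Sketch: surface error-related counters on the local HCA port.
# Assumes infiniband-diags is installed; counter names are
# hardware-dependent, so check raw `perfquery` output first.

perfquery | grep -Ei 'error|downed|discard'
# Any counter that is nonzero and increasing warrants investigation.
```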

Troubleshooting Common Issues

  • Slow all-reduce — Check for degraded links (ibdiagnet), verify GPUDirect RDMA is active, check NCCL topology detection
  • Training hangs — Often caused by a single slow node; check per-node NCCL bandwidth with nccl-tests
  • Packet drops on RoCE — Verify PFC is configured correctly on all switches in the path; check ECN marking thresholds
  • Asymmetric performance — Check NUMA affinity between GPUs and NICs; verify all links are at expected speed
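To isolate a single slow node, a common pattern is to run the nccl-tests all-reduce benchmark one node at a time and compare bus bandwidth. The sketch below assumes nccl-tests is built locally and jobs are launched with Open MPI; the hostnames and GPU count are placeholders:

```shell
# Sketch: per-node NCCL bandwidth check with nccl-tests.
# Assumes ./nccl-tests is built, Open MPI is available, and each
# node has 8 GPUs; hostnames are placeholders for your cluster.

for host in node01 node02 node03; do
  echo "== $host =="
  mpirun -np 8 -H "$host:8" \
    ./nccl-tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
done
# A node whose busbw is well below its peers is the likely straggler.
```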

Performance Tuning Checklist

  • Verify NVLink/NVSwitch health with nvidia-smi nvlink -s
  • Confirm GPUDirect RDMA in NCCL logs (NCCL_DEBUG=INFO)
  • Match NCCL algorithm to topology (ring for rail, tree for fat-tree)
  • Set PCIe ACS (Access Control Services) to disabled for GPUDirect P2P
  • Pin IRQs to the correct NUMA node for each NIC
  • Run nccl-tests all-reduce benchmark to establish baseline performance
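The IRQ-pinning step in the checklist can be sketched as follows. The interrupt name pattern (mlx5) and CPU list are assumptions about your hardware; find the NIC's NUMA node in /sys/class/net/<if>/device/numa_node and its CPUs with lscpu:

```shell
# Sketch: pin a NIC's IRQs to the CPUs of its local NUMA node.
# Assumes a Mellanox/NVIDIA NIC whose interrupts contain "mlx5";
# adjust the pattern and CPU list for your hardware.

CPULIST="0-15"   # CPUs on the NIC's NUMA node (assumption)
for irq in $(grep mlx5 /proc/interrupts | awk -F: '{print $1}'); do
  echo "$CPULIST" > "/proc/irq/${irq}/smp_affinity_list"
done
```

Run this after disabling irqbalance (or excluding these IRQs from it), since irqbalance will otherwise overwrite the affinity settings.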

Course Complete: You now understand the networking technologies that enable large-scale AI training, from InfiniBand and RDMA to NVLink/NVSwitch and network topology design. Apply this knowledge to build, operate, and troubleshoot high-performance AI cluster networks.

Continue Learning

Explore AI data storage architecture to understand how to feed data to your GPU clusters efficiently.

AI Data Storage →