Network Topology for AI Advanced
The network topology of an AI cluster determines the communication patterns, bandwidth availability, and fault tolerance for distributed training. This lesson covers the three main topology choices for AI clusters: fat-tree, rail-optimized, and dragonfly, along with their trade-offs for different training workloads.
Fat-Tree Topology
Fat-tree is the most common topology for AI clusters. It provides full bisection bandwidth, meaning any half of the cluster can communicate with the other half at full aggregate bandwidth:
- Structure — Leaf switches connect to servers; spine switches interconnect the leaf switches
- Bandwidth — Full bisection: every node can communicate at line rate simultaneously
- Advantages — Predictable performance, easy to scale, well-understood behavior
- Disadvantages — High switch and cable cost at scale, more cabling complexity
- Best for — General-purpose AI clusters running diverse workloads
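The full-bisection property above comes down to simple port arithmetic: each leaf dedicates half its ports to servers and half to spine uplinks, so uplink bandwidth matches downlink bandwidth. The sketch below sizes a two-tier leaf-spine fabric under that assumption; the function name and parameters are illustrative, not from any vendor tool.

```python
def leaf_spine_size(ports_per_switch: int, leaves: int) -> dict:
    """Size a non-blocking 2-tier leaf-spine fabric from identical switches.

    Each leaf splits its ports evenly: half face servers, half face spines,
    so aggregate uplink bandwidth equals downlink bandwidth (full bisection).
    """
    if leaves > ports_per_switch:
        raise ValueError("each spine needs one port per leaf")
    down = ports_per_switch // 2      # server-facing ports per leaf
    spines = ports_per_switch - down  # one spine port per leaf uplink
    return {
        "servers": leaves * down,
        "spines": spines,
        "total_switches": leaves + spines,
    }

print(leaf_spine_size(64, leaves=64))
# {'servers': 2048, 'spines': 32, 'total_switches': 96}
```

Note how the switch count grows with scale: 2048 servers already need 96 switches, which is the cost pressure that motivates the rail-optimized design below.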
Rail-Optimized Topology
Rail-optimized topologies group network connections to match the GPU topology within each node. Each "rail" connects GPUs at the same position across multiple nodes:
- Structure — GPU 0 of each node connects to Rail Switch 0, GPU 1 to Rail Switch 1, etc.
- Advantage — NCCL ring all-reduce naturally maps to rails, reducing cross-rail traffic
- Cost savings — 25-40% fewer switches than full fat-tree for the same node count
- Trade-off — Less flexible than fat-tree; job placement must respect rail boundaries
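The rail mapping above can be sketched in a few lines: the rail is simply the local GPU index, so a ring all-reduce that pairs GPU i on one node with GPU i on the next never leaves its rail switch. The `Endpoint` type and function names here are illustrative, not an NCCL API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Endpoint:
    node: int
    gpu: int  # local GPU index, 0..gpus_per_node-1

def rail_for(ep: Endpoint) -> int:
    """In a rail-optimized fabric, the rail is the local GPU index."""
    return ep.gpu

def same_rail(a: Endpoint, b: Endpoint) -> bool:
    """Cross-node traffic stays on a single switch iff both ends share a rail."""
    return rail_for(a) == rail_for(b)

# A ring all-reduce hop from GPU 3 on node 0 to GPU 3 on node 1
# stays entirely on Rail Switch 3:
print(same_rail(Endpoint(node=0, gpu=3), Endpoint(node=1, gpu=3)))  # True
```

This is why the trade-off above matters: a job whose GPUs land on mismatched rails across nodes loses exactly this single-switch property.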
Topology Comparison
| Topology | Bisection BW | Cost | Flexibility | Best For |
|---|---|---|---|---|
| Fat-tree | Full | Highest | Maximum | Multi-tenant, diverse workloads |
| Rail-optimized | Partial | Medium | Medium | Large training jobs, homogeneous workloads |
| Dragonfly | Variable | Lowest | Lower | Very large clusters (10K+ GPUs) |
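The table's decision logic can be encoded as a small lookup to make the trade-offs concrete. The 10K-GPU threshold comes straight from the table; the function and its parameters are illustrative.

```python
def suggest_topology(gpus: int, multi_tenant: bool) -> str:
    """Map the comparison table's 'Best For' column to a recommendation."""
    if gpus >= 10_000:
        return "dragonfly"       # very large clusters: lowest cost at scale
    if multi_tenant:
        return "fat-tree"        # full bisection bandwidth, maximum flexibility
    return "rail-optimized"      # large homogeneous training jobs, lower cost

print(suggest_topology(16_000, multi_tenant=False))  # dragonfly
print(suggest_topology(512, multi_tenant=True))      # fat-tree
```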
Topology-Aware Job Scheduling
The scheduler must understand the network topology to place distributed training jobs optimally:
- Locality preference — Place all pods of a training job on nodes connected to the same leaf switch
- Rail awareness — In rail topologies, ensure jobs use GPUs on the same rails across nodes
- Anti-affinity for inference — Spread inference replicas across different failure domains
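The locality preference above can be sketched as a best-fit placement: group free nodes by leaf switch and try to fit the whole job under one leaf before spilling across leaves. The data model is illustrative; a real scheduler would read topology from node labels (as in Kubernetes) rather than a dict.

```python
from collections import defaultdict
from typing import Optional

def place_job(free_nodes: dict, needed: int) -> Optional[list]:
    """free_nodes maps node name -> leaf switch; return chosen nodes or None.

    Prefers the smallest leaf group that still fits the job, leaving the
    larger groups free for bigger jobs (best fit).
    """
    by_leaf = defaultdict(list)
    for node, leaf in free_nodes.items():
        by_leaf[leaf].append(node)
    candidates = sorted(
        (nodes for nodes in by_leaf.values() if len(nodes) >= needed),
        key=len,
    )
    if candidates:
        return candidates[0][:needed]
    return None  # no single leaf fits; fall back to multi-leaf placement

nodes = {"n1": "leaf-a", "n2": "leaf-a", "n3": "leaf-b", "n4": "leaf-a"}
print(place_job(nodes, 2))  # picks two nodes under leaf-a
```

Rail awareness and inference anti-affinity follow the same pattern with different scoring: rails constrain which GPU indices a job may use per node, and anti-affinity inverts the preference to spread replicas across leaves.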
Ready for Best Practices?
The final lesson covers production networking operations, monitoring, and performance tuning.
Next: Best Practices →
Lilly Tech Systems