Network Topology for AI Advanced

The network topology of an AI cluster determines the communication patterns, bandwidth availability, and fault tolerance for distributed training. This lesson covers the three main topology choices for AI clusters—fat-tree, dragonfly, and rail-optimized—with their trade-offs for different training workloads.

Fat-Tree Topology

The most common topology for AI clusters. A fat-tree provides full bisection bandwidth, meaning any half of the cluster can communicate with the other half at full aggregate bandwidth:

  • Structure — Leaf switches connect to servers, spine switches connect leaf switches
  • Bandwidth — Full bisection: every node can communicate at line rate simultaneously
  • Advantages — Predictable performance, easy to scale, well-understood behavior
  • Disadvantages — High switch and cable cost at scale, more cabling complexity
  • Best for — General-purpose AI clusters running diverse workloads
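The leaf/spine structure and full-bisection property above can be sketched with a small sizing calculation. This is an illustrative sketch, not a vendor tool: `fat_tree_size` and its port-splitting rule are a simplified two-level (leaf-spine) model assumed here.

```python
def fat_tree_size(radix: int, num_leaves: int) -> dict:
    """Size a two-level leaf-spine fat-tree with full bisection bandwidth.

    Each leaf switch splits its ports evenly: half face servers (down),
    half face spine switches (up), so uplink capacity equals downlink
    capacity and any half of the servers can talk to the other half at
    full rate.
    """
    down = radix // 2                # server-facing ports per leaf
    up = radix - down                # spine-facing ports per leaf
    # With one link from every leaf to every spine, the spine count
    # equals the number of uplinks per leaf.
    return {
        "leaf_switches": num_leaves,
        "spine_switches": up,
        "server_ports": num_leaves * down,
        "uplinks_per_leaf": up,
    }

# Example: 64-port switches, 32 leaves -> 1024 server ports, 32 spines
print(fat_tree_size(64, 32))
```

The even port split is what makes the topology "fat": aggregate uplink bandwidth never shrinks toward the spine, which is where the high switch count (and cost) comes from.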

Rail-Optimized Topology

Rail-optimized topologies group network connections to match the GPU topology within each node. Each "rail" connects GPUs at the same position across multiple nodes:

  • Structure — GPU 0 of each node connects to Rail Switch 0, GPU 1 to Rail Switch 1, etc.
  • Advantage — NCCL ring all-reduce naturally maps to rails, reducing cross-rail traffic
  • Cost savings — 25-40% fewer switches than full fat-tree for the same node count
  • Trade-off — Less flexible than fat-tree; job placement must respect rail boundaries
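The rail mapping described above is simple enough to state in code. A minimal sketch, assuming 8 GPUs per node; `rail_for` is a hypothetical helper, not part of NCCL or any scheduler API.

```python
def rail_for(node_id: int, gpu_index: int, gpus_per_node: int = 8) -> int:
    """Return the rail switch for a given GPU in a rail-optimized fabric.

    GPU i of every node plugs into rail switch i, so the rail id is just
    the local GPU index, independent of which node the GPU lives on.
    A ring all-reduce between same-index GPUs then stays on one rail
    (a single switch hop) instead of crossing the spine.
    """
    assert 0 <= gpu_index < gpus_per_node
    return gpu_index

# GPU 3 on node 0 and GPU 3 on node 17 share rail switch 3
assert rail_for(0, 3) == rail_for(17, 3) == 3
```

This is also why job placement must respect rail boundaries: a job whose ranks land on mismatched GPU indices across nodes loses the single-hop property and its traffic spills onto the (thinner) cross-rail paths.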

Topology Comparison

  Topology         Bisection BW   Cost      Flexibility   Best For
  Fat-tree         Full           Highest   Maximum       Multi-tenant, diverse workloads
  Rail-optimized   Partial        Medium    Medium        Large training jobs, homogeneous workloads
  Dragonfly        Variable       Lowest    Lower         Very large clusters (10K+ GPUs)

Topology-Aware Job Scheduling

The scheduler must understand the network topology to place distributed training jobs optimally:

  • Locality preference — Place all pods of a training job on nodes connected to the same leaf switch
  • Rail awareness — In rail topologies, ensure jobs use GPUs on the same rails across nodes
  • Anti-affinity for inference — Spread inference replicas across different failure domains

Design Tip: For clusters under 256 GPUs, a full fat-tree topology is usually the best choice. The cost premium is small relative to the GPU investment, and the flexibility to run any workload on any set of nodes simplifies operations significantly.

Ready for Best Practices?

The final lesson covers production networking operations, monitoring, and performance tuning.

Next: Best Practices →