Network Topology for AI Advanced
The network topology of an AI cluster determines the communication patterns, bandwidth availability, and fault tolerance for distributed training. This lesson covers the three main topology choices for AI clusters: fat-tree, rail-optimized, and dragonfly, along with their trade-offs for different training workloads.
Fat-Tree Topology
Fat-tree is the most common topology for AI clusters. It provides full bisection bandwidth, meaning any half of the cluster can communicate with the other half at full aggregate bandwidth:
- Structure — Leaf switches connect to servers; spine switches interconnect the leaf switches
- Bandwidth — Full bisection: every node can communicate at line rate simultaneously
- Advantages — Predictable performance, easy to scale, well-understood behavior
- Disadvantages — High switch and cable cost at scale, more cabling complexity
- Best for — General-purpose AI clusters running diverse workloads
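The full-bisection property above comes down to simple port arithmetic: each leaf dedicates half its ports to servers and half to spine uplinks, so uplink bandwidth matches downlink bandwidth. The sketch below sizes a two-tier leaf-spine fabric under that assumption; the function name and parameters are illustrative, not from any vendor tool.

```python
def leaf_spine_size(ports_per_switch: int, leaves: int) -> dict:
    """Size a non-blocking 2-tier leaf-spine fabric from identical switches.

    Each leaf splits its ports evenly: half face servers, half face spines,
    so aggregate uplink bandwidth equals downlink bandwidth (full bisection).
    """
    if leaves > ports_per_switch:
        raise ValueError("each spine needs one port per leaf")
    down = ports_per_switch // 2      # server-facing ports per leaf
    spines = ports_per_switch - down  # one spine port per leaf uplink
    return {
        "servers": leaves * down,
        "spines": spines,
        "total_switches": leaves + spines,
    }

print(leaf_spine_size(64, leaves=64))
# {'servers': 2048, 'spines': 32, 'total_switches': 96}
```

Note how the switch count grows with scale: 2048 servers already need 96 switches, which is the cost pressure that motivates the rail-optimized design below.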
Rail-Optimized Topology
Rail-optimized topologies group network connections to match the GPU topology within each node. Each "rail" connects GPUs at the same position across multiple nodes:
- Structure — GPU 0 of each node connects to Rail Switch 0, GPU 1 to Rail Switch 1, etc.
- Advantage — NCCL ring all-reduce naturally maps to rails, reducing cross-rail traffic
- Cost savings — 25-40% fewer switches than full fat-tree for the same node count
- Trade-off — Less flexible than fat-tree; job placement must respect rail boundaries
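The rail mapping above can be sketched in a few lines: the rail is simply the local GPU index, so a ring all-reduce that pairs GPU i on one node with GPU i on the next never leaves its rail switch. The `Endpoint` type and function names here are illustrative, not an NCCL API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Endpoint:
    node: int
    gpu: int  # local GPU index, 0..gpus_per_node-1

def rail_for(ep: Endpoint) -> int:
    """In a rail-optimized fabric, the rail is the local GPU index."""
    return ep.gpu

def same_rail(a: Endpoint, b: Endpoint) -> bool:
    """Cross-node traffic stays on a single switch iff both ends share a rail."""
    return rail_for(a) == rail_for(b)

# A ring all-reduce hop from GPU 3 on node 0 to GPU 3 on node 1
# stays entirely on Rail Switch 3:
print(same_rail(Endpoint(node=0, gpu=3), Endpoint(node=1, gpu=3)))  # True
```

This is why the trade-off above matters: a job whose GPUs land on mismatched rails across nodes loses exactly this single-switch property.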
Topology Comparison
| Topology | Bisection BW | Cost | Flexibility | Best For |
|---|---|---|---|---|
| Fat-tree | Full | Highest | Maximum | Multi-tenant, diverse workloads |
| Rail-optimized | Partial | Medium | Medium | Large training jobs, homogeneous workloads |
| Dragonfly | Variable | Lowest | Lower | Very large clusters (10K+ GPUs) |
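The table's decision logic can be encoded as a small lookup to make the trade-offs concrete. The 10K-GPU threshold comes straight from the table; the function and its parameters are illustrative.

```python
def suggest_topology(gpus: int, multi_tenant: bool) -> str:
    """Map the comparison table's 'Best For' column to a recommendation."""
    if gpus >= 10_000:
        return "dragonfly"       # very large clusters: lowest cost at scale
    if multi_tenant:
        return "fat-tree"        # full bisection bandwidth, maximum flexibility
    return "rail-optimized"      # large homogeneous training jobs, lower cost

print(suggest_topology(16_000, multi_tenant=False))  # dragonfly
print(suggest_topology(512, multi_tenant=True))      # fat-tree
```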
Topology-Aware Job Scheduling
The scheduler must understand the network topology to place distributed training jobs optimally:
- Locality preference — Place all pods of a training job on nodes connected to the same leaf switch
- Rail awareness — In rail topologies, ensure jobs use GPUs on the same rails across nodes
- Anti-affinity for inference — Spread inference replicas across different failure domains
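The locality preference above can be sketched as a best-fit placement: group free nodes by leaf switch and try to fit the whole job under one leaf before spilling across leaves. The data model is illustrative; a real scheduler would read topology from node labels (as in Kubernetes) rather than a dict.

```python
from collections import defaultdict
from typing import Optional

def place_job(free_nodes: dict, needed: int) -> Optional[list]:
    """free_nodes maps node name -> leaf switch; return chosen nodes or None.

    Prefers the smallest leaf group that still fits the job, leaving the
    larger groups free for bigger jobs (best fit).
    """
    by_leaf = defaultdict(list)
    for node, leaf in free_nodes.items():
        by_leaf[leaf].append(node)
    candidates = sorted(
        (nodes for nodes in by_leaf.values() if len(nodes) >= needed),
        key=len,
    )
    if candidates:
        return candidates[0][:needed]
    return None  # no single leaf fits; fall back to multi-leaf placement

nodes = {"n1": "leaf-a", "n2": "leaf-a", "n3": "leaf-b", "n4": "leaf-a"}
print(place_job(nodes, 2))  # picks two nodes under leaf-a
```

Rail awareness and inference anti-affinity follow the same pattern with different scoring: rails constrain which GPU indices a job may use per node, and anti-affinity inverts the preference to spread replicas across leaves.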
Ready for Best Practices?
The final lesson covers production networking operations, monitoring, and performance tuning.
Next: Best Practices →
Lilly Tech Systems