Storage & Networking for AI
Storage and networking are often the hidden bottlenecks in AI infrastructure. These 10 questions cover distributed file systems, object storage performance, data loading optimization, high-performance networking, RDMA, and InfiniBand — topics that separate senior AI infrastructure engineers from generalists.
Q1: Compare distributed file systems for AI workloads: Lustre, GPFS, and cloud-native options.
| Feature | Lustre | GPFS (Spectrum Scale) | Cloud (FSx/Filestore/ANF) |
|---|---|---|---|
| Throughput | 1+ TB/s aggregate (scales with OSTs) | 500+ GB/s aggregate | Varies: FSx for Lustre up to 1 TB/s |
| Latency | Sub-millisecond for metadata, low for data | Sub-millisecond, excellent metadata performance | 1–5 ms (network overhead) |
| POSIX | Full POSIX compliance | Full POSIX compliance | Full POSIX (NFS/SMB) or partial (S3) |
| Scalability | Petabytes, thousands of clients | Exabytes, tens of thousands of clients | Petabytes, managed scaling |
| Best for | HPC, large-scale training, checkpointing | Enterprise, mixed workloads, AI + analytics | Cloud-native, managed operations |
| Operations | Complex: requires dedicated storage team | Complex: enterprise licensing, specialized admin | Managed: minimal operational burden |
Why POSIX matters for ML: PyTorch DataLoader, HuggingFace datasets, and most ML frameworks expect POSIX file system semantics (random access, directory listing, file locking). Object stores (S3) require adapter libraries (s3fs, fsspec) and do not support random access efficiently. For training workloads that read millions of small files (ImageNet: 14M images), POSIX file system performance is 5–10x better than S3.
Cloud recommendation: AWS FSx for Lustre for training workloads (auto-imports from S3, high throughput). GCP Filestore for moderate workloads. For checkpoint storage, use S3/GCS directly — checkpoints are large sequential writes where object storage excels.
Q2: How do you optimize data loading to prevent GPU starvation during training?
Answer: GPU starvation occurs when GPUs finish processing a batch before the next batch is loaded into GPU memory. The symptom: GPU utilization drops to 0% periodically (visible in nvidia-smi dmon as utilization oscillating between 0% and 100%).
Diagnosis:
- Profile with PyTorch Profiler or Nsight Systems. Look for gaps between GPU kernel executions — these are data loading stalls.
- Monitor `data_loading_time / total_step_time`. If >10%, data loading is a bottleneck.
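A minimal way to compute this ratio from profiler output. The timing lists and the helper name `data_loading_fraction` are illustrative, not from a real trace:

```python
# Hypothetical per-step timings (seconds) collected from a profiler run.
step_times = [0.50, 0.52, 0.51, 0.95, 0.50]   # total wall-clock time per training step
data_times = [0.02, 0.03, 0.02, 0.47, 0.02]   # time spent waiting on the DataLoader

def data_loading_fraction(data_times, step_times):
    """Fraction of wall-clock step time spent waiting for data."""
    return sum(data_times) / sum(step_times)

frac = data_loading_fraction(data_times, step_times)
print(f"data loading fraction: {frac:.1%}")
if frac > 0.10:
    print("data loading is a bottleneck")
```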
Optimization strategies (in order of effectiveness):
- Multi-worker data loading: PyTorch DataLoader with `num_workers=4–16`. Each worker runs in a separate process, prefetching batches in parallel with GPU computation. Optimal `num_workers` depends on CPU cores, I/O throughput, and data processing complexity.
- Prefetching: `prefetch_factor=2–4` in DataLoader. Loads the next N batches while the current batch is being processed on GPU. Also use `pin_memory=True` for faster CPU-to-GPU transfers via DMA.
- Optimize data format: Convert datasets to sequential-read-friendly formats: WebDataset (tar files of shards), TFRecord, or Mosaic StreamingDataset. These eliminate random I/O from reading millions of small files.
- Local NVMe caching: Copy dataset to instance local NVMe SSD before training. NVMe delivers 3–7 GB/s read throughput vs 500 MB/s for network-attached storage. Use the first epoch to warm the cache.
- On-GPU preprocessing: Move data augmentation (resize, crop, normalize) to GPU using NVIDIA DALI. Offloads work from CPU data workers and avoids CPU-GPU transfer of unprocessed data.
- Streaming datasets: For datasets too large to cache locally, use streaming data loaders (Mosaic StreamingDataset, WebDataset) that read sequentially from object storage with prefetching.
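As a starting point, the worker and prefetch settings above can be derived from node shape. This is a heuristic sketch, not a tuning rule; the helper `suggest_loader_settings` and its core-splitting logic are assumptions to be validated empirically:

```python
import os

def suggest_loader_settings(gpus_per_node: int, reserve_cores: int = 2) -> dict:
    """Heuristic starting point for PyTorch DataLoader kwargs (tune empirically).

    Splits the node's CPU cores across its GPUs, reserving a few cores for
    the main training processes, and caps workers at 16 per the range above.
    """
    cores = os.cpu_count() or 8
    workers = max(1, (cores - reserve_cores) // gpus_per_node)
    return {
        "num_workers": min(workers, 16),   # 4-16 is the typical sweet spot
        "prefetch_factor": 2,              # batches prefetched per worker
        "pin_memory": True,                # page-locked buffers for DMA transfers
        "persistent_workers": True,        # avoid re-forking workers every epoch
    }

print(suggest_loader_settings(gpus_per_node=8))
```

The returned dict can be passed directly as `DataLoader(**settings)` keyword arguments.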
Q3: What is the role of object storage (S3/GCS) in AI infrastructure?
Answer: Object storage is the backbone of AI data infrastructure, used for virtually everything except hot data access during training:
Use cases:
- Training data lake: Raw and processed training datasets. Versioned with tools like DVC, LakeFS, or S3 versioning. Partitioned by date, source, or processing stage.
- Model artifacts: Trained model weights, ONNX exports, TensorRT engines. Versioned in model registries (MLflow, Weights & Biases).
- Checkpoints: Training checkpoints saved every 15–30 minutes. Large (tens to hundreds of GB each). Lifecycle policies to delete old checkpoints automatically.
- Logs and metrics: Training logs, profiling data, experiment metadata. Low-cost long-term storage.
Performance characteristics:
- Throughput: S3 delivers 100+ Gbps aggregate throughput from a single bucket with sufficient parallelism. Single-stream: ~100 MB/s per request.
- Latency: 50–100 ms first-byte latency. Not suitable for random access patterns during training.
- Cost: $0.023/GB/month (S3 Standard). 100 TB = $2,300/month. S3 Intelligent-Tiering can reduce costs 40–60% for infrequently accessed data.
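The cost figures above reduce to a one-line calculation. A sketch, assuming decimal TB (1 TB = 1,000 GB) and the S3 Standard price quoted above; the helper name `s3_monthly_cost` is made up for illustration:

```python
def s3_monthly_cost(tb: float, price_per_gb: float = 0.023) -> float:
    """Monthly S3 Standard storage cost in USD (decimal TB, 1 TB = 1000 GB)."""
    return tb * 1000 * price_per_gb

print(f"100 TB: ${s3_monthly_cost(100):,.0f}/month")
# Illustrative midpoint of the 40-60% Intelligent-Tiering estimate above:
print(f"potential tiering savings: ${s3_monthly_cost(100) * 0.5:,.0f}/month")
```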
Best practices:
- Use S3 Express One Zone (single-digit ms latency) for hot data that needs faster access than standard S3
- Convert small files to large sequential files (WebDataset tar shards of 256 MB–1 GB) for efficient reads
- Use multipart upload for checkpoints (parallel upload of 100 MB chunks)
- Enable S3 Transfer Acceleration for cross-region uploads
Q4: What is InfiniBand and why is it essential for large-scale AI training?
Answer: InfiniBand (IB) is a high-bandwidth, low-latency networking technology designed for HPC and AI workloads. It is the standard interconnect for multi-node GPU training.
Bandwidth generations:
- HDR (High Data Rate): 200 Gbps per port (25 GB/s)
- NDR (Next Data Rate): 400 Gbps per port (50 GB/s)
- XDR (Extended Data Rate): 800 Gbps per port (100 GB/s) — available 2025
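The GB/s figures in parentheses are straight unit conversions (8 bits per byte, ignoring protocol overhead), which a small helper makes explicit; `gbps_to_gbytes` is an illustrative name:

```python
def gbps_to_gbytes(gbps: float) -> float:
    """Convert link rate in Gbit/s to GB/s (8 bits per byte; ignores encoding overhead)."""
    return gbps / 8

for name, gbps in [("HDR", 200), ("NDR", 400), ("XDR", 800)]:
    print(f"{name}: {gbps} Gbps = {gbps_to_gbytes(gbps):.0f} GB/s per port")
```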
Why InfiniBand over Ethernet for AI:
- Latency: IB: 0.5–1 microsecond. RoCE (RDMA over Converged Ethernet): 1–2 microseconds. Standard TCP Ethernet: 50–100 microseconds. Lower latency = less GPU idle time waiting for communication.
- RDMA (Remote Direct Memory Access): Data transfers bypass the CPU and OS kernel entirely. GPU memory on one node can be read/written directly by a remote GPU. Eliminates CPU overhead and copy latency.
- Congestion control: IB has hardware-level credit-based flow control. No packet drops due to congestion. Ethernet relies on software-level congestion control (ECN, PFC) which is harder to tune and less reliable.
- Consistent performance: IB provides guaranteed bandwidth per connection. Ethernet in shared data centers can have variable performance due to congestion from other tenants.
Cloud availability: AWS p5 instances have EFA (Elastic Fabric Adapter) — 3,200 Gbps with RDMA support. Azure ND H100 v5 has InfiniBand NDR. GCP uses GPUDirect-TCPX (optimized Ethernet, not IB). For the largest training runs, bare-metal servers with InfiniBand are preferred over cloud for consistent network performance.
Q5: Explain RDMA and GPUDirect. Why do they matter for distributed training?
Answer: RDMA and GPUDirect eliminate data copy overhead in distributed training communication paths:
Without RDMA/GPUDirect (traditional path):
- GPU memory → CPU memory (PCIe DMA)
- CPU memory → network stack (kernel processing, TCP/IP headers)
- Network → remote CPU memory (kernel processing)
- Remote CPU memory → remote GPU memory (PCIe DMA)
4 copies, 2 kernel transitions, high CPU overhead, high latency.
With GPUDirect RDMA:
- GPU memory → network adapter → remote GPU memory (single transfer, zero CPU involvement)
1 transfer, 0 kernel transitions, 0 CPU overhead, minimal latency.
GPUDirect technologies:
- GPUDirect P2P: Direct GPU-to-GPU memory access within a node via NVLink or PCIe. Used for intra-node AllReduce.
- GPUDirect RDMA (GDR): Direct GPU-to-network-adapter transfer. The NIC reads/writes GPU memory directly without CPU involvement. Used for inter-node AllReduce.
- GPUDirect Storage: Direct data path between GPU memory and NVMe storage, bypassing CPU. Used for fast checkpoint loading.
Impact on training: For a 7B model with FP16 gradients (~14 GB AllReduce), GPUDirect RDMA over NDR InfiniBand transfers this in ~280 ms (14 GB / 50 GB/s). Without GPUDirect, the same transfer takes 800+ ms due to CPU memory copies and kernel overhead. At 1,000 training steps, this saves 500+ seconds — significant for multi-day training runs.
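The back-of-envelope arithmetic above can be sketched as follows. The ~3x slowdown factor for the non-GPUDirect path is an assumption derived from the 280 ms vs 800+ ms figures in the text, not a measured constant:

```python
def allreduce_seconds(payload_gb: float, link_gb_per_s: float, overhead_factor: float = 1.0) -> float:
    """Idealized AllReduce transfer time: payload / link bandwidth, scaled by overhead.

    overhead_factor > 1 models CPU staging copies and kernel transitions on
    the non-GPUDirect path (the ~3x value below is an assumed illustration).
    """
    return payload_gb / link_gb_per_s * overhead_factor

gdr = allreduce_seconds(14, 50)           # GPUDirect RDMA over NDR IB: ~0.28 s
no_gdr = allreduce_seconds(14, 50, 3.0)   # staged through CPU memory: ~0.84 s
print(f"with GPUDirect:    {gdr * 1e3:.0f} ms")
print(f"without GPUDirect: {no_gdr * 1e3:.0f} ms")
print(f"saved over 1,000 steps: {(no_gdr - gdr) * 1000:.0f} s")
```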
Q6: How do you design the network topology for a large GPU training cluster?
Answer: Network topology directly impacts distributed training performance. The goal is providing enough bandwidth so that communication does not bottleneck GPU compute.
Common topologies:
- Fat-tree (most common): Hierarchical topology with leaf switches connecting servers and spine switches connecting leaf switches. Full bisection bandwidth means any server can communicate with any other at full line rate. Standard for clusters up to ~2,000 nodes.
- Rail-optimized (NVIDIA DGX SuperPOD): Each GPU in a node connects to a dedicated "rail" switch. GPU 0 across all nodes connects to switch 0, GPU 1 to switch 1, etc. Optimized for AllReduce patterns where each GPU communicates with the same-rank GPU on other nodes. Reduces switch hops and improves collective operation performance.
- Dragonfly: Groups of switches form "routers" connected in a complete graph. High bandwidth within groups, efficient long-distance communication. Used in large HPC systems.
Bandwidth planning:
- Rule of thumb: provision roughly 1 byte/s of network bandwidth per 10,000 FLOP/s of GPU compute to avoid being network-bound. H100 at 989 TFLOPS FP16 → ~100 GB/s per GPU of network bandwidth.
- DGX H100: 8x 400G InfiniBand ports = 3,200 Gbps (400 GB/s). Divided by 8 GPUs = 50 GB/s per GPU. Sufficient for most workloads but can bottleneck on large AllReduce operations.
- Oversubscription: In fat-tree, spine-to-leaf bandwidth can be oversubscribed (e.g., 2:1). For AI training, use 1:1 (non-blocking) at least within training partitions. Oversubscribe only between partitions that do not communicate frequently.
Dual-rail networking: Use two independent InfiniBand fabrics for redundancy and doubled bandwidth. If one rail fails, training continues on the other at half bandwidth instead of failing completely.
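The per-GPU bandwidth arithmetic from the DGX H100 example works out as below; the helper `per_gpu_bandwidth_gbs` is an illustrative name:

```python
def per_gpu_bandwidth_gbs(ports: int, gbps_per_port: int, gpus: int) -> float:
    """Aggregate node NIC bandwidth divided evenly across the node's GPUs, in GB/s."""
    return ports * gbps_per_port / 8 / gpus

# DGX H100: 8x 400G InfiniBand ports shared by 8 GPUs
print(per_gpu_bandwidth_gbs(ports=8, gbps_per_port=400, gpus=8))  # GB/s per GPU
```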
Q7: How do you handle checkpoint storage for large model training?
Answer: Checkpointing large models creates significant storage challenges:
Size estimation: A 70B parameter model checkpoint includes model weights (140 GB in FP16), optimizer states (280 GB for Adam in FP32), and RNG/scheduler state (~1 GB). Total: ~420 GB per checkpoint. Saving every 30 minutes over a 30-day training run = 1,440 checkpoints = 600+ TB.
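Reproducing the estimate above in code, taking the text's per-checkpoint figure as given (the helper name and decimal-TB convention are assumptions):

```python
def checkpoint_storage_tb(ckpt_gb: float, interval_min: int, days: int) -> tuple[int, float]:
    """Number of checkpoints and total storage (decimal TB) for a training run."""
    count = (24 * 60 // interval_min) * days
    return count, count * ckpt_gb / 1000

# Figures from the text: ~420 GB per checkpoint, every 30 minutes, 30-day run
count, tb = checkpoint_storage_tb(ckpt_gb=420, interval_min=30, days=30)
print(f"{count} checkpoints, ~{tb:.0f} TB without lifecycle policies")
```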
Checkpoint storage tiers:
- Hot tier (NVMe/RAM): Last 2–3 checkpoints. Available for immediate resume. Local NVMe on training nodes or RAM-disk.
- Warm tier (parallel file system): Last 24 hours of checkpoints. Lustre or FSx for Lustre. Fast enough for resume within minutes.
- Cold tier (object storage): All checkpoints beyond 24 hours. S3/GCS with lifecycle policies. Archive to S3 Glacier after 7 days for 90% cost reduction.
Optimization techniques:
- Sharded checkpointing: Each GPU saves its own shard independently. Avoids gathering all parameters to a single node (which requires 420 GB of host memory). Enables parallel writes across all nodes.
- Asynchronous checkpointing: Copy state to pinned CPU memory, then write to storage in background threads while training continues. Reduces checkpoint overhead from minutes to seconds of GPU idle time.
- Incremental checkpointing: Only save parameters that changed since the last checkpoint. Reduces checkpoint size by 50–80% for fine-tuning workloads where most layers are frozen.
- Compression: Checkpoint data compresses 2–3x with zstd. Trade-off: compression CPU overhead vs storage/bandwidth savings.
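The asynchronous checkpointing pattern above can be sketched in pure Python. This is a minimal stand-in, not a framework API: the dict copy stands in for the fast GPU-to-pinned-CPU-memory snapshot, and `pickle` stands in for the real serialized checkpoint write:

```python
import pickle
import tempfile
import threading
from pathlib import Path

def async_checkpoint(state: dict, path: Path) -> threading.Thread:
    """Minimal sketch of asynchronous checkpointing.

    Snapshot the state synchronously (in PyTorch this would be the fast
    GPU -> pinned CPU memory copy), then persist in a background thread so
    training continues while the slow storage write happens.
    """
    snapshot = dict(state)  # cheap in-memory copy stands in for the pinned-memory copy

    def write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)  # slow storage write runs off the critical path

    t = threading.Thread(target=write, daemon=True)
    t.start()
    return t  # join() before the next checkpoint to bound in-flight writes

state = {"step": 1000, "weights": [0.1, 0.2, 0.3]}
ckpt = Path(tempfile.gettempdir()) / "ckpt_step1000.pkl"
async_checkpoint(state, ckpt).join()
print(pickle.loads(ckpt.read_bytes())["step"])
```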
Q8: What network bandwidth is required for different distributed training strategies?
Answer: Different parallelism strategies have very different network requirements:
| Strategy | Data Transferred per Step | Pattern | Minimum Network |
|---|---|---|---|
| Data Parallelism | 2 x model_size (AllReduce) | All-to-all collective | 400G IB (NDR) for models >1B params |
| Tensor Parallelism | 2 x activation_size per layer | AllReduce within TP group, every layer | NVLink required (900 GB/s). Not feasible over network. |
| Pipeline Parallelism | activation_size per micro-batch | Point-to-point between stages | 100G Ethernet sufficient for most models |
| ZeRO-3 / FSDP | 3 x model_size (AllGather fwd + AllGather bwd + ReduceScatter) | AllGather + ReduceScatter | 400G IB (NDR) for models >1B params |
| Expert Parallelism (MoE) | tokens_per_expert x hidden_dim (All-to-All) | All-to-all permutation | 400G+ IB, very sensitive to network latency |
Key insight: Tensor parallelism communicates at every layer (40+ times per forward pass for a large transformer) with small, latency-sensitive messages. This is why it must use NVLink within a node. Pipeline parallelism communicates once per micro-batch per stage boundary with larger, bandwidth-sensitive messages. This is why it can work across nodes on InfiniBand. Data parallelism communicates once per training step with very large messages (full gradient size). It needs high bandwidth but tolerates higher latency.
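The per-step volumes in the table can be computed directly. A sketch using the large-N ring-AllReduce approximation (exact traffic is 2·(N−1)/N times the buffer size, which the table rounds to 2x); the helper names are illustrative:

```python
def dp_allreduce_gb(params: float, bytes_per_grad: int = 2) -> float:
    """Approximate data moved per GPU per step by data-parallel ring AllReduce, in GB.

    Uses the large-N approximation 2 * (N-1)/N ~= 2x the FP16 gradient buffer.
    """
    return 2 * params * bytes_per_grad / 1e9

def fsdp_traffic_gb(params: float, bytes_per_param: int = 2) -> float:
    """ZeRO-3/FSDP: AllGather (fwd) + AllGather (bwd) + ReduceScatter = ~3x model size."""
    return 3 * params * bytes_per_param / 1e9

print(f"DP,   7B FP16: {dp_allreduce_gb(7e9):.0f} GB/step")
print(f"FSDP, 7B FP16: {fsdp_traffic_gb(7e9):.0f} GB/step")
```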
Q9: How do you debug network performance issues in distributed training?
Answer: Network issues manifest as training slowdowns, NCCL timeouts, or inconsistent step times. Systematic debugging approach:
Step 1 — Identify the symptom:
- NCCL timeout: One or more GPUs are not responding to collective operations. Usually hardware fault or network partition.
- Slow training steps: Steps take 2–5x longer than expected. Communication is the bottleneck.
- Inconsistent step times: Most steps are fast, but occasional steps are 10x slower. Indicates network congestion or straggler node.
Step 2 — Diagnostic tools:
- NCCL debug logging: `NCCL_DEBUG=INFO` shows topology detection, ring construction, and communication timings.
- nccl-tests: Run `all_reduce_perf` to measure actual vs theoretical bandwidth between nodes. Compare against expected bandwidth (e.g., 400 Gbps NDR IB should deliver ~48 GB/s per direction).
- ibstat / ibdiagnet: Check InfiniBand link state, speed, and error counters. Look for symbol errors, link downed events, or reduced speed negotiation.
- ib_write_bw / ib_read_bw: Low-level RDMA bandwidth test between pairs of nodes. Isolates network from NCCL overhead.
- nvidia-smi nvlink -s: Check NVLink error counters and bandwidth utilization within a node.
Step 3 — Common causes and fixes:
- Faulty InfiniBand cable: Degraded speed (100G instead of 400G) or high error rate. Replace cable.
- Switch congestion: Oversubscribed network. Check switch port counters for drops. Add spine switches or reduce oversubscription.
- Incorrect NCCL configuration: NCCL using TCP instead of IB. Check that `NCCL_IB_DISABLE` is not set. Verify GPUDirect RDMA is enabled.
- NUMA misalignment: GPU and InfiniBand adapter on different NUMA nodes. Fix: pin GPU and NIC to the same NUMA node via device plugin configuration.
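The NCCL settings above translate into job-script environment variables. `NCCL_DEBUG`, `NCCL_IB_DISABLE`, `NCCL_IB_HCA`, and `NCCL_NET_GDR_LEVEL` are real NCCL variables, but the device prefix `mlx5` and the `SYS` level are illustrative values that must be matched to the actual fabric (verify adapter names with `ibstat` first):

```shell
# Keep NCCL on InfiniBand with GPUDirect RDMA (values are illustrative).
export NCCL_DEBUG=INFO          # log topology detection and transport selection
unset NCCL_IB_DISABLE           # must NOT be 1, or NCCL falls back to TCP sockets
export NCCL_IB_HCA=mlx5         # restrict NCCL to the Mellanox IB adapters
export NCCL_NET_GDR_LEVEL=SYS   # permit GPUDirect RDMA across the system topology
```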
Q10: How do you design data pipelines for petabyte-scale training datasets?
Answer: Petabyte-scale datasets (Common Crawl, LAION, video datasets) require specialized data pipeline design:
Data format:
- Convert raw data to sharded sequential formats: WebDataset (tar shards), Parquet, or TFRecord
- Each shard: 256 MB–1 GB, containing hundreds to thousands of samples
- Sharding enables parallel reading from multiple workers and random shard-level shuffling without random file access
Storage architecture:
- Source of truth: Object storage (S3/GCS). Cheap, durable, scalable. 1 PB on S3 = ~$23,000/month.
- Training cache: High-performance parallel file system (FSx for Lustre). Auto-import from S3, provide POSIX interface to training jobs. Size for 1–2 epochs of active data.
- Node-local cache: NVMe SSDs on training nodes. Cache hot shards for the current epoch. Fastest access but limited capacity.
Shuffling at scale:
- Full random shuffling of 1 PB is impractical (requires random access to all data)
- Two-level shuffling: shuffle shard order (inter-shard), then shuffle samples within each shard (intra-shard)
- This provides sufficient randomness for training convergence with sequential read patterns
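Two-level shuffling can be sketched in a few lines; the function name `two_level_shuffle` and the toy integer shards are illustrative:

```python
import random

def two_level_shuffle(shards: list[list[int]], seed: int = 0) -> list[int]:
    """Two-level shuffle: permute shard order, then permute samples inside each shard.

    Gives near-random sample order while every read stays sequential within
    a shard -- no random I/O against object storage.
    """
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)               # inter-shard shuffle (cheap: metadata only)
    out = []
    for i in order:
        samples = list(shards[i])
        rng.shuffle(samples)         # intra-shard shuffle (fits in memory)
        out.extend(samples)
    return out

shards = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]   # toy 3-shard dataset
print(two_level_shuffle(shards, seed=42))
```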
Data preprocessing at scale:
- Run preprocessing (tokenization, deduplication, filtering) as a batch job on CPU instances before training starts
- Use Apache Spark, Ray Data, or Dask for distributed preprocessing
- Tokenize text data offline and store tokenized shards — do not tokenize during training
- For image data: resize and normalize offline, store as binary tensors to eliminate per-sample CPU processing during training
Lilly Tech Systems