NFS/Lustre for AI

Shared file systems provide the common data layer that allows multiple GPU nodes to access the same training datasets, model checkpoints, and artifacts. This lesson compares NFS, Lustre, and GPFS for AI workloads, covering deployment, performance tuning, and when to use each option.

File System Comparison for AI

| Feature | NFS | Lustre | GPFS/Spectrum Scale |
| --- | --- | --- | --- |
| Aggregate throughput | 1-10 GB/s | 100+ GB/s | 100+ GB/s |
| Complexity | Simple | Moderate | High |
| Cost | Low | Medium | High (licensed) |
| Scale | Small clusters | Large HPC/AI clusters | Enterprise-wide |
| Best for | Dev/staging, <16 nodes | Training clusters, >16 nodes | Multi-workload enterprise |

Lustre for AI Training

Lustre is the most common file system for large AI training clusters because it scales linearly with storage targets:

  • OSTs (Object Storage Targets) — Stripe data across multiple OSTs for parallel read throughput
  • MDTs (Metadata Targets) — Handle file metadata operations; critical for workloads with many small files
  • Stripe configuration — Set stripe count to match the number of readers for maximum throughput
  • DNE (Distributed Namespace) — Distribute metadata across multiple MDTs for workloads with millions of files
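The stripe and DNE settings above are applied with Lustre's `lfs` client utility. A sketch of the common commands, assuming a hypothetical client mount at `/lustre` (directory names are illustrative):

```shell
# Stripe new files in this directory across 8 OSTs with a 4 MiB stripe size,
# so 8 parallel readers can each pull a different stripe of a large file.
lfs setstripe -c 8 -S 4M /lustre/datasets/imagenet

# Inspect the layout that new files in the directory will inherit.
lfs getstripe /lustre/datasets/imagenet

# For a tree with millions of small files, spread its metadata
# across 2 MDTs using DNE striped directories.
lfs setdirstripe -c 2 /lustre/datasets/shards
```

Stripe layouts are inherited at file creation time, so set them on the directory before writing the dataset; restriping existing files requires copying them.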

NFS for Smaller AI Deployments

NFS v4.1+ with parallel NFS (pNFS) is a good choice for smaller AI deployments:

  • Easy setup — Standard Linux packages, no specialized hardware
  • Good enough — For clusters with 1-16 GPU nodes, NFS provides sufficient throughput
  • Cloud-managed — AWS EFS, GCP Filestore, Azure Files provide managed NFS
  • Limitations — Single server bottleneck; metadata operations can become slow with millions of files
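Client-side mount options matter for squeezing throughput out of a single NFS server. A sketch, assuming a hypothetical server `nfs-server` exporting `/export/datasets`:

```shell
# NFS v4.1 mount tuned for large sequential dataset reads:
# nconnect=8 opens multiple TCP connections to the one server,
# rsize/wsize request 1 MiB transfers per RPC.
sudo mount -t nfs4 \
    -o vers=4.1,nconnect=8,rsize=1048576,wsize=1048576 \
    nfs-server:/export/datasets /mnt/datasets
```

`nconnect` requires a reasonably recent Linux kernel (5.3+), and the server may negotiate `rsize`/`wsize` down; check the effective values with `nfsstat -m`.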

Kubernetes Integration

Both NFS and Lustre can be exposed to Kubernetes pods via CSI drivers and PersistentVolumes. Use ReadWriteMany access mode to allow multiple training pods to read the same dataset simultaneously.
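As a concrete illustration, a minimal PersistentVolumeClaim using an NFS CSI storage class (the class name `nfs-csi` and sizes here are assumptions; use whatever your cluster's CSI driver registers):

```yaml
# Hypothetical PVC: every training pod that mounts this claim
# sees the same dataset, because the access mode is ReadWriteMany.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-csi   # e.g. from csi-driver-nfs; cluster-specific
  resources:
    requests:
      storage: 1Ti
```

Pods then reference `training-data` as a volume; a Lustre CSI driver is wired up the same way, with only the storage class changing.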

Performance Tip: For AI training with many small files (e.g., image datasets), convert data to large sequential formats (TFRecord, WebDataset, Parquet) before storing on Lustre. This dramatically reduces metadata operations and increases throughput.
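The conversion step can be as simple as packing files into sequential tar shards, the layout WebDataset reads. A minimal sketch using only the Python standard library (function and directory names are illustrative, not from any particular tool):

```python
import os
import tarfile

def pack_shards(src_dir, out_dir, files_per_shard=1000):
    """Pack many small files into sequential tar shards (WebDataset-style).

    Reading one tar per `files_per_shard` files replaces millions of
    per-file metadata operations with a few large sequential reads,
    which is exactly what Lustre OST striping is good at.
    """
    os.makedirs(out_dir, exist_ok=True)
    names = sorted(os.listdir(src_dir))
    shards = []
    for i in range(0, len(names), files_per_shard):
        shard_path = os.path.join(
            out_dir, f"shard-{i // files_per_shard:06d}.tar"
        )
        with tarfile.open(shard_path, "w") as tar:
            for name in names[i:i + files_per_shard]:
                tar.add(os.path.join(src_dir, name), arcname=name)
        shards.append(shard_path)
    return shards
```

In practice you would also shuffle the file list before sharding so each shard is a random sample of the dataset, and pick a shard size (commonly hundreds of MB to a few GB) large enough for sequential throughput.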

Ready to Learn Cache Strategies?

The next lesson covers caching techniques to maximize data loading performance.

Next: Cache Strategies →