NFS/Lustre for AI

Shared file systems provide the common data layer that allows multiple GPU nodes to access the same training datasets, model checkpoints, and artifacts. This lesson compares NFS, Lustre, and GPFS for AI workloads, covering deployment, performance tuning, and when to use each option.

File System Comparison for AI

| Feature | NFS | Lustre | GPFS/Spectrum Scale |
| --- | --- | --- | --- |
| Aggregate throughput | 1-10 GB/s | 100+ GB/s | 100+ GB/s |
| Complexity | Simple | Moderate | High |
| Cost | Low | Medium | High (licensed) |
| Scale | Small clusters | Large HPC/AI clusters | Enterprise-wide |
| Best for | Dev/staging, <16 nodes | Training clusters, >16 nodes | Multi-workload enterprise |

Lustre for AI Training

Lustre is the most common file system for large AI training clusters because it scales linearly with storage targets:

  • OSTs (Object Storage Targets) — Stripe data across multiple OSTs for parallel read throughput
  • MDTs (Metadata Targets) — Handle file metadata operations; critical for workloads with many small files
  • Stripe configuration — Set stripe count to match the number of readers for maximum throughput
  • DNE (Distributed Namespace) — Distribute metadata across multiple MDTs for workloads with millions of files
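The stripe and DNE settings above are applied with Lustre's `lfs` client utility. A sketch of the common commands, assuming a hypothetical client mount at `/lustre` (directory names are illustrative):

```shell
# Stripe new files in this directory across 8 OSTs with a 4 MiB stripe size,
# so 8 parallel readers can each pull a different stripe of a large file.
lfs setstripe -c 8 -S 4M /lustre/datasets/imagenet

# Inspect the layout that new files in the directory will inherit.
lfs getstripe /lustre/datasets/imagenet

# For a tree with millions of small files, spread its metadata
# across 2 MDTs using DNE striped directories.
lfs setdirstripe -c 2 /lustre/datasets/shards
```

Stripe layouts are inherited at file creation time, so set them on the directory before writing the dataset; restriping existing files requires copying them.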

NFS for Smaller AI Deployments

NFS v4.1+ with parallel NFS (pNFS) is a good choice for smaller AI deployments:

  • Easy setup — Standard Linux packages, no specialized hardware
  • Good enough — For clusters with 1-16 GPU nodes, NFS provides sufficient throughput
  • Cloud-managed — AWS EFS, GCP Filestore, Azure Files provide managed NFS
  • Limitations — Single server bottleneck; metadata operations can become slow with millions of files
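Client-side mount options matter for squeezing throughput out of a single NFS server. A sketch, assuming a hypothetical server `nfs-server` exporting `/export/datasets`:

```shell
# NFS v4.1 mount tuned for large sequential dataset reads:
# nconnect=8 opens multiple TCP connections to the one server,
# rsize/wsize request 1 MiB transfers per RPC.
sudo mount -t nfs4 \
    -o vers=4.1,nconnect=8,rsize=1048576,wsize=1048576 \
    nfs-server:/export/datasets /mnt/datasets
```

`nconnect` requires a reasonably recent Linux kernel (5.3+), and the server may negotiate `rsize`/`wsize` down; check the effective values with `nfsstat -m`.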

Kubernetes Integration

Both NFS and Lustre can be exposed to Kubernetes pods via CSI drivers and PersistentVolumes. Use ReadWriteMany access mode to allow multiple training pods to read the same dataset simultaneously.
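As a concrete illustration, a minimal PersistentVolumeClaim using an NFS CSI storage class (the class name `nfs-csi` and sizes here are assumptions; use whatever your cluster's CSI driver registers):

```yaml
# Hypothetical PVC: every training pod that mounts this claim
# sees the same dataset, because the access mode is ReadWriteMany.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-csi   # e.g. from csi-driver-nfs; cluster-specific
  resources:
    requests:
      storage: 1Ti
```

Pods then reference `training-data` as a volume; a Lustre CSI driver is wired up the same way, with only the storage class changing.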

Performance Tip: For AI training with many small files (e.g., image datasets), convert data to large sequential formats (TFRecord, WebDataset, Parquet) before storing on Lustre. This dramatically reduces metadata operations and increases throughput.
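The conversion step can be as simple as packing files into sequential tar shards, the layout WebDataset reads. A minimal sketch using only the Python standard library (function and directory names are illustrative, not from any particular tool):

```python
import os
import tarfile

def pack_shards(src_dir, out_dir, files_per_shard=1000):
    """Pack many small files into sequential tar shards (WebDataset-style).

    Reading one tar per `files_per_shard` files replaces millions of
    per-file metadata operations with a few large sequential reads,
    which is exactly what Lustre OST striping is good at.
    """
    os.makedirs(out_dir, exist_ok=True)
    names = sorted(os.listdir(src_dir))
    shards = []
    for i in range(0, len(names), files_per_shard):
        shard_path = os.path.join(
            out_dir, f"shard-{i // files_per_shard:06d}.tar"
        )
        with tarfile.open(shard_path, "w") as tar:
            for name in names[i:i + files_per_shard]:
                tar.add(os.path.join(src_dir, name), arcname=name)
        shards.append(shard_path)
    return shards
```

In practice you would also shuffle the file list before sharding so each shard is a random sample of the dataset, and pick a shard size (commonly hundreds of MB to a few GB) large enough for sequential throughput.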

Ready to Learn Cache Strategies?

The next lesson covers caching techniques to maximize data loading performance.

Next: Cache Strategies →