# NFS/Lustre for AI (Intermediate)
Shared file systems provide the common data layer that allows multiple GPU nodes to access the same training datasets, model checkpoints, and artifacts. This lesson compares NFS, Lustre, and GPFS for AI workloads, covering deployment, performance tuning, and when to use each option.
## File System Comparison for AI
| Feature | NFS | Lustre | GPFS/Spectrum Scale |
|---|---|---|---|
| Aggregate throughput | 1-10 GB/s | 100+ GB/s | 100+ GB/s |
| Complexity | Simple | Moderate | High |
| Cost | Low | Medium | High (licensed) |
| Scale | Small clusters | Large HPC/AI clusters | Enterprise-wide |
| Best for | Dev/staging, <16 nodes | Training clusters, >16 nodes | Multi-workload enterprise |
## Lustre for AI Training
Lustre is the most common parallel file system for large AI training clusters because aggregate throughput scales nearly linearly as storage targets are added:
- OSTs (Object Storage Targets) — Stripe data across multiple OSTs for parallel read throughput
- MDTs (Metadata Targets) — Handle file metadata operations; critical for workloads with many small files
- Stripe configuration — Set stripe count to match the number of readers for maximum throughput
- DNE (Distributed Namespace) — Distribute metadata across multiple MDTs for workloads with millions of files
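The striping and DNE settings above are applied with the `lfs` client tool. A minimal sketch, assuming a Lustre mount at `/lustre` (the directory names and counts below are illustrative and should be adjusted to your OST/MDT layout):

```shell
# Stripe new files in a dataset directory across 8 OSTs with a
# 4 MiB stripe size, so parallel readers hit different OSTs.
lfs setstripe -c 8 -S 4M /lustre/datasets/imagenet

# Inspect the layout that new files in the directory will inherit.
lfs getstripe /lustre/datasets/imagenet

# DNE: create a directory whose metadata is striped across 2 MDTs
# (requires a file system configured with multiple MDTs).
lfs mkdir -c 2 /lustre/checkpoints
```

A common rule of thumb is to raise the stripe count for large, sequentially read files (dataset shards, checkpoints) and leave small files at a stripe count of 1 to avoid unnecessary OST round trips.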
## NFS for Smaller AI Deployments
NFS v4.1+ with parallel NFS (pNFS) is a good choice for smaller AI deployments:
- Easy setup — Standard Linux packages, no specialized hardware
- Good enough — For clusters with 1-16 GPU nodes, NFS provides sufficient throughput
- Cloud-managed — AWS EFS, GCP Filestore, Azure Files provide managed NFS
- Limitations — Single server bottleneck; metadata operations can become slow with millions of files
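Much of NFS's practical throughput on a GPU node comes from mount options. A hedged sketch of a tuned client mount, assuming a Linux client with kernel 5.3+ (for `nconnect`) and an illustrative server name and export path:

```shell
# Mount an NFS v4.1 export with 1 MiB read/write sizes and 16 TCP
# connections to the server, which helps a single client saturate
# a fast link. Server and paths are placeholders.
sudo mount -t nfs4 \
    -o vers=4.1,nconnect=16,rsize=1048576,wsize=1048576 \
    nfs-server.example.com:/export/datasets /mnt/datasets

# Confirm the options the kernel actually negotiated.
mount | grep /mnt/datasets
```

Managed services expose the same knobs: AWS EFS and GCP Filestore documentation both recommend `nconnect` and large `rsize`/`wsize` values for throughput-bound workloads.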
## Kubernetes Integration
Both NFS and Lustre can be exposed to Kubernetes pods via CSI drivers and PersistentVolumes. Use ReadWriteMany access mode to allow multiple training pods to read the same dataset simultaneously.
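As a sketch, a static NFS-backed PersistentVolume and matching claim with `ReadWriteMany` might look like the following (server, path, and names are illustrative placeholders; a Lustre CSI driver would use a `csi:` volume source instead of the in-tree `nfs:` one):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server.example.com   # placeholder
    path: /export/datasets           # placeholder
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""   # bind to the static PV above
  resources:
    requests:
      storage: 1Ti
```

Every training pod that mounts this claim sees the same dataset tree, so data-parallel workers can share one copy of the data.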
## Ready to Learn Cache Strategies?
The next lesson covers caching techniques to maximize data loading performance.
Lilly Tech Systems