Intermediate

NFS at Scale for AI Workloads

Scale Network File System for AI training and inference using managed cloud services, pNFS parallelism, and hybrid tiering strategies.

NFS for AI: Practical Reality

NFS is the simplest shared file system to deploy and is universally supported. While it cannot match the raw throughput of Lustre or BeeGFS, modern managed NFS services have significantly improved performance and can serve many AI workloads, especially inference, model serving, and smaller-scale training.

Cloud-Managed NFS Options

Service	Max Throughput	Max IOPS	Best For
Amazon EFS	20+ GB/s (elastic)	500K+ (elastic)	Kubernetes AI workloads
Google Filestore	16 GB/s (Enterprise)	480K	Vertex AI, GKE ML
Azure NetApp Files	4.5 GB/s per volume	450K	Azure ML, HPC
Amazon FSx for NetApp ONTAP	4 GB/s	160K	Multi-protocol AI environments

Optimizing NFS for ML Data Loading

Bash - NFS Mount Options for AI

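# Optimized NFS mount for AI training data reads.
# The option list must be one comma-separated string with no whitespace,
# so it stays on a single line instead of being split across indented continuations.
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,async \
  fs-12345.efs.us-east-1.amazonaws.com:/ /mnt/training-data

# Key options explained:
# rsize/wsize=1048576: Request a 1 MiB read/write transfer size
# hard: Block and keep retrying I/O until the server responds, rather than returning errors that crash the training job
# timeo=600: Wait 60 seconds for a reply before retrying (the unit is tenths of a second)
# retrans=2: Retry a request twice before the client attempts further recovery
# noresvport: Connect from a non-privileged source port
# async: Buffer writes on the client (the default) for checkpoint performance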
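
The client and server negotiate rsize and wsize, so the values in effect can be smaller than the ones requested at mount time. The following is a minimal sketch for checking what was actually applied, assuming a Linux client where /proc/mounts lists the options in its fourth field. The helper names nfs_mount_opt and mount_opts_for are hypothetical, not standard tools.

```shell
# Sketch: print the value of one option from a comma-separated mount option string.
# nfs_mount_opt is a hypothetical helper name.
nfs_mount_opt() {
  local opts="$1" key="$2" entry
  local IFS=','
  for entry in $opts; do
    if [ "${entry%%=*}" = "$key" ]; then
      # Flag-style options such as "hard" carry no value, so the flag itself is printed.
      echo "${entry#*=}"
      return 0
    fi
  done
  return 1
}

# Sketch: print the option string for a mount point from a /proc/mounts-style table.
# Fields are: device, mount point, fs type, options, dump, pass.
mount_opts_for() {
  local mountpoint="$1" table="${2:-/proc/mounts}"
  awk -v mp="$mountpoint" '$2 == mp { print $4 }' "$table"
}
```

For example, nfs_mount_opt "$(mount_opts_for /mnt/training-data)" rsize prints the read size the kernel is using for that mount. Running nfsstat -m gives a similar per-mount summary on systems where the NFS utilities are installed.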

pNFS: Parallel NFS

Parallel NFS (pNFS) is an optional feature of NFSv4.1 that lets clients access data in parallel. The metadata server hands out layout information describing where a file's data lives, so clients can read and write directly to multiple storage devices simultaneously, much as Lustre stripes a file across storage targets.
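
Whether a mount is really using a pNFS layout depends on both the server and the client. The sketch below assumes the Linux client's /proc/self/mountstats format, which reports a pnfs= field on each mount's nfsv4: line; the exact contents of that file can vary by kernel version. The function name pnfs_layout_for is hypothetical.

```shell
# Sketch: print the pNFS layout driver reported for a mount point.
# Prints "not configured" when the client has not negotiated a layout.
pnfs_layout_for() {
  local mountpoint="$1" stats="${2:-/proc/self/mountstats}"
  awk -v mp="$mountpoint" '
    # Each mount starts with: device <src> mounted on <mountpoint> with fstype ...
    $1 == "device" { in_mount = ($5 == mp) }
    in_mount && $1 == "nfsv4:" {
      if (match($0, /pnfs=.*/)) {
        val = substr($0, RSTART + 5)
        sub(/,.*/, "", val)   # drop any fields that follow pnfs=
        print val
      }
    }
  ' "$stats"
}
```

Calling pnfs_layout_for /mnt/training-data on a client prints the layout name if one is active. A result of "not configured" means reads and writes go through the single NFS server rather than directly to the storage devices.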

When to Choose NFS vs Parallel File Systems

✅

Choose NFS When

Workloads need less than 10 GB/s aggregate throughput, teams want minimal operational overhead with managed services, or inference serving needs simple shared model storage.

⚠

Avoid NFS When

Training jobs need more than 20 GB/s throughput, workloads involve millions of small files, or checkpoint writes from hundreds of processes create metadata storms.
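
The throughput thresholds above can be written down as a small helper for capacity planning scripts. This is only a sketch: recommend_fs is a hypothetical name, it takes whole GB/s values, and the "benchmark-both" answer for the 10 to 20 GB/s range is an assumption, since the guidance above does not cover that band. It also ignores the small-file and checkpoint-storm criteria.

```shell
# Sketch: map a required aggregate throughput (integer GB/s) to a suggestion.
recommend_fs() {
  local gbps="$1"
  if [ "$gbps" -lt 10 ]; then
    echo "nfs"
  elif [ "$gbps" -gt 20 ]; then
    echo "parallel-fs"
  else
    # Assumption: the source gives no rule for 10-20 GB/s, so test both options.
    echo "benchmark-both"
  fi
}
```

For instance, recommend_fs 4 suggests NFS, while recommend_fs 25 points toward a parallel file system such as Lustre or BeeGFS.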

✅

Best practice: Use NFS for model serving and inference where simplicity matters and throughput demands are moderate. Pair it with a local caching strategy where models are cached on instance storage after first download from NFS.
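
One way to implement the local caching step is a copy-on-first-use helper that serving processes call before loading a model. The following is a minimal sketch under stated assumptions: cache_model is a hypothetical function, and it never refreshes a cached file when the NFS copy changes, so versioned file names are assumed.

```shell
# Sketch: cache_model SRC CACHE_DIR
# Prints a local path for SRC, copying it from NFS only if it is not cached yet.
cache_model() {
  local src="$1" cache_dir="$2"
  local dest="$cache_dir/$(basename "$src")"
  if [ ! -f "$dest" ]; then
    mkdir -p "$cache_dir" || return 1
    # Copy under a temporary name, then rename, so readers never see a partial file.
    local tmp="$dest.tmp.$$"
    cp "$src" "$tmp" && mv "$tmp" "$dest" || { rm -f "$tmp"; return 1; }
  fi
  echo "$dest"
}
```

A serving script might call model_path=$(cache_model /mnt/models/model-v1.bin /local-cache) and load from $model_path, where both paths are placeholders. After the first call, reads are served from instance storage and no longer touch NFS.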
