Intermediate

NFS at Scale for AI Workloads

Scale Network File System for AI training and inference using managed cloud services, pNFS parallelism, and hybrid tiering strategies.

NFS for AI: Practical Reality

NFS is the simplest shared file system to deploy and is universally supported. While it cannot match the raw throughput of Lustre or BeeGFS, modern managed NFS services have significantly improved performance and can serve many AI workloads, especially inference, model serving, and smaller-scale training.

Cloud-Managed NFS Options

ServiceMax ThroughputMax IOPSBest For
Amazon EFS20+ GB/s (elastic)500K+ (elastic)Kubernetes AI workloads
Google Filestore16 GB/s (Enterprise)480KVertex AI, GKE ML
Azure NetApp Files4.5 GB/s per volume450KAzure ML, HPC
AWS FSx for NetApp ONTAP4 GB/s160KMulti-protocol AI environments

Optimizing NFS for ML Data Loading

Bash - NFS Mount Options for AI
# Optimized NFS mount for AI training data reads
mount -t nfs4 -o \
  nfsvers=4.1,\
  rsize=1048576,\
  wsize=1048576,\
  hard,\
  timeo=600,\
  retrans=2,\
  noresvport,\
  async \
  fs-12345.efs.us-east-1.amazonaws.com:/ /mnt/training-data

# Key options explained:
# rsize/wsize=1MB: Maximum read/write block size
# hard: Retry indefinitely (prevents training job crashes)
# async: Allows buffered writes for checkpoint performance

pNFS: Parallel NFS

Parallel NFS (pNFS) extends NFS 4.1 to support parallel data access. The metadata server provides layout information so clients can read data directly from multiple storage devices simultaneously, similar to how Lustre striping works.

When to Choose NFS vs Parallel File Systems

Choose NFS When

Workloads need less than 10 GB/s aggregate throughput, teams want zero operational overhead with managed services, or inference serving needs simple shared model storage.

Avoid NFS When

Training jobs need more than 20 GB/s throughput, workloads involve millions of small files, or checkpoint writes from hundreds of processes create metadata storms.

Best practice: Use NFS for model serving and inference where simplicity matters and throughput demands are moderate. Pair it with a local caching strategy where models are cached on instance storage after first download from NFS.