Lustre File System for AI
Deploy and configure the Lustre parallel file system for high-throughput AI training workloads, including cloud-managed options like Amazon FSx for Lustre.
What is Lustre?
Lustre is an open-source parallel distributed file system originally developed for HPC. It powers many of the world's largest supercomputers and is widely used for AI training because it delivers massive aggregate throughput by striping files across hundreds of storage targets.
Lustre Architecture for AI
Management Server (MGS)
Stores configuration information for all Lustre file systems in a cluster. Typically co-located with the first MDS.
Metadata Server (MDS/MDT)
Handles file system namespace operations: creates, opens, renames, and permission checks. AI workloads with millions of small files benefit from multiple MDTs.
Object Storage Server (OSS/OST)
Stores actual file data. Each OSS manages multiple OSTs (Object Storage Targets). More OSTs means more aggregate bandwidth for training data reads.
Lustre Client
Kernel module on compute nodes that mounts the file system. Handles striping logic, caching, and parallel I/O coordination.
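The client's round-robin striping can be illustrated with a small sketch (Python for clarity; the stripe-size, stripe-count, and OST-index values are illustrative assumptions, not values read from a real file system). Each stripe-size chunk of a file lands on the next OST in the layout, which is why adding OSTs to a file's stripe adds read bandwidth.

```python
def ost_for_offset(offset: int, stripe_size: int, stripe_count: int,
                   ost_indices: list[int]) -> int:
    """Map a byte offset to the OST holding it under round-robin striping."""
    stripe_number = offset // stripe_size
    return ost_indices[stripe_number % stripe_count]

# Hypothetical layout: 4 MB stripes across 4 OSTs
MB = 1024 * 1024
osts = [0, 1, 2, 3]
print(ost_for_offset(0, 4 * MB, 4, osts))        # first stripe  -> OST 0
print(ost_for_offset(5 * MB, 4 * MB, 4, osts))   # second stripe -> OST 1
print(ost_for_offset(17 * MB, 4 * MB, 4, osts))  # fifth stripe wraps -> OST 0
```

Because consecutive stripes hit different OSTs, a single large sequential read fans out across servers in parallel.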
Configuring Striping for AI Workloads
# Set wide striping for large training data files
lfs setstripe -c -1 -S 4M /mnt/lustre/training-data/
# Stripe count -1 = all OSTs, stripe size 4MB
# This maximizes read throughput for large files

# For checkpoint directories (large sequential writes)
lfs setstripe -c 4 -S 16M /mnt/lustre/checkpoints/

# Check current striping configuration
lfs getstripe /mnt/lustre/training-data/dataset.bin

# Monitor OST usage to detect imbalance
lfs df -h /mnt/lustre
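Pipelines that lay out many directories can apply these rules programmatically. A minimal sketch (the directory-role names and fallback defaults are assumptions for illustration, not Lustre behavior):

```python
def stripe_params(role: str) -> tuple[int, str]:
    """Pick (stripe_count, stripe_size) per the guidance above.

    -1 means "stripe across all OSTs", as with `lfs setstripe -c -1`.
    """
    if role == "training-data":   # large files read in parallel
        return (-1, "4M")
    if role == "checkpoints":     # large sequential writes
        return (4, "16M")
    return (1, "1M")              # conservative single-stripe fallback

def setstripe_cmd(role: str, path: str) -> str:
    """Build the `lfs setstripe` invocation for a directory."""
    count, size = stripe_params(role)
    return f"lfs setstripe -c {count} -S {size} {path}"

print(setstripe_cmd("training-data", "/mnt/lustre/training-data/"))
# lfs setstripe -c -1 -S 4M /mnt/lustre/training-data/
```

Setting striping on a directory makes new files created inside it inherit the layout, so this only needs to run once per directory, before data lands there.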
Amazon FSx for Lustre
FSx for Lustre provides fully managed Lustre file systems in AWS. It can be linked to S3 buckets, automatically importing and exporting data, which creates a powerful pattern for AI: long-term storage in S3 with high-performance Lustre for active training.
# Create a persistent Lustre file system (PERSISTENT_2)
aws fsx create-file-system \
--file-system-type LUSTRE \
--storage-capacity 7200 \
--storage-type SSD \
--lustre-configuration '{
"DeploymentType": "PERSISTENT_2",
"PerUnitStorageThroughput": 1000,
"DataCompressionType": "LZ4"
}' \
--subnet-ids subnet-abc123

# PERSISTENT_2 does not accept ImportPath in the Lustre configuration;
# link to S3 with a data repository association once the file system is available
aws fsx create-data-repository-association \
--file-system-id fs-0123456789abcdef0 \
--file-system-path /training \
--data-repository-path s3://ml-datasets-prod/training/ \
--s3 '{"AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]}}'
Performance Tuning for AI
| Parameter | Default | AI Recommended | Impact |
|---|---|---|---|
| Stripe Count | 1 | All OSTs (-1) | Throughput scales linearly |
| Stripe Size | 1 MB | 4-16 MB | Reduces metadata overhead |
| Client Cache | 128 MB | 1-4 GB | Reduces re-reads across epochs |
| Read-ahead | 40 MB | 256 MB | Prefetches sequential data |
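The "scales linearly" claim in the first row has a ceiling: aggregate read bandwidth is bounded by whichever side saturates first, the OSTs serving stripes or the client NICs receiving them. A back-of-the-envelope sketch (all throughput figures are illustrative assumptions, not benchmarks):

```python
def aggregate_read_mbps(num_osts: int, ost_mbps: int,
                        num_clients: int, client_nic_mbps: int) -> int:
    """Aggregate throughput is capped by whichever side saturates first."""
    return min(num_osts * ost_mbps, num_clients * client_nic_mbps)

# Stripe count 1: a single file reads from one OST no matter how big the cluster is
print(aggregate_read_mbps(1, 500, 8, 1200))    # 500
# Striped across 48 OSTs: now the 8 client NICs become the bottleneck
print(aggregate_read_mbps(48, 500, 8, 1200))   # 9600
```

This is why wide striping pays off only up to the point where the client side saturates; beyond that, adding OSTs to a file's stripe adds metadata overhead without adding throughput.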