Intermediate

Lustre File System for AI

Deploy and configure the Lustre parallel file system for high-throughput AI training workloads, including cloud-managed options like Amazon FSx for Lustre.

What is Lustre?

Lustre is an open-source parallel distributed file system originally developed for HPC. It powers many of the world's largest supercomputers and is widely used for AI training because it delivers massive aggregate throughput by striping files across hundreds of storage targets.

Lustre Architecture for AI

  1. Management Server (MGS)

    Stores configuration information for all Lustre file systems in a cluster. Typically co-located with the first MDS.

  2. Metadata Server (MDS/MDT)

    Handles file system namespace operations: creates, opens, renames, and permission checks. AI workloads with millions of small files benefit from multiple MDTs.

  3. Object Storage Server (OSS/OST)

    Stores actual file data. Each OSS manages multiple OSTs (Object Storage Targets). More OSTs means more aggregate bandwidth for training data reads.

  4. Lustre Client

    Kernel module on compute nodes that mounts the file system. Handles striping logic, caching, and parallel I/O coordination.
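The components above come together when a compute node mounts the file system. A minimal sketch, assuming a hypothetical MGS at 10.0.0.1 and a file system named aifs (substitute your own MGS NID and fsname):

```shell
# Mount a Lustre file system on a compute node.
# Format: <mgs-nid>@<network>:/<fsname> <mountpoint>
# 10.0.0.1 and "aifs" are placeholder values for this example.
mount -t lustre 10.0.0.1@tcp:/aifs /mnt/lustre

# List the Lustre devices the client has connected to
lctl dl

# Verify the client can reach all MDTs and OSTs
lfs check servers
```

The client kernel module must be installed and match the running kernel version; on managed offerings like FSx for Lustre, AWS publishes client packages for common distributions.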

Configuring Striping for AI Workloads

Bash - Lustre Striping
# Set wide striping for large training data files
lfs setstripe -c -1 -S 4M /mnt/lustre/training-data/

# Stripe count -1 = all OSTs, stripe size 4MB
# This maximizes read throughput for large files

# For checkpoint directories (large sequential writes)
lfs setstripe -c 4 -S 16M /mnt/lustre/checkpoints/

# Check current striping configuration
lfs getstripe /mnt/lustre/training-data/dataset.bin

# Monitor OST usage to detect imbalance
lfs df -h /mnt/lustre
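For directories that hold a mix of small and large files, recent Lustre releases also support progressive file layouts (PFL), which change the stripe count as a file grows. A sketch, assuming a hypothetical /mnt/lustre/mixed/ directory:

```shell
# Progressive file layout: first 256MB on 1 OST (cheap for small files),
# the remainder striped across all OSTs with a 4MB stripe size.
lfs setstripe -E 256M -c 1 -E -1 -c -1 -S 4M /mnt/lustre/mixed/
```

This avoids paying wide-striping metadata overhead on small files while still giving large files full aggregate bandwidth.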

Amazon FSx for Lustre

FSx for Lustre provides fully managed Lustre file systems in AWS. It can be linked to S3 buckets, automatically importing and exporting data, which creates a powerful pattern for AI: long-term storage in S3 with high-performance Lustre for active training.

AWS CLI - Create FSx for Lustre
# Create a persistent Lustre file system
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 7200 \
  --storage-type SSD \
  --lustre-configuration '{
    "DeploymentType": "PERSISTENT_2",
    "PerUnitStorageThroughput": 1000,
    "DataCompressionType": "LZ4"
  }' \
  --subnet-ids subnet-abc123

# PERSISTENT_2 file systems link to S3 via a data repository association
# (ImportPath/AutoImportPolicy apply only to earlier deployment types)
aws fsx create-data-repository-association \
  --file-system-id fs-0123456789abcdef0 \
  --file-system-path /training \
  --data-repository-path s3://ml-datasets-prod/training/ \
  --s3 '{
    "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]}
  }'
Best practice: Use FSx for Lustre with S3 data repository associations for training jobs. Data is lazily loaded from S3 on first access and cached on Lustre. After training, export results back to S3 for durability. This pattern gives you Lustre performance at near-S3 cost for infrequently accessed data.
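Because data is lazily loaded, the first epoch can stall on S3 fetches. One way to avoid this, following the preloading pattern from the FSx for Lustre documentation, is to hydrate the cache before training starts (the paths and file-system ID below are placeholders):

```shell
# Preload training data from S3 into Lustre before the job starts.
# hsm_restore pulls each file's contents from the linked S3 bucket.
find /mnt/lustre/training -type f -print0 | xargs -0 -n 1 lfs hsm_restore

# After training, push checkpoints back to S3 explicitly
# (or rely on the association's auto-export policy, if configured)
aws fsx create-data-repository-task \
  --file-system-id fs-0123456789abcdef0 \
  --type EXPORT_TO_REPOSITORY \
  --paths checkpoints \
  --report Enabled=false
```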

Performance Tuning for AI

Parameter      Default   AI Recommended   Impact
-------------  --------  ---------------  ------------------------------
Stripe Count   1         All OSTs (-1)    Throughput scales linearly
Stripe Size    1 MB      4-16 MB          Reduces metadata overhead
Client Cache   128 MB    1-4 GB           Reduces re-reads across epochs
Read-ahead     40 MB     256 MB           Prefetches sequential data
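The client-side cache and read-ahead settings in the table are runtime-tunable on each compute node. A sketch using `lctl set_param` (values mirror the table; these settings reset on remount, so bake them into your node bootstrap):

```shell
# Raise the client page cache limit to 4GB per mount
lctl set_param llite.*.max_cached_mb=4096

# Increase read-ahead for sequential training-data reads
lctl set_param llite.*.max_read_ahead_mb=256

# Confirm the active values
lctl get_param llite.*.max_cached_mb llite.*.max_read_ahead_mb
```

Stripe count and stripe size, by contrast, are per-file/per-directory layout attributes set with `lfs setstripe` as shown earlier.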