BeeGFS for AI Workloads
Set up BeeGFS as a cost-effective parallel file system for AI training with buddy mirroring, flexible striping, and GPUDirect Storage integration.
Why BeeGFS?
BeeGFS (originally FhGFS) is a parallel file system developed at the Fraunhofer Center for High Performance Computing and widely used for AI workloads. Its appeal lies in straightforward deployment, strong small-file performance, and the ability to co-locate storage and compute on the same nodes.
BeeGFS Components
Management Service (beegfs-mgmtd)
Registry service that tracks all other BeeGFS services. Lightweight and not involved in file I/O; it is accessed mainly during mount and service registration.
Metadata Service (beegfs-meta)
Handles directory and file metadata. Distributes metadata across multiple servers for parallel namespace operations.
Storage Service (beegfs-storage)
Stores file data chunks. Each server can manage multiple storage targets. Files are striped across targets for throughput.
Client (beegfs-client)
Kernel module that mounts the BeeGFS file system. Communicates directly with metadata and storage servers.
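The striping described for the storage service can be illustrated with a little arithmetic. The following is a minimal sketch, assuming plain round-robin (RAID0-style) striping; the function name stripe_target is hypothetical, and the slot numbers are positions in a file's target list, not real BeeGFS target IDs.

```shell
#!/usr/bin/env bash
# Sketch: which stripe slot holds a given byte offset under round-robin
# (RAID0-style) striping. Slots are positions in the file's target list,
# not actual BeeGFS target IDs.
stripe_target() {
    local offset=$1 chunk_size=$2 num_targets=$3
    local chunk_index=$(( offset / chunk_size ))
    echo $(( chunk_index % num_targets ))
}

chunk_size=$(( 2 * 1024 * 1024 ))   # 2 MiB chunks
num_targets=8                       # file striped across 8 targets

for offset_mib in 0 2 14 16 18; do
    offset=$(( offset_mib * 1024 * 1024 ))
    slot=$(stripe_target "$offset" "$chunk_size" "$num_targets")
    echo "offset ${offset_mib} MiB -> target slot ${slot}"
done
```

With 2 MiB chunks and 8 targets, a sequential read touches every target once per 16 MiB, which is why large streaming reads can draw on the combined throughput of all targets.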
Configuration for AI Training
# Configure striping for training data
beegfs-ctl --setpattern --chunksize=2M --numtargets=8 /mnt/beegfs/training/

# Enable buddy mirroring for checkpoint data
# (BeeGFS 7.x spells the option --pattern=buddymirror; buddy groups must already exist)
beegfs-ctl --setpattern --pattern=buddymirror --numtargets=4 /mnt/beegfs/checkpoints/

# Check file distribution across storage targets
beegfs-ctl --getentryinfo /mnt/beegfs/training/dataset.bin

# Monitor storage target state and space utilization
beegfs-ctl --listtargets --nodetype=storage --state --spaceinfo
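When sizing the checkpoint directory, it helps to remember that buddy mirroring stores every chunk twice. The sketch below is rough space accounting only: it assumes data spreads evenly across targets and ignores metadata and file system overhead. The function names and the 40 GiB checkpoint size are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: rough space accounting for striped and buddy-mirrored files.
# Assumes an even spread of data; ignores metadata and other overhead.

# Approximate bytes each stripe target (or buddy group) holds for one file.
per_target_bytes() {
    local file_bytes=$1 num_targets=$2
    echo $(( (file_bytes + num_targets - 1) / num_targets ))
}

# Raw bytes consumed: buddy mirroring keeps two copies of every chunk.
raw_bytes() {
    local file_bytes=$1 mirrored=$2
    if [ "$mirrored" = "yes" ]; then
        echo $(( file_bytes * 2 ))
    else
        echo "$file_bytes"
    fi
}

gib=$(( 1024 * 1024 * 1024 ))
checkpoint=$(( 40 * gib ))   # hypothetical 40 GiB checkpoint

echo "per buddy group: $(( $(per_target_bytes "$checkpoint" 4) / gib )) GiB"
echo "raw space used:  $(( $(raw_bytes "$checkpoint" yes) / gib )) GiB"
```

The trade-off is the usual one for mirroring: checkpoints survive the loss of a single target in a buddy group, at the cost of twice the raw capacity.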
BeeOND: BeeGFS On Demand
BeeOND creates ephemeral BeeGFS file systems from the local drives of the compute nodes allocated to a job. This is ideal for AI training jobs that need fast scratch space for the duration of a run but no persistent storage.
# Create a BeeOND file system across allocated GPU nodes
beeond start -n /path/to/nodefile -d /local/nvme -c /mnt/beeond

# Stage training data from persistent storage
beeond-cp stagein -n /path/to/nodefile \
    -g /persistent/training-data/ -l /mnt/beeond/data/

# After training, stage out results
beeond-cp stageout -n /path/to/nodefile \
    -l /mnt/beeond/results/ -g /persistent/results/

# Tear down when job completes
beeond stop -n /path/to/nodefile -L -d
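In a batch job, the four BeeOND steps are usually chained so that the teardown runs even when staging or training fails. The following is one possible sketch, not an official BeeOND wrapper: the function names beeond_job and run, the DRY_RUN switch, and all paths are assumptions made for illustration. It uses only the beeond and beeond-cp options shown above.

```shell
#!/usr/bin/env bash
# Sketch of a BeeOND job lifecycle. Paths and the training command are
# placeholders; set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

beeond_job() {
    local nodefile=$1 local_dir=$2 mount_point=$3
    local global_data=$4 global_results=$5
    shift 5   # remaining arguments form the training command

    # Without a running file system there is nothing to stage or tear down.
    run beeond start -n "$nodefile" -d "$local_dir" -c "$mount_point" || return 1

    local status=0
    run beeond-cp stagein -n "$nodefile" -g "$global_data" -l "$mount_point/data/" \
        && run "$@" \
        && run beeond-cp stageout -n "$nodefile" -l "$mount_point/results/" -g "$global_results" \
        || status=$?

    # Always tear down, even if staging or training failed.
    run beeond stop -n "$nodefile" -L -d
    return "$status"
}
```

A dry run such as `DRY_RUN=1 beeond_job /path/to/nodefile /local/nvme /mnt/beeond /persistent/training-data/ /persistent/results/ python train.py` prints the five commands in order without touching any node (train.py is a placeholder). Note that results are staged out only when training succeeds; a job that should keep partial results would need to move the stage-out step outside the success chain.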