Azure CycleCloud for AI
Orchestrate HPC clusters on Azure with CycleCloud using Slurm, PBS, or custom schedulers for large-scale distributed AI training.
What is CycleCloud?
Azure CycleCloud is a tool for creating, managing, and optimizing HPC clusters on Azure. It supports popular job schedulers (Slurm, PBS Pro, Grid Engine) and provides auto-scaling, cost management, and monitoring for GPU clusters used in AI training.
Scheduler Integration
Native support for Slurm, PBS Pro, Grid Engine, and custom schedulers familiar to HPC and research teams.
Auto-Scaling
Automatically provisions and deprovisions GPU nodes based on job queue depth, scaling to zero when idle.
Cost Controls
Built-in cost tracking, budget limits, and Spot VM support with automatic preemption handling.
Enterprise Ready
VNet integration, managed identity, and RBAC for secure, compliant HPC cluster deployments.
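To make the queue-driven auto-scaling described above concrete, here is a minimal sketch of the arithmetic such a scaler could apply. It is not CycleCloud's actual algorithm; the function name `target_nodes` and the sample numbers are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Illustrative only: NOT CycleCloud's real autoscaling logic.
# target_nodes <pending_gpus> <gpus_per_node> <max_nodes>
# Prints the node count needed to cover the queued GPU demand,
# capped at max_nodes, and 0 when nothing is queued.
target_nodes() {
  local pending_gpus=$1 gpus_per_node=$2 max_nodes=$3
  # Ceiling division: a partially used node still has to be provisioned.
  local needed=$(( (pending_gpus + gpus_per_node - 1) / gpus_per_node ))
  if (( needed > max_nodes )); then
    needed=$max_nodes
  fi
  echo "$needed"
}

target_nodes 0 8 8     # empty queue     -> 0 (scale to zero)
target_nodes 20 8 8    # 20 GPUs queued  -> 3
target_nodes 100 8 8   # demand over cap -> 8
```

The cap matters because a scheduler can queue more work than the cluster is allowed to provision; the excess simply waits.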
CycleCloud Slurm Cluster for AI
# CycleCloud CLI: create a GPU Slurm cluster named "slurm-gpu" from the "slurm" template.
# create_cluster takes <TEMPLATE> <NAME>; -P overrides a single template parameter.
# Parameter names must match those defined in your template version.
cyclecloud create_cluster slurm slurm-gpu \
  -P "Region=eastus" \
  -P "SubnetId=/subscriptions/.../subnets/hpc" \
  -P "SchedulerMachineType=Standard_D4s_v5" \
  -P "HPCMachineType=Standard_ND96asr_v4" \
  -P "MaxHPCExecuteCoreCount=768" \
  -P "UseSpot=true"

# Submit a training job via Slurm: 4 nodes x 8 tasks, one task per GPU
sbatch --nodes=4 --ntasks-per-node=8 \
  --gpus-per-node=8 --partition=hpc \
  train_llm.sh
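As a quick sanity check on the parameters above, note that `MaxHPCExecuteCoreCount` is a vCPU limit, not a node count. Assuming Standard_ND96asr_v4 provides 96 vCPUs and 8 GPUs per VM, the resulting capacity can be sketched as follows; the helper names `max_nodes` and `total_gpus` are hypothetical.

```shell
#!/usr/bin/env bash
# Capacity arithmetic for the example cluster.
# Assumption: each Standard_ND96asr_v4 VM has 96 vCPUs and 8 GPUs.

# max_nodes <core_limit> <vcpus_per_vm>: whole VMs that fit under the limit
max_nodes() {
  echo $(( $1 / $2 ))
}

# total_gpus <nodes> <gpus_per_vm>
total_gpus() {
  echo $(( $1 * $2 ))
}

nodes=$(max_nodes 768 96)
echo "max nodes: ${nodes}"                      # max nodes: 8
echo "max GPUs:  $(total_gpus "$nodes" 8)"      # max GPUs:  64
echo "job GPUs:  $(total_gpus 4 8)"             # job GPUs:  32
```

Under these assumptions the 4-node job uses half of the cluster's maximum capacity, so two such jobs could run side by side before further submissions have to queue.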
CycleCloud vs Other Options
| Feature | CycleCloud | Azure Batch | AKS |
|---|---|---|---|
| Scheduler | Slurm, PBS, custom | Built-in | K8s native |
| Best for | HPC teams, research | Batch processing | Cloud-native teams |
| InfiniBand | Full support | Full support | Limited |
| Learning curve | Familiar to HPC users | Azure-native | K8s knowledge |
| Multi-node MPI | Native | Supported | Via operators |
Key Configuration Tips
- Partition design: Create separate Slurm partitions for training (ND-series) and inference (NCasT4_v3-series, NVIDIA T4) workloads
- Auto-scale timers: Set idle timeout to 5-10 minutes to balance responsiveness and cost
- Shared storage: Mount Azure NetApp Files or BeeGFS for high-performance shared training data
- Spot fallback: Configure Spot VMs as primary with On-Demand fallback for critical training runs
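The idle-timeout trade-off in the tips above can be estimated with simple arithmetic: every node stays billed for the timeout window after its last job ends. The sketch below assumes a placeholder price of 3000 cents per node-hour, which is not a real Azure price; substitute the current rate for your VM size and region. The function name `idle_cost_cents` is hypothetical.

```shell
#!/usr/bin/env bash
# idle_cost_cents <nodes> <idle_minutes> <price_cents_per_node_hour>
# Cost of keeping <nodes> idle for one full timeout window, in cents.
idle_cost_cents() {
  local nodes=$1 idle_minutes=$2 price_cents=$3
  echo $(( nodes * idle_minutes * price_cents / 60 ))
}

PRICE_CENTS=3000  # placeholder value, NOT a real Azure price

idle_cost_cents 8 5 "$PRICE_CENTS"    # 5-minute timeout  -> 2000
idle_cost_cents 8 10 "$PRICE_CENTS"   # 10-minute timeout -> 4000
```

A shorter timeout halves the idle spend per scale-down in this example, at the price of more frequent node provisioning when jobs arrive in bursts.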
Lilly Tech Systems