Advanced

Azure CycleCloud for AI

Orchestrate HPC clusters on Azure with CycleCloud using Slurm, PBS, or custom schedulers for large-scale distributed AI training.

What is CycleCloud?

Azure CycleCloud is a tool for creating, managing, and optimizing HPC clusters on Azure. It supports popular job schedulers (Slurm, PBS Pro, Grid Engine) and provides auto-scaling, cost management, and monitoring for GPU clusters used in AI training.

🛠

Scheduler Integration

Native support for Slurm, PBS Pro, Grid Engine, and custom schedulers familiar to HPC and research teams.

🔄

Auto-Scaling

Automatically provisions and deprovisions GPU nodes based on job queue depth, scaling to zero when idle.

📈

Cost Controls

Built-in cost tracking, budget limits, and Spot VM support with automatic preemption handling.

🔒

Enterprise Ready

VNet integration, managed identity, and RBAC for secure, compliant HPC cluster deployments.

CycleCloud Slurm Cluster for AI

# CycleCloud CLI - Create a GPU Slurm cluster
cyclecloud create_cluster slurm-gpu \
  --parameter "Region=eastus" \
  --parameter "SubnetId=/subscriptions/.../subnets/hpc" \
  --parameter "SchedulerMachineType=Standard_D4s_v5" \
  --parameter "HPCMachineType=Standard_ND96asr_v4" \
  --parameter "MaxHPCExecuteCoreCount=768" \
  --parameter "UseSpot=true"

# Submit a training job via Slurm
sbatch --nodes=4 --ntasks-per-node=8 \
  --gpus-per-node=8 --partition=hpc \
  train_llm.sh

CycleCloud vs Other Options

FeatureCycleCloudAzure BatchAKS
SchedulerSlurm, PBS, customBuilt-inK8s native
Best forHPC teams, researchBatch processingCloud-native teams
InfiniBandFull supportFull supportLimited
Learning curveFamiliar to HPC usersAzure-nativeK8s knowledge
Multi-node MPINativeSupportedVia operators

Key Configuration Tips

  • Partition design: Create separate Slurm partitions for training (ND-series) and inference (NC T4) workloads
  • Auto-scale timers: Set idle timeout to 5-10 minutes to balance responsiveness and cost
  • Shared storage: Mount Azure NetApp Files or BeeGFS for high-performance shared training data
  • Spot fallback: Configure Spot VMs as primary with On-Demand fallback for critical training runs
Pro tip: CycleCloud is the best choice for teams transitioning from on-premise HPC clusters to Azure. Data scientists can continue using familiar Slurm commands while the infrastructure team benefits from Azure's auto-scaling and cost management. The transition is nearly transparent to end users.