Beginner
Introduction to Azure GPU VMs & HPC
Understand Azure's GPU virtual machine offerings and high-performance computing capabilities for AI training, inference, and scientific computing.
Azure GPU Infrastructure
Azure operates one of the largest GPU fleets in the cloud, with NVIDIA A100, H100, and H200 GPUs connected via InfiniBand for distributed AI workloads. The N-series VM families provide GPU compute at a range of price-performance points.
NC-Series
General-purpose GPU VMs for training and inference, with up to four NVIDIA A100 GPUs per VM.
ND-Series
High-performance training VMs with eight H100 or A100 GPUs per VM and InfiniBand RDMA for multi-node distributed training.
NV-Series
Visualization and inference VMs with NVIDIA A10 GPUs for cost-effective model serving and rendering.
HPC Infrastructure
Azure Batch, CycleCloud, and managed HPC schedulers (Slurm, PBS) for large-scale job orchestration.
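The family descriptions above can be condensed into a small reference lookup. The sketch below is an illustrative Python summary; the dictionary layout is ours, not part of any Azure SDK:

```python
# Illustrative summary of the Azure N-series families described above.
# This data structure is ours; it is not an Azure API.
N_SERIES = {
    "NC": {"gpu": "NVIDIA A100", "max_gpus_per_vm": 4,
           "role": "general-purpose training and inference"},
    "ND": {"gpu": "NVIDIA H100 or A100", "max_gpus_per_vm": 8,
           "role": "multi-node distributed training (InfiniBand RDMA)"},
    "NV": {"gpu": "NVIDIA A10",
           "role": "visualization and cost-effective inference"},
}

for family, spec in N_SERIES.items():
    print(f"{family}-series: {spec['gpu']} -- {spec['role']}")
```

A lookup like this is handy in provisioning scripts, where the family string is mapped onward to a concrete VM size name.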
When to Use GPU VMs vs Managed Services
| Scenario | Recommendation | Why |
|---|---|---|
| Custom training frameworks | GPU VMs / CycleCloud | Full control over software stack and scheduling |
| Managed ML pipelines | Azure ML Compute | Integrated with Azure ML for experiment tracking |
| Multi-node distributed | ND-series + InfiniBand | RDMA networking for linear scaling |
| Batch processing | Azure Batch | Job scheduling with auto-scaling pools |
| Cost-effective inference | NC T4 or NV A10 | Best price-performance for serving |
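The decision table above can be encoded directly as a lookup helper. A minimal sketch, assuming these coarse scenario labels (the labels and function name are illustrative, not an Azure API):

```python
# Encodes the "When to Use" decision table above.
# Scenario labels and the helper name are illustrative only.
RECOMMENDATIONS = {
    "custom-training-frameworks": "GPU VMs / CycleCloud",
    "managed-ml-pipelines": "Azure ML Compute",
    "multi-node-distributed": "ND-series + InfiniBand",
    "batch-processing": "Azure Batch",
    "cost-effective-inference": "NC T4 or NV A10",
}

def recommend(scenario: str) -> str:
    """Return the recommended Azure compute option for a scenario label."""
    try:
        return RECOMMENDATIONS[scenario]
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None

print(recommend("batch-processing"))  # -> Azure Batch
```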
Good to know: Azure operates one of the largest publicly available H100 GPU fleets, with ND H100 v5 VMs featuring 8x H100 GPUs and 3200 Gbps NDR InfiniBand. These VMs can scale to thousands of GPUs for training large language models, competitive with the infrastructure used by major AI labs.
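The 3200 Gbps figure quoted above can be sanity-checked with simple arithmetic: NDR InfiniBand runs at 400 Gb/s per link, and the ND H100 v5 VM carries one link per GPU (eight in total). A quick check in Python:

```python
# Sanity check of the ND H100 v5 interconnect figure:
# one 400 Gb/s NDR InfiniBand link per GPU, eight GPUs per VM.
ndr_link_gbps = 400                      # Gb/s per NDR link
links_per_vm = 8                         # one per H100 GPU
aggregate_gbps = ndr_link_gbps * links_per_vm
print(aggregate_gbps)                    # 3200 Gb/s, matching the quoted figure

aggregate_gb_per_s = aggregate_gbps / 8  # bits -> bytes: 400 GB/s per VM
```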
Key takeaway: Azure's GPU and HPC infrastructure gives you the flexibility to run any AI workload at any scale. The key decisions are: which VM family matches your workload (NC for general, ND for training), how to orchestrate jobs (Batch, CycleCloud, or AKS), and how to optimize costs (Spot, Reserved Instances, right-sizing).
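The cost levers named above (Spot, Reserved Instances, right-sizing) are easy to parameterize when comparing options. A minimal sketch with caller-supplied rates; no real Azure prices are assumed, and the overhead factor for Spot evictions is a modeling choice of ours:

```python
def job_cost(hourly_rate: float, nodes: int, hours: float,
             discount_pct: float = 0.0, overhead_factor: float = 1.0) -> float:
    """Estimate a training job's compute cost.

    hourly_rate     -- pay-as-you-go price per node-hour (caller-supplied)
    discount_pct    -- Spot or Reserved Instance discount, in percent
    overhead_factor -- >1.0 to model rework after Spot evictions
    """
    effective_rate = hourly_rate * (1 - discount_pct / 100)
    return effective_rate * nodes * hours * overhead_factor

# Example with placeholder numbers (not real Azure prices):
on_demand = job_cost(hourly_rate=10.0, nodes=4, hours=100)      # 4000.0
spot = job_cost(hourly_rate=10.0, nodes=4, hours=100,
                discount_pct=70, overhead_factor=1.15)          # 1380.0
```

Even with a 15% rework penalty for evictions, the Spot run in this example costs roughly a third of pay-as-you-go, which is why Spot is the default choice for fault-tolerant training jobs.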
Lilly Tech Systems