Beginner

Introduction to Azure GPU VMs & HPC

Understand Azure's GPU virtual machine offerings and high-performance computing capabilities for AI training, inference, and scientific computing.

Azure GPU Infrastructure

Azure offers one of the largest GPU fleets in the cloud with NVIDIA A100, H100, and H200 GPUs connected via InfiniBand for distributed AI workloads. The N-Series VM families provide GPU compute at various price-performance points.

💻

NC-Series

General-purpose GPU VMs for training and inference, with up to 4 A100 GPUs per VM.

ND-Series

High-performance training VMs. 8x H100 or A100 GPUs with InfiniBand RDMA for distributed training.

📈

NV-Series

Visualization and inference VMs. A10 GPUs for cost-effective model serving and rendering.

🛠

HPC Infrastructure

Azure Batch, CycleCloud, and managed HPC schedulers (Slurm, PBS) for large-scale job orchestration.
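As a sketch of what job orchestration looks like on a CycleCloud-provisioned Slurm cluster, a batch script for a two-node, 8-GPU-per-node training job might resemble the following (the partition name and training script are placeholders, not Azure defaults):

```bash
#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2                 # two ND-series nodes
#SBATCH --ntasks-per-node=8       # one task per GPU
#SBATCH --gres=gpu:8              # request all 8 GPUs on each node
#SBATCH --partition=hpc           # partition name is an assumption

# Launch one worker per GPU across both nodes; frameworks such as
# PyTorch DDP with NCCL will use the InfiniBand fabric when available.
srun python train.py --epochs 10
```

This is submitted with `sbatch`, and Slurm handles placement across the nodes; the distributed framework still needs its own rendezvous configuration (for example, reading Slurm environment variables).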

When to Use GPU VMs vs Managed Services

| Scenario | Recommendation | Why |
| --- | --- | --- |
| Custom training frameworks | GPU VMs / CycleCloud | Full control over software stack and scheduling |
| Managed ML pipelines | Azure ML Compute | Integrated with Azure ML for experiment tracking |
| Multi-node distributed | ND-series + InfiniBand | RDMA networking for linear scaling |
| Batch processing | Azure Batch | Job scheduling with auto-scaling pools |
| Cost-effective inference | NC T4 or NV A10 | Best price-performance for serving |
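To make the decision criteria concrete, the table above can be encoded as a small lookup helper. This is purely illustrative: the function and dictionary are hypothetical, not part of any Azure SDK.

```python
# Hypothetical helper mirroring the decision table above; not an Azure API.
RECOMMENDATIONS = {
    "custom training frameworks": ("GPU VMs / CycleCloud",
                                   "Full control over software stack and scheduling"),
    "managed ml pipelines": ("Azure ML Compute",
                             "Integrated with Azure ML for experiment tracking"),
    "multi-node distributed": ("ND-series + InfiniBand",
                               "RDMA networking for linear scaling"),
    "batch processing": ("Azure Batch",
                         "Job scheduling with auto-scaling pools"),
    "cost-effective inference": ("NC T4 or NV A10",
                                 "Best price-performance for serving"),
}

def recommend(scenario: str) -> str:
    """Return the recommended compute option for a scenario, with the rationale."""
    option, why = RECOMMENDATIONS.get(
        scenario.lower(),
        ("GPU VMs", "Default: start general-purpose, then specialize"),
    )
    return f"{option} ({why})"

print(recommend("Multi-node distributed"))
```

In practice these choices are rarely exclusive; teams often combine them, for example CycleCloud for training clusters alongside Azure ML for experiment tracking.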
💡
Good to know: Azure operates one of the largest publicly available H100 GPU fleets, with ND H100 v5 VMs featuring 8x H100 GPUs and 3.2 Tbps (400 Gbps per GPU) of NDR InfiniBand. These VMs can scale to thousands of GPUs for training large language models, competitive with the infrastructure used by major AI labs.
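For reference, provisioning one of these VMs from the Azure CLI might look like the sketch below. The resource group, VM name, and image URN are illustrative placeholders; check current region availability and quota for ND H100 v5 before running anything like this.

```shell
# Sketch: create an ND H100 v5 VM with the Azure CLI (names are placeholders).
az vm create \
  --resource-group my-hpc-rg \
  --name nd-h100-node \
  --size Standard_ND96isr_H100_v5 \
  --image microsoft-dsvm:ubuntu-hpc:2204:latest \
  --accelerated-networking true
```

Multi-node training deployments typically place VMs like this in the same scale set or availability set so the InfiniBand fabric connects them.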
Key takeaway: Azure's GPU and HPC infrastructure gives you the flexibility to run any AI workload at any scale. The key decisions are: which VM family matches your workload (NC for general, ND for training), how to orchestrate jobs (Batch, CycleCloud, or AKS), and how to optimize costs (Spot, Reserved Instances, right-sizing).
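The cost-optimization point can be made tangible with a back-of-the-envelope comparison of the pricing models mentioned above. All dollar figures and discount rates below are hypothetical placeholders, not real Azure prices; the point is only the shape of the calculation.

```python
# Back-of-the-envelope comparison of pay-as-you-go vs Spot vs Reserved pricing.
# All figures are hypothetical placeholders, not actual Azure rates.
PAYG_HOURLY = 30.0        # assumed pay-as-you-go rate for an 8-GPU VM, $/hour
SPOT_DISCOUNT = 0.70      # Spot VMs are often deeply discounted (70% assumed)
RESERVED_DISCOUNT = 0.40  # assumed discount for a 1-year reservation

def monthly_cost(hourly: float, discount: float, hours: float = 730) -> float:
    """Effective monthly cost at a given discount off the pay-as-you-go rate."""
    return hourly * (1 - discount) * hours

print(f"PAYG:     ${monthly_cost(PAYG_HOURLY, 0.0):,.0f}/mo")
print(f"Spot:     ${monthly_cost(PAYG_HOURLY, SPOT_DISCOUNT):,.0f}/mo")
print(f"Reserved: ${monthly_cost(PAYG_HOURLY, RESERVED_DISCOUNT):,.0f}/mo")
```

The trade-off behind the numbers: Spot capacity can be evicted at any time, so it suits checkpointed training and fault-tolerant batch jobs, while reservations fit steady, predictable workloads.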