Learn Azure GPU VMs & HPC

Master Azure's GPU virtual machines and high-performance computing infrastructure for AI. The course runs from N-Series VMs and InfiniBand networking to Azure Batch and CycleCloud for large-scale distributed training.

6 Lessons, Hands-On Labs, Self-Paced, 100% Free

Your Learning Path

Follow these lessons in order, or jump to any topic that interests you.

Beginner

1. Introduction

Overview of Azure GPU and HPC infrastructure, use cases, and choosing the right compute tier for AI.

Start here →

Intermediate

2. N-Series VMs

Deep dive into NC, ND, NV, and NG VM families with GPU specifications, pricing, and workload matching.

10 min read →

Intermediate

3. InfiniBand

RDMA networking for distributed training with HDR and NDR InfiniBand on ND-series VMs.

12 min read →

Intermediate

4. Azure Batch

Azure Batch for parallel GPU workloads, job scheduling, auto-scaling pools, and container support.

15 min read →

Advanced

5. CycleCloud

Azure CycleCloud for HPC cluster orchestration with Slurm, PBS, and custom schedulers for AI training.

12 min read →

Advanced

6. Best Practices

Performance tuning, cost optimization, security, and operational guidelines for GPU HPC on Azure.

10 min read →

What You'll Learn

By the end of this course, you'll be able to:

Choose GPU VMs

Select the right N-Series VM family and size for your specific AI training and inference workloads.

Build HPC Clusters

Set up distributed training clusters with InfiniBand RDMA for near-linear multi-node scaling.

Orchestrate Jobs

Use Azure Batch and CycleCloud to manage large-scale GPU compute jobs with auto-scaling.

Optimize Performance

Tune NCCL, GPU drivers, and network settings for maximum distributed training throughput.