Learn Azure GPU VMs & HPC
Master Azure's GPU virtual machines and high-performance computing infrastructure for AI. The course runs from N-Series VMs and InfiniBand networking through Azure Batch and CycleCloud for large-scale distributed training.
Your Learning Path
Follow these lessons in order, or jump to any topic that interests you.
1. Introduction
Overview of Azure GPU and HPC infrastructure, use cases, and choosing the right compute tier for AI.
2. N-Series VMs
Deep dive into NC, ND, NV, and NG VM families with GPU specifications, pricing, and workload matching.
3. InfiniBand
RDMA networking for distributed training with HDR and NDR InfiniBand on ND-series VMs.
4. Azure Batch
Azure Batch for parallel GPU workloads, job scheduling, auto-scaling pools, and container support.
5. CycleCloud
Azure CycleCloud for HPC cluster orchestration with Slurm, PBS, and custom schedulers for AI training.
6. Best Practices
Performance tuning, cost optimization, security, and operational guidelines for GPU HPC on Azure.
What You'll Learn
By the end of this course, you'll be able to:
Choose GPU VMs
Select the right N-Series VM family and size for your specific AI training and inference workloads.
Build HPC Clusters
Set up distributed training clusters with InfiniBand RDMA for near-linear multi-node scaling.
Orchestrate Jobs
Use Azure Batch and CycleCloud to manage large-scale GPU compute jobs with auto-scaling.
Optimize Performance
Tune NCCL, GPU drivers, and network settings for maximum distributed training throughput.
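To make "near-linear multi-node scaling" and NCCL tuning a little more concrete, here is a minimal Python sketch. The helper names `scaling_efficiency` and `nccl_env` are hypothetical and not part of any Azure or NCCL API. The environment variables `NCCL_IB_DISABLE`, `NCCL_SOCKET_IFNAME`, and `NCCL_DEBUG` are real NCCL settings, but the values shown (including the `eth0` interface name) are assumptions that should be checked against the actual VM image and cluster.

```python
def scaling_efficiency(single_node_throughput, multi_node_throughput, num_nodes):
    """Return the fraction of ideal linear speedup that was achieved.

    1.0 means perfectly linear scaling; "near-linear" is usually read as
    a value close to 1.0 (the exact threshold is a judgment call).
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if single_node_throughput <= 0:
        raise ValueError("single_node_throughput must be positive")
    ideal = single_node_throughput * num_nodes
    return multi_node_throughput / ideal


def nccl_env(debug=False, socket_ifname="eth0"):
    """Build a dict of NCCL environment variables for a training launcher.

    Hypothetical helper: the defaults are illustrative assumptions,
    not recommended settings for any specific VM size.
    """
    return {
        # "0" leaves the InfiniBand transport enabled.
        "NCCL_IB_DISABLE": "0",
        # Interface used for NCCL's bootstrap/socket traffic (assumed name).
        "NCCL_SOCKET_IFNAME": socket_ifname,
        # Verbose logging helps confirm which transport NCCL selected.
        "NCCL_DEBUG": "INFO" if debug else "WARN",
    }


if __name__ == "__main__":
    # Made-up numbers for illustration: samples/sec on 1 node vs. 8 nodes.
    eff = scaling_efficiency(1000.0, 7600.0, 8)
    print(f"scaling efficiency: {eff:.2f}")
    print(nccl_env(debug=True))
```

The dictionary returned by `nccl_env` could be merged into `os.environ` before launching a training process. Measuring `scaling_efficiency` before and after a settings change is one simple way to check whether the change helped.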
Lilly Tech Systems