Introduction to GPU Monitoring
GPUs are the most expensive and critical resource in any AI infrastructure. A single NVIDIA H100 GPU costs $25,000-$40,000, and a typical AI training cluster has hundreds or thousands of them. Effective GPU monitoring ensures maximum utilization, early fault detection, and optimal resource allocation. This lesson introduces the key concepts and metrics for GPU monitoring.
GPU Architecture Basics for Monitoring
Understanding GPU architecture helps interpret monitoring metrics correctly:
- Streaming Multiprocessors (SMs) — The compute units; GPU utilization measures the percentage of time SMs are active
- HBM/GDDR Memory — High-bandwidth memory for storing model weights and activations; out-of-memory (OOM) errors occur when it is exhausted
- NVLink/NVSwitch — High-speed GPU-to-GPU interconnect; critical for multi-GPU training performance
- PCIe Bus — Connection to the host CPU; bottleneck for data loading and CPU-GPU communication
- Tensor Cores — Specialized matrix multiplication units; utilization indicates how efficiently the GPU runs ML workloads
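Several of the components above can be sampled directly with nvidia-smi query fields (e.g. `utilization.gpu`, `memory.used`, `power.draw`). The sketch below parses one CSV row of such output into a dict; the field list and the sample line are illustrative, and in practice the row would come from running `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`.

```python
import csv
import io

# Query fields as accepted by nvidia-smi's --query-gpu option; each maps to
# one of the architectural components described above.
QUERY_FIELDS = [
    "utilization.gpu",   # % of time one or more SMs were active
    "memory.used",       # MiB of HBM/GDDR in use
    "memory.total",      # MiB of HBM/GDDR on the board
    "temperature.gpu",   # degrees Celsius
    "power.draw",        # watts
]

def parse_query_line(line: str) -> dict:
    """Map one CSV row of nvidia-smi query output onto the field names."""
    values = next(csv.reader(io.StringIO(line)))
    return {k: float(v) for k, v in zip(QUERY_FIELDS, values)}

# Illustrative sample row (no GPU required to parse it).
sample = "87, 64120, 81559, 61, 412.5"
metrics = parse_query_line(sample)
```

Keeping the field list in one place makes it easy to extend the query (e.g. adding `ecc.errors.uncorrected.aggregate.total`) without touching the parser.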
Key GPU Metrics
| Metric Category | Metrics | Why It Matters |
|---|---|---|
| Compute | SM utilization, tensor core utilization | Indicates whether workloads effectively use the GPU |
| Memory | Used/free/total memory, memory bandwidth utilization | OOM prevention and right-sizing workloads |
| Thermal | GPU temperature, memory temperature | Thermal throttling detection, cooling adequacy |
| Power | Power draw, power limit, energy consumption | Power budget management, efficiency tracking |
| Reliability | ECC errors (correctable/uncorrectable), XID errors | Hardware health, predictive maintenance |
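A minimal alerting sketch over the metric categories in the table might look like the following. The threshold values and the input dict keys are illustrative choices, not vendor recommendations.

```python
def gpu_alerts(m: dict) -> list[str]:
    """Return human-readable alerts for one GPU's metric sample.

    Expects keys memory_used_mib, memory_total_mib, temperature_c,
    ecc_uncorrectable (hypothetical names chosen for this sketch).
    """
    alerts = []

    # Memory: GPU memory is not swappable, so high usage means imminent OOM.
    mem_pct = 100.0 * m["memory_used_mib"] / m["memory_total_mib"]
    if mem_pct > 90.0:
        alerts.append(f"memory pressure: {mem_pct:.1f}% used (OOM risk)")

    # Thermal: sustained high temperature triggers clock throttling.
    if m["temperature_c"] >= 85:
        alerts.append("thermal: approaching throttle range")

    # Reliability: any uncorrectable ECC error warrants draining the GPU.
    if m["ecc_uncorrectable"] > 0:
        alerts.append("reliability: uncorrectable ECC errors, drain this GPU")

    return alerts
```

In a real deployment these checks would typically live in an exporter or alerting rule (e.g. Prometheus with DCGM metrics) rather than inline Python, but the thresholds translate directly.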
GPU vs CPU Monitoring Differences
- Utilization interpretation — 100% CPU utilization is often bad (overloaded); 100% GPU utilization is usually good (fully utilized expensive hardware)
- Memory model — GPU memory is fixed and not swappable; running out means immediate failure, not slowdown
- Error handling — GPU ECC errors can silently corrupt training results; they need active monitoring
- Interconnect — GPU-to-GPU communication bandwidth (NVLink) is as important as compute metrics for distributed training
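On the error-handling point above: serious GPU faults surface as Xid events in the kernel log, emitted by the NVIDIA driver as lines like `NVRM: Xid (PCI:0000:1a:00): 79, ...`. A hedged sketch of scanning for them (the sample line and helper name are illustrative):

```python
import re

# Matches the driver's Xid log format: "NVRM: Xid (<pci-address>): <code>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")

def extract_xids(log_lines):
    """Return (pci_address, xid_code) pairs found in kernel log lines."""
    hits = []
    for line in log_lines:
        m = XID_PATTERN.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2))))
    return hits
```

A monitoring agent would feed this from `dmesg` or the journal and map the numeric codes to severities (some Xids are benign application errors, others indicate failing hardware).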
Key Insight: The most common GPU monitoring mistake is looking only at GPU utilization. A GPU can show 100% utilization while only using 10% of its tensor core capability because the workload is memory-bandwidth limited. Always monitor multiple dimensions simultaneously.
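The multi-dimensional check described in the insight above can be sketched as a simple classifier. The inputs would come from profiling counters such as DCGM's `DCGM_FI_PROF_SM_ACTIVE` and `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`; the function itself and its thresholds are a hypothetical helper for illustration, not a DCGM API.

```python
def classify_utilization(sm_active: float, tensor_active: float) -> str:
    """Classify a GPU sample; both inputs are fractions in [0, 1].

    sm_active     -- fraction of time SMs had work (what "GPU utilization"
                     roughly reports)
    tensor_active -- fraction of time tensor cores were doing matrix math
    """
    if sm_active < 0.5:
        return "underutilized: GPU mostly idle"
    # High SM activity with low tensor-core activity is the trap described
    # above: the GPU looks busy but is likely memory-bandwidth bound.
    if tensor_active < 0.2:
        return "busy but inefficient: likely memory-bandwidth bound"
    return "healthy: compute units doing useful matrix work"
```

The exact cutoffs (0.5, 0.2) would be tuned per workload; the point is that neither metric alone tells the story.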
Ready to Master nvidia-smi?
The next lesson provides a comprehensive guide to nvidia-smi, the fundamental GPU monitoring command-line tool.