Introduction to GPU Monitoring
GPUs are the most expensive and critical resource in any AI infrastructure. A single NVIDIA H100 GPU costs $25,000-$40,000, and a typical AI training cluster has hundreds or thousands of them. Effective GPU monitoring ensures maximum utilization, early fault detection, and optimal resource allocation. This lesson introduces the key concepts and metrics for GPU monitoring.
GPU Architecture Basics for Monitoring
Understanding GPU architecture helps interpret monitoring metrics correctly:
- Streaming Multiprocessors (SMs) — The compute units; GPU utilization measures the percentage of time SMs are active
- HBM/GDDR Memory — High-bandwidth memory for storing model weights and activations; out-of-memory (OOM) errors occur when it is exhausted
- NVLink/NVSwitch — High-speed GPU-to-GPU interconnect; critical for multi-GPU training performance
- PCIe Bus — Connection to the host CPU; bottleneck for data loading and CPU-GPU communication
- Tensor Cores — Specialized matrix multiplication units; utilization indicates how efficiently the GPU runs ML workloads
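Several of the components above can be sampled directly with nvidia-smi query fields (e.g. `utilization.gpu`, `memory.used`, `power.draw`). The sketch below parses one CSV row of such output into a dict; the field list and the sample line are illustrative, and in practice the row would come from running `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`.

```python
import csv
import io

# Query fields as accepted by nvidia-smi's --query-gpu option; each maps to
# one of the architectural components described above.
QUERY_FIELDS = [
    "utilization.gpu",   # % of time one or more SMs were active
    "memory.used",       # MiB of HBM/GDDR in use
    "memory.total",      # MiB of HBM/GDDR on the board
    "temperature.gpu",   # degrees Celsius
    "power.draw",        # watts
]

def parse_query_line(line: str) -> dict:
    """Map one CSV row of nvidia-smi query output onto the field names."""
    values = next(csv.reader(io.StringIO(line)))
    return {k: float(v) for k, v in zip(QUERY_FIELDS, values)}

# Illustrative sample row (no GPU required to parse it).
sample = "87, 64120, 81559, 61, 412.5"
metrics = parse_query_line(sample)
```

Keeping the field list in one place makes it easy to extend the query (e.g. adding `ecc.errors.uncorrected.aggregate.total`) without touching the parser.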
Key GPU Metrics
| Metric Category | Metrics | Why It Matters |
|---|---|---|
| Compute | SM utilization, tensor core utilization | Indicates whether workloads effectively use the GPU |
| Memory | Used/free/total memory, memory bandwidth utilization | OOM prevention and right-sizing workloads |
| Thermal | GPU temperature, memory temperature | Thermal throttling detection, cooling adequacy |
| Power | Power draw, power limit, energy consumption | Power budget management, efficiency tracking |
| Reliability | ECC errors (correctable/uncorrectable), XID errors | Hardware health, predictive maintenance |
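A minimal alerting sketch over the metric categories in the table might look like the following. The threshold values and the input dict keys are illustrative choices, not vendor recommendations.

```python
def gpu_alerts(m: dict) -> list[str]:
    """Return human-readable alerts for one GPU's metric sample.

    Expects keys memory_used_mib, memory_total_mib, temperature_c,
    ecc_uncorrectable (hypothetical names chosen for this sketch).
    """
    alerts = []

    # Memory: GPU memory is not swappable, so high usage means imminent OOM.
    mem_pct = 100.0 * m["memory_used_mib"] / m["memory_total_mib"]
    if mem_pct > 90.0:
        alerts.append(f"memory pressure: {mem_pct:.1f}% used (OOM risk)")

    # Thermal: sustained high temperature triggers clock throttling.
    if m["temperature_c"] >= 85:
        alerts.append("thermal: approaching throttle range")

    # Reliability: any uncorrectable ECC error warrants draining the GPU.
    if m["ecc_uncorrectable"] > 0:
        alerts.append("reliability: uncorrectable ECC errors, drain this GPU")

    return alerts
```

In a real deployment these checks would typically live in an exporter or alerting rule (e.g. Prometheus with DCGM metrics) rather than inline Python, but the thresholds translate directly.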
GPU vs CPU Monitoring Differences
- Utilization interpretation — 100% CPU utilization is often bad (overloaded); 100% GPU utilization is usually good (fully utilized expensive hardware)
- Memory model — GPU memory is fixed and not swappable; running out means immediate failure, not slowdown
- Error handling — GPU ECC errors can silently corrupt training results; they need active monitoring
- Interconnect — GPU-to-GPU communication bandwidth (NVLink) is as important as compute metrics for distributed training
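On the error-handling point above: serious GPU faults surface as Xid events in the kernel log, emitted by the NVIDIA driver as lines like `NVRM: Xid (PCI:0000:1a:00): 79, ...`. A hedged sketch of scanning for them (the sample line and helper name are illustrative):

```python
import re

# Matches the driver's Xid log format: "NVRM: Xid (<pci-address>): <code>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")

def extract_xids(log_lines):
    """Return (pci_address, xid_code) pairs found in kernel log lines."""
    hits = []
    for line in log_lines:
        m = XID_PATTERN.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2))))
    return hits
```

A monitoring agent would feed this from `dmesg` or the journal and map the numeric codes to severities (some Xids are benign application errors, others indicate failing hardware).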
Key Insight: The most common GPU monitoring mistake is looking only at GPU utilization. A GPU can show 100% utilization while only using 10% of its tensor core capability because the workload is memory-bandwidth limited. Always monitor multiple dimensions simultaneously.
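The multi-dimensional check described in the insight above can be sketched as a simple classifier. The inputs would come from profiling counters such as DCGM's `DCGM_FI_PROF_SM_ACTIVE` and `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`; the function itself and its thresholds are a hypothetical helper for illustration, not a DCGM API.

```python
def classify_utilization(sm_active: float, tensor_active: float) -> str:
    """Classify a GPU sample; both inputs are fractions in [0, 1].

    sm_active     -- fraction of time SMs had work (what "GPU utilization"
                     roughly reports)
    tensor_active -- fraction of time tensor cores were doing matrix math
    """
    if sm_active < 0.5:
        return "underutilized: GPU mostly idle"
    # High SM activity with low tensor-core activity is the trap described
    # above: the GPU looks busy but is likely memory-bandwidth bound.
    if tensor_active < 0.2:
        return "busy but inefficient: likely memory-bandwidth bound"
    return "healthy: compute units doing useful matrix work"
```

The exact cutoffs (0.5, 0.2) would be tuned per workload; the point is that neither metric alone tells the story.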
Ready to Master nvidia-smi?
The next lesson provides a comprehensive guide to nvidia-smi, the fundamental GPU monitoring command-line tool.