Introduction to GPU Monitoring

GPUs are the most expensive and critical resource in any AI infrastructure. A single NVIDIA H100 GPU costs $25,000-$40,000, and a typical AI training cluster has hundreds or thousands of them. Effective GPU monitoring ensures maximum utilization, early fault detection, and optimal resource allocation. This lesson introduces the key concepts and metrics for GPU monitoring.

GPU Architecture Basics for Monitoring

Understanding GPU architecture helps interpret monitoring metrics correctly:

  • Streaming Multiprocessors (SMs) — The compute units; the commonly reported "GPU utilization" is the fraction of time at least one kernel is executing on the SMs
  • HBM/GDDR Memory — High-bandwidth memory for storing model weights and activations; OOM errors occur when this is exhausted
  • NVLink/NVSwitch — High-speed GPU-to-GPU interconnect; critical for multi-GPU training performance
  • PCIe Bus — Connection to the host CPU; bottleneck for data loading and CPU-GPU communication
  • Tensor Cores — Specialized matrix multiplication units; utilization indicates how efficiently the GPU runs ML workloads
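Several of these components can be polled through `nvidia-smi`. The sketch below shows a minimal poller; the `--query-gpu` field names are real nvidia-smi properties, but the helper functions and the sample CSV line are illustrative, and the live query obviously requires an NVIDIA driver on the host.

```python
import subprocess

# `nvidia-smi --query-gpu` fields, each corresponding to a component above.
FIELDS = [
    "utilization.gpu",   # SM activity, percent
    "memory.used",       # HBM/GDDR in use, MiB
    "memory.total",      # total HBM/GDDR, MiB
    "temperature.gpu",   # die temperature, degrees C
    "power.draw",        # instantaneous draw, watts
]

def parse_query_line(line: str) -> dict:
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))

def query_gpu(index: int = 0) -> dict:
    """Poll a live GPU (requires nvidia-smi on the PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits", "-i", str(index)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_query_line(out.splitlines()[0])
```

For example, a sample line such as `"97, 72414, 81559, 64, 612.45"` parses into a dict keyed by the field names, ready to ship to a metrics backend.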

Key GPU Metrics

  • Compute — SM utilization, tensor core utilization; indicates whether workloads effectively use the GPU
  • Memory — used/free/total memory, memory bandwidth utilization; OOM prevention and right-sizing workloads
  • Thermal — GPU temperature, memory temperature; thermal throttling detection, cooling adequacy
  • Power — power draw, power limit, energy consumption; power budget management, efficiency tracking
  • Reliability — ECC errors (correctable/uncorrectable), XID errors; hardware health, predictive maintenance
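A simple way to act on these categories is a per-sample threshold check. The sketch below assumes a metric dict with the keys shown; the function name and the cutoffs (85 °C, 95% memory) are illustrative defaults, not vendor-specified limits.

```python
def check_gpu_health(m: dict) -> list[str]:
    """Return human-readable alerts for one GPU's metric sample.

    Expected keys (illustrative): temperature_c, memory_used_mib,
    memory_total_mib, ecc_uncorrectable, power_draw_w, power_limit_w.
    """
    alerts = []
    if m["temperature_c"] >= 85:                 # assumed throttle threshold
        alerts.append("thermal: at or near throttle range")
    if m["memory_used_mib"] / m["memory_total_mib"] > 0.95:
        alerts.append("memory: >95% used, OOM risk")
    if m["ecc_uncorrectable"] > 0:               # uncorrectable ECC is never acceptable
        alerts.append("reliability: uncorrectable ECC errors detected")
    if m["power_draw_w"] > m["power_limit_w"]:
        alerts.append("power: draw exceeds configured limit")
    return alerts
```

In practice these thresholds would live in alerting rules (e.g. Prometheus) rather than application code, but the logic is the same: every category gets its own check, not just utilization.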

GPU vs CPU Monitoring Differences

  • Utilization interpretation — 100% CPU utilization is often bad (overloaded); 100% GPU utilization is usually good (fully utilized expensive hardware)
  • Memory model — GPU memory is fixed and not swappable; running out means immediate failure, not slowdown
  • Error handling — GPU ECC errors can silently corrupt training results; they need active monitoring
  • Interconnect — GPU-to-GPU communication bandwidth (NVLink) is as important as compute metrics for distributed training

Key Insight: The most common GPU monitoring mistake is looking only at GPU utilization. A GPU can show 100% utilization while only using 10% of its tensor core capability because the workload is memory-bandwidth limited. Always monitor multiple dimensions simultaneously.
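This insight can be expressed as a cross-check between overall SM activity and tensor core activity (the latter comes from a profiling source such as DCGM rather than plain nvidia-smi). A sketch, with an illustrative function name and cutoffs:

```python
def classify_utilization(sm_active: float, tensor_active: float) -> str:
    """Classify a GPU sample from two fractions in [0, 1].

    sm_active:     fraction of time SMs had work (what plain 'GPU util' shows)
    tensor_active: fraction of cycles tensor cores were busy (DCGM-style metric)
    """
    if sm_active < 0.2:
        return "idle-ish: GPU mostly waiting (data loading or CPU bottleneck?)"
    if sm_active >= 0.8 and tensor_active < 0.2:
        # Looks fully busy, but the matrix units are starved:
        # likely memory-bandwidth bound or running non-tensor kernels.
        return "busy but inefficient: likely memory-bandwidth limited"
    if sm_active >= 0.8 and tensor_active >= 0.5:
        return "healthy: compute-bound ML workload"
    return "mixed: inspect kernel-level profiles"
```

The "busy but inefficient" branch is exactly the trap described above: 100% utilization with almost no tensor core activity.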

Ready to Master nvidia-smi?

The next lesson provides a comprehensive guide to nvidia-smi, the fundamental GPU monitoring command-line tool.

Next: nvidia-smi →