nvidia-smi Deep Dive

nvidia-smi (NVIDIA System Management Interface) is the essential command-line tool for querying and managing NVIDIA GPUs. Most engineers only run the bare nvidia-smi command, but the tool offers extensive querying, monitoring, and configuration capabilities. This lesson covers the features that matter most for day-to-day GPU monitoring and management.

Essential nvidia-smi Commands

Bash
# Basic GPU status (the most common command)
nvidia-smi

# Continuous monitoring (refresh every 1 second)
nvidia-smi -l 1

# Query specific metrics in CSV format
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv

# Monitor specific GPU processes
nvidia-smi pmon -i 0 -s um -d 1

# Check ECC error counts
nvidia-smi --query-gpu=ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv

# View NVLink status and throughput
nvidia-smi nvlink -s -i 0

Understanding nvidia-smi Output

| Field | Description | What to Watch For |
|---|---|---|
| GPU-Util | Percentage of time GPU kernels are executing | Low utilization during training = data loading bottleneck |
| Memory-Usage | GPU memory allocated / total available | Near 100% = risk of OOM; much lower = can increase batch size |
| Temp | GPU die temperature in Celsius | Above 83°C triggers thermal throttling on most GPUs |
| Pwr:Usage/Cap | Current power draw vs. power limit | Sustained power near the cap indicates maximum compute load |
| Perf | Performance state, P0 (max) to P12 (min) | P0 during training is expected; P8+ means the GPU is idle |
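
The thresholds in this table are easy to check from a script. Below is a minimal sketch, assuming an 83°C warning threshold (adjust it for your hardware), that flags GPUs running hot and prints the raw throttle-reason bitmask for context:

Bash
# Sketch: warn when any GPU is running hot, with the active throttle-reason bitmask.
# The 83°C threshold is an assumption; tune it for your cards and cooling.
nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.active \
  --format=csv,noheader,nounits |
while IFS=',' read -r idx temp reasons; do
  temp=$(echo "$temp" | tr -d ' ')
  if [ "$temp" -ge 83 ]; then
    echo "Warning: GPU $idx at ${temp}C (throttle reasons:${reasons})"
  fi
done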

Automation with nvidia-smi Queries

Use nvidia-smi's query mode for scripting and integration with monitoring systems:

Bash
# Export GPU metrics to a log file every 10 seconds
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw \
  --format=csv -l 10 >> /var/log/gpu_metrics.csv

# Alert script: check for idle GPUs
IDLE_GPUS=$(nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | awk -F',' '$2 < 5 {print $1}')
if [ -n "$IDLE_GPUS" ]; then
  echo "Warning: GPUs idle: $IDLE_GPUS"
fi
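
Query mode also covers per-process accounting via --query-compute-apps, which is handy for finding which job is holding GPU memory on a shared node. A minimal sketch:

Bash
# List compute processes and the GPU memory each one holds
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv

# Same query, sorted so the heaviest memory consumer appears first
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory \
  --format=csv,noheader,nounits | sort -t',' -k3 -rn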

MIG (Multi-Instance GPU) Management

On supported GPUs (such as the A100 and H100), nvidia-smi can partition a single GPU into multiple isolated instances. The key subcommands are listed below, followed by a sketch of a full workflow:

  • Enable MIG mode: nvidia-smi -i 0 -mig 1
  • Create GPU instances: nvidia-smi mig -i 0 -cgi 9,9,9 (profile IDs select the partition size; list the ones your GPU supports with nvidia-smi mig -i 0 -lgip)
  • List GPU instances: nvidia-smi mig -i 0 -lgi
  • Destroy GPU instances: nvidia-smi mig -i 0 -dgi
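
As a sketch, a complete MIG workflow on GPU 0 might look like the following. The profile IDs are illustrative and vary by GPU model and driver, so list the valid ones with -lgip first:

Bash
# Sketch of a MIG workflow on GPU 0; profile IDs below are illustrative examples
sudo nvidia-smi -i 0 -mig 1              # enable MIG mode (may require the GPU to be idle / reset)
sudo nvidia-smi mig -i 0 -lgip           # list the GPU instance profiles this GPU supports
sudo nvidia-smi mig -i 0 -cgi 9,9 -C     # create two GPU instances plus default compute instances
nvidia-smi -L                            # MIG devices now appear with their own UUIDs
sudo nvidia-smi mig -i 0 -dci            # tear down compute instances first...
sudo nvidia-smi mig -i 0 -dgi            # ...then destroy the GPU instances
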
Pro Tip: nvidia-smi queries are lightweight in isolation, but each invocation spawns a new process that attaches to the driver, so frequent polling adds measurable overhead. For production monitoring, use DCGM instead of polling nvidia-smi: DCGM maintains a persistent connection to the GPU driver and provides more efficient metric collection.

Ready for Enterprise GPU Monitoring?

The next lesson covers NVIDIA DCGM for fleet-wide GPU monitoring with enterprise features.
