nvidia-smi Deep Dive
nvidia-smi (NVIDIA System Management Interface) is the essential command-line tool for querying and managing NVIDIA GPUs. Most engineers only run the bare nvidia-smi command, but the tool offers extensive querying, monitoring, and configuration capabilities. This lesson covers the features that matter most in practice.
Essential nvidia-smi Commands
```bash
# Basic GPU status (the most common command)
nvidia-smi

# Continuous monitoring (refresh every 1 second)
nvidia-smi -l 1

# Query specific metrics in CSV format
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv

# Monitor per-process utilization and memory on GPU 0, sampling every second
nvidia-smi pmon -i 0 -s um -d 1

# Check ECC error counts
nvidia-smi --query-gpu=ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv

# View NVLink link status for GPU 0
nvidia-smi nvlink -s -i 0
```
Understanding nvidia-smi Output
| Field | Description | What to Watch For |
|---|---|---|
| GPU-Util | Percentage of the sample period during which one or more kernels were executing | Low utilization during training usually indicates a data-loading or CPU bottleneck |
| Memory-Usage | GPU memory allocated / total available | Near 100% = risk of OOM; much lower = can increase batch size |
| Temp | GPU die temperature in Celsius | Sustained temperatures above ~83 °C trigger thermal throttling on most datacenter GPUs |
| Pwr:Usage/Cap | Current power draw vs power limit | Sustained power near cap indicates maximum compute load |
| Perf | Performance state P0 (max) to P12 (min) | P0 during training is expected; P8+ means GPU is idle |
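A convenient way to watch these fields together while a job runs is a single query loop. The sketch below adds the P-state and the active throttle-reason bitmask so thermal or power capping shows up immediately; field names are from `nvidia-smi --help-query-gpu` and availability may vary slightly across driver versions:

```bash
# Poll the fields from the table above every 2 seconds; pstate and the
# throttle-reason bitmask make thermal/power capping visible at a glance.
nvidia-smi \
  --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw,power.limit,pstate,clocks_throttle_reasons.active \
  --format=csv -l 2
```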
Automation with nvidia-smi Queries
Use nvidia-smi's query mode for scripting and integration with monitoring systems:
```bash
# Export GPU metrics to a log file every 10 seconds
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw \
    --format=csv -l 10 >> /var/log/gpu_metrics.csv

# Alert script: report GPUs with utilization below 5%
IDLE_GPUS=$(nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits \
    | awk -F',' '$2 < 5 {print $1}')
if [ -n "$IDLE_GPUS" ]; then
    echo "Warning: GPUs idle: $IDLE_GPUS"
fi
```
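For integration with a metrics stack such as Prometheus, the same CSV output can be reshaped into text-exposition format for node_exporter's textfile collector. This is a minimal sketch; the output directory and metric names (`gpu_utilization_percent`, `gpu_memory_used_mib`) are assumptions you would adapt to your environment:

```bash
#!/usr/bin/env bash
# Sketch: export GPU utilization and memory for node_exporter's textfile collector.
# Assumes node_exporter runs with --collector.textfile.directory set to the path below.
set -euo pipefail

OUT=/var/lib/node_exporter/textfile_collector/gpu.prom
TMP="${OUT}.tmp"

{
  echo "# TYPE gpu_utilization_percent gauge"
  echo "# TYPE gpu_memory_used_mib gauge"
  nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits \
    | while IFS=', ' read -r idx util mem; do
        echo "gpu_utilization_percent{gpu=\"$idx\"} $util"
        echo "gpu_memory_used_mib{gpu=\"$idx\"} $mem"
      done
} > "$TMP"

mv "$TMP" "$OUT"   # atomic replace so the collector never reads a partial file
```

Run it from cron or a systemd timer at whatever interval your scrape configuration expects.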
MIG (Multi-Instance GPU) Management
On supported GPUs (A100, H100), nvidia-smi can partition a single GPU into multiple isolated instances; a complete workflow sketch follows this list:

- Enable MIG mode — `nvidia-smi -i 0 -mig 1` (requires root; takes effect once no processes are using the GPU, or after a GPU reset)
- Create GPU instances — `nvidia-smi mig -i 0 -cgi 14,14,14 -C` (three equal 2g partitions; profile IDs vary by GPU, list them with `nvidia-smi mig -lgip`)
- List instances — `nvidia-smi mig -i 0 -lgi`
- Destroy instances — `nvidia-smi mig -i 0 -dgi` (destroy compute instances with `-dci` first)
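Putting the commands together, a minimal end-to-end sketch: profile 14 is the 2g.10gb profile on an A100 40GB, so confirm the IDs for your hardware with `-lgip`, and run everything as root with no active GPU processes.

```bash
# 1. Enable MIG mode on GPU 0 (may require stopping clients / resetting the GPU)
sudo nvidia-smi -i 0 -mig 1

# 2. List the GPU instance profiles this GPU supports, with their IDs
sudo nvidia-smi mig -i 0 -lgip

# 3. Create three 2g GPU instances with a default compute instance in each (-C)
sudo nvidia-smi mig -i 0 -cgi 14,14,14 -C

# 4. Verify the layout: GPU instances and compute instances
sudo nvidia-smi mig -i 0 -lgi
sudo nvidia-smi mig -i 0 -lci

# 5. Tear down in reverse order: compute instances first, then GPU instances
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi

# 6. Disable MIG mode
sudo nvidia-smi -i 0 -mig 0
```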
Pro Tip: nvidia-smi queries are individually lightweight, but each invocation spawns a new process and talks to the driver, so polling too frequently adds overhead. For production monitoring, use DCGM instead of polling nvidia-smi: DCGM maintains a persistent connection to the GPU driver and collects metrics more efficiently.
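If DCGM is not an option, `nvidia-smi dmon` is a lighter-weight middle ground: one long-lived process streams a fixed set of device metrics instead of spawning a fresh query each interval. A quick sketch (flags per `nvidia-smi dmon -h`; the metric groups selectable via `-s` can differ slightly by driver version):

```bash
# Stream power/temperature (p), utilization (u), memory (m), and PCIe throughput (t)
# every 5 seconds from a single long-lived process, with date/time columns (-o DT).
nvidia-smi dmon -s pumt -d 5 -o DT

# Optionally log to a file for later analysis
nvidia-smi dmon -s pumt -d 5 -o DT -f /var/log/gpu_dmon.log
```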
Ready for Enterprise GPU Monitoring?
The next lesson covers NVIDIA DCGM for fleet-wide GPU monitoring with enterprise features.
Next: DCGM →