Grafana GPU Dashboards (Intermediate)
This lesson covers building production-grade Grafana dashboards for GPU fleet monitoring. You will learn to create utilization heatmaps, per-GPU memory tracking panels, thermal monitoring views, error rate displays, and power consumption charts using DCGM Exporter metrics from Prometheus.
GPU Fleet Overview Dashboard
The fleet overview provides a bird's-eye view of all GPUs across your cluster. The PromQL queries below back its stat and heatmap panels:
    # Total GPU count
    count(DCGM_FI_DEV_GPU_UTIL)

    # Average GPU utilization across the fleet
    avg(DCGM_FI_DEV_GPU_UTIL)

    # GPUs with utilization below 10% (idle)
    count(DCGM_FI_DEV_GPU_UTIL < 10)

    # GPU utilization heatmap (use in heatmap panel)
    DCGM_FI_DEV_GPU_UTIL
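One gotcha with the idle-GPU count: when no GPU is under the threshold, the filter yields an empty vector and `count()` returns no data instead of 0, so a stat panel shows "No data". The sketch below shows one common way to handle that, plus a per-node breakdown. The `Hostname` label is an assumption about how your DCGM Exporter labels its series; substitute whatever node label your setup exposes.

```promql
# Idle GPU count that falls back to 0 when no GPU is below the threshold
count(DCGM_FI_DEV_GPU_UTIL < 10) or vector(0)

# Fraction of the fleet that is idle (0 to 1)
(count(DCGM_FI_DEV_GPU_UTIL < 10) or vector(0)) / count(DCGM_FI_DEV_GPU_UTIL)

# Average utilization per node (assumes a Hostname label on the series)
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
```

The `or vector(0)` fallback works here because `count()` without a `by` clause produces a single series with an empty label set, which is the same label set `vector(0)` carries.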
Per-GPU Detail Panels
- Utilization time series — Stacked area chart showing compute and memory utilization over time for each GPU
- Memory waterfall — Bar gauge showing memory used vs total for each GPU
- Temperature heatmap — Color-coded GPU temperature across all devices, highlighting hotspots
- Power draw — Time series with power limit line overlay to show headroom
- ECC error counter — Table showing correctable and uncorrectable error counts per GPU
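The panels above could be backed by queries along these lines. This is a sketch rather than a definitive mapping: the field names are standard DCGM field identifiers, but several of them (notably the power limit and ECC counters) may not be enabled in your exporter's counter list by default, so confirm they appear on the exporter's `/metrics` endpoint first. Each query goes in its own query field of the panel.

```promql
# Utilization time series: compute and memory-copy utilization (percent)
DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL

# Memory bar gauge: framebuffer used as a fraction of used + free
# (approximation: ignores any memory the driver reports as reserved)
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

# Temperature heatmap: GPU temperature in degrees Celsius
DCGM_FI_DEV_GPU_TEMP

# Power draw in watts, with the configured power limit as an overlay series
DCGM_FI_DEV_POWER_USAGE
DCGM_FI_DEV_POWER_MGMT_LIMIT

# ECC table: single-bit (correctable) and double-bit (uncorrectable) volatile totals
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
```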
NVLink and Interconnect Dashboard
For multi-GPU training, NVLink bandwidth is critical. Create panels that show:
    # NVLink transmit bandwidth per GPU (GB/s)
    rate(DCGM_FI_PROF_NVLINK_TX_BYTES[1m]) / 1e9

    # NVLink receive bandwidth per GPU (GB/s)
    rate(DCGM_FI_PROF_NVLINK_RX_BYTES[1m]) / 1e9

    # PCIe throughput, host-GPU communication (GB/s)
    rate(DCGM_FI_PROF_PCIE_TX_BYTES[1m]) / 1e9
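Whether `rate()` is appropriate depends on how your exporter exposes these fields. DCGM documents the `DCGM_FI_PROF_*` byte fields as bytes per second, and some DCGM Exporter versions export them as gauges rather than counters. Check the `# TYPE` line for each metric on the exporter's `/metrics` endpoint. If it says `gauge`, a variant like the following sketch is likely the better fit; it assumes gauge-type series and, for the per-node sum, a `Hostname` label.

```promql
# Assumes gauge-type metrics that already report bytes per second;
# avg_over_time smooths the gauge over the window instead of applying rate()
avg_over_time(DCGM_FI_PROF_NVLINK_TX_BYTES[1m]) / 1e9
avg_over_time(DCGM_FI_PROF_NVLINK_RX_BYTES[1m]) / 1e9

# PCIe receive direction, to pair with the transmit query
avg_over_time(DCGM_FI_PROF_PCIE_RX_BYTES[1m]) / 1e9

# Total NVLink traffic per node in GB/s (assumes a Hostname label)
sum by (Hostname) (
  avg_over_time(DCGM_FI_PROF_NVLINK_TX_BYTES[1m])
  + avg_over_time(DCGM_FI_PROF_NVLINK_RX_BYTES[1m])
) / 1e9
```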
Dashboard Organization
Organize GPU dashboards in a hierarchy:
- Fleet Overview
  High-level stats for the entire GPU fleet: total GPUs, average utilization, alerts firing, cost.
- Node Detail
  Per-node view showing all GPUs on a specific server with their individual metrics.
- GPU Detail
  Deep dive into a single GPU: all DCGM metrics, process list, historical trends.
- Job View
  GPU metrics filtered by training job or namespace, correlating GPU usage with ML workload progress.
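One way to wire this hierarchy together is with Grafana template variables, so that each tier reuses the same queries with a narrower filter. The sketch below assumes variables named `node`, `gpu`, and `namespace` (hypothetical names chosen for illustration) and the label names `Hostname`, `gpu`, `namespace`, and `pod`. The Kubernetes labels are only present when the exporter runs in a cluster with pod mapping enabled, and depending on your scrape configuration they may show up as `exported_namespace` and `exported_pod` instead.

```promql
# Grafana variable queries (Prometheus data source), not panel queries:
#   node:      label_values(DCGM_FI_DEV_GPU_UTIL, Hostname)
#   gpu:       label_values(DCGM_FI_DEV_GPU_UTIL{Hostname=~"$node"}, gpu)
#   namespace: label_values(DCGM_FI_DEV_GPU_UTIL, namespace)

# Fleet Overview: alerts currently firing (Prometheus's built-in ALERTS series)
count(ALERTS{alertstate="firing"}) or vector(0)

# Node Detail: every GPU on the selected node
DCGM_FI_DEV_GPU_UTIL{Hostname=~"$node"}

# GPU Detail: a single GPU on the selected node
DCGM_FI_DEV_GPU_UTIL{Hostname=~"$node", gpu=~"$gpu"}

# Job View: average utilization per pod in the selected namespace
avg by (namespace, pod) (DCGM_FI_DEV_GPU_UTIL{namespace=~"$namespace"})
```

Using the regex matcher `=~` rather than `=` keeps the queries working when a variable is configured for multi-select or "All".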
The next lesson covers GPU scheduling strategies including MIG partitioning, time-slicing, and topology-aware placement.
Lilly Tech Systems