Grafana GPU Dashboards (Intermediate)

This lesson covers building production-grade Grafana dashboards for GPU fleet monitoring. You will learn to create utilization heatmaps, per-GPU memory tracking panels, thermal monitoring views, error rate displays, and power consumption charts using DCGM Exporter metrics from Prometheus.

GPU Fleet Overview Dashboard

The fleet overview provides a bird's-eye view of all GPUs across your cluster:

PromQL
# Total GPU count
count(DCGM_FI_DEV_GPU_UTIL)

# Average GPU utilization across the fleet
avg(DCGM_FI_DEV_GPU_UTIL)

# GPUs with utilization below 10% (idle).
# "or vector(0)" makes the stat show 0 instead of "No data" when no GPU is idle.
count(DCGM_FI_DEV_GPU_UTIL < 10) or vector(0)

# GPU utilization heatmap (use in heatmap panel)
DCGM_FI_DEV_GPU_UTIL
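
Instantaneous utilization can dip below 10% between training steps, so an idle count based on a single sample tends to flicker. As a rough sketch, assuming a 15-minute window is a reasonable definition of "idle" for your workloads and that the exporter attaches a Hostname label to each series, the same stats can be smoothed and broken down per node:

PromQL
# GPUs whose average utilization over the last 15 minutes is below 10%
count(avg_over_time(DCGM_FI_DEV_GPU_UTIL[15m]) < 10) or vector(0)

# Share of the fleet that is idle by that definition, in percent
100 * (count(avg_over_time(DCGM_FI_DEV_GPU_UTIL[15m]) < 10) or vector(0))
  / count(DCGM_FI_DEV_GPU_UTIL)

# Average utilization per node (the Hostname label is an assumption about your setup)
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)

The 15-minute window and the 10% threshold are illustrative; widen the window if your jobs have long data-loading phases.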

Per-GPU Detail Panels

  • Utilization time series — Stacked area chart showing compute and memory utilization over time for each GPU
  • Memory waterfall — Bar gauge showing memory used vs total for each GPU
  • Temperature heatmap — Color-coded GPU temperature across all devices, highlighting hotspots
  • Power draw — Time series with power limit line overlay to show headroom
  • ECC error counter — Table showing correctable and uncorrectable error counts per GPU
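
The panels above map onto a handful of DCGM fields. The queries below are a sketch rather than a definitive set. The ECC and power-limit fields may not be in the exporter's default metrics list and may need to be enabled in its counters file. $node is a hypothetical Grafana dashboard variable, and the Hostname label is assumed. Each uncommented line (or indented continuation) is a separate panel query:

PromQL
# Utilization time series: compute and memory-copy utilization (%)
DCGM_FI_DEV_GPU_UTIL{Hostname=~"$node"}
DCGM_FI_DEV_MEM_COPY_UTIL{Hostname=~"$node"}

# Memory bar gauge: framebuffer used as a percentage of used + free
# (approximate if part of the framebuffer is reserved and reported separately)
100 * DCGM_FI_DEV_FB_USED{Hostname=~"$node"}
  / (DCGM_FI_DEV_FB_USED{Hostname=~"$node"} + DCGM_FI_DEV_FB_FREE{Hostname=~"$node"})

# Temperature heatmap: GPU temperature in degrees Celsius
DCGM_FI_DEV_GPU_TEMP{Hostname=~"$node"}

# Power draw in watts, with the power limit as a second query for the overlay
DCGM_FI_DEV_POWER_USAGE{Hostname=~"$node"}
DCGM_FI_DEV_POWER_MGMT_LIMIT{Hostname=~"$node"}

# ECC error table: single-bit (correctable) and double-bit (uncorrectable) volatile totals
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL{Hostname=~"$node"}
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{Hostname=~"$node"}

If the power-limit field is not exported in your environment, a fixed threshold line in the panel settings gives a similar headroom view.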

NVLink and Interconnect Dashboard

For multi-GPU training, NVLink bandwidth is critical. Create panels that show:

PromQL
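# NVLink transmit bandwidth per GPU, in GB/s.
# DCGM documents the DCGM_FI_PROF_* bandwidth fields as average rates in bytes per
# second, so they are scaled directly here. Wrap them in rate(...[1m]) only if your
# exporter version exposes them as cumulative counters.
DCGM_FI_PROF_NVLINK_TX_BYTES / 1e9

# NVLink receive bandwidth per GPU, in GB/s
DCGM_FI_PROF_NVLINK_RX_BYTES / 1e9

# PCIe throughput (host-GPU communication), transmit and receive, in GB/s
DCGM_FI_PROF_PCIE_TX_BYTES / 1e9
DCGM_FI_PROF_PCIE_RX_BYTES / 1e9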

Dashboard Organization

Organize GPU dashboards in a hierarchy:

  1. Fleet Overview

    High-level stats for the entire GPU fleet: total GPUs, average utilization, alerts firing, cost.

  2. Node Detail

    Per-node view showing all GPUs on a specific server with their individual metrics.

  3. GPU Detail

    Deep dive into a single GPU: all DCGM metrics, process list, historical trends.

  4. Job View

    GPU metrics filtered by training job or namespace, correlating GPU usage with ML workload progress.
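
One way to wire the four levels together is with dashboard variables, so that each level narrows the same metric with a tighter label filter. This sketch assumes variables named $node, $gpu, $namespace and $pod (hypothetical names). It also assumes the exporter attaches Hostname and gpu labels and, on Kubernetes, namespace and pod labels. Depending on your scrape configuration, the pod labels may appear under different names such as exported_pod.

PromQL
# 1. Fleet Overview: no filter, aggregate across everything
avg(DCGM_FI_DEV_GPU_UTIL)

# 2. Node Detail: all GPUs on the selected node
DCGM_FI_DEV_GPU_UTIL{Hostname="$node"}

# 3. GPU Detail: a single GPU on that node
DCGM_FI_DEV_GPU_UTIL{Hostname="$node", gpu="$gpu"}

# 4. Job View: average utilization of the GPUs attached to each pod of a job
avg by (pod) (DCGM_FI_DEV_GPU_UTIL{namespace="$namespace", pod=~"$pod"})

The variables themselves can be populated from label values. With Grafana's Prometheus data source, a variable query such as label_values(DCGM_FI_DEV_GPU_UTIL, Hostname) lists the nodes that currently report GPU metrics.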

Dashboard Tip: Import the official NVIDIA DCGM Exporter Grafana dashboard (ID: 12239) as a starting point, then customize it for your specific needs. This saves hours of initial setup time.
