NVIDIA DCGM Intermediate

NVIDIA Data Center GPU Manager (DCGM) is the enterprise-grade tool for managing and monitoring NVIDIA GPUs at scale. Unlike nvidia-smi, DCGM runs as a persistent daemon, supports remote monitoring, provides GPU health diagnostics, and integrates natively with Prometheus and Kubernetes via the DCGM Exporter.

DCGM Architecture

  • nv-hostengine — The DCGM daemon that runs on each GPU node, collecting metrics and managing GPU state
  • dcgmi — The CLI tool for interacting with the DCGM daemon
  • DCGM Exporter — A container that exposes DCGM metrics in Prometheus format on port 9400
  • DCGM APIs — C, Python, and Go bindings for programmatic GPU management
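Once nv-hostengine is running, you can confirm connectivity and enumerate the GPUs it manages with dcgmi. A quick sketch (the remote hostname is a placeholder, and the systemd unit may be named dcgm on older packages):

Bash
# Start the DCGM daemon if it is not already running (systemd-based install)
sudo systemctl start nvidia-dcgm

# List the GPUs visible to the local nv-hostengine
dcgmi discovery -l

# Query a remote nv-hostengine instead (default port 5555)
dcgmi discovery -l --host gpu-node-01:5555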

Deploying DCGM Exporter on Kubernetes

Bash
# Deploy DCGM Exporter as a DaemonSet
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.interval=15s
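After the DaemonSet is up, it is worth verifying that metrics are actually flowing before wiring up dashboards. A quick spot check, assuming the chart's default labels and service name:

Bash
# Confirm an exporter pod is running on each GPU node
kubectl get pods -n monitoring -l app.kubernetes.io/name=dcgm-exporter -o wide

# Port-forward the exporter service and pull the raw metrics
kubectl port-forward -n monitoring svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL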

DCGM Health Checks

DCGM provides comprehensive GPU health diagnostics that go far beyond what nvidia-smi offers:

Bash
# Enable background health watches (a = all subsystems) for group 0
dcgmi health -g 0 -s a

# Check current health status against the active watches (near-instant)
dcgmi health -g 0 -c

# Run a quick diagnostic (Level 1 - seconds)
dcgmi diag -r 1 -g 0

# Run a medium diagnostic (Level 2 - minutes)
dcgmi diag -r 2 -g 0

# Run a full diagnostic (Level 3 - 10+ minutes)
dcgmi diag -r 3 -g 0

Key DCGM Metrics for Prometheus

DCGM Metric              Prometheus Name                                              Use Case
GPU Utilization          DCGM_FI_DEV_GPU_UTIL                                         Overall GPU compute activity
Memory Copy Utilization  DCGM_FI_DEV_MEM_COPY_UTIL                                    Memory bandwidth saturation
Tensor Core Activity     DCGM_FI_PROF_PIPE_TENSOR_ACTIVE                              ML workload efficiency
NVLink Bandwidth         DCGM_FI_PROF_NVLINK_TX_BYTES / DCGM_FI_PROF_NVLINK_RX_BYTES  Multi-GPU communication health
XID Errors               DCGM_FI_DEV_XID_ERRORS                                       GPU hardware fault detection
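The same fields can be sampled straight from the command line with dcgmi dmon, which is handy for sanity-checking what the exporter will scrape. The field IDs below map to the table above (203 = GPU utilization, 204 = memory copy utilization); confirm IDs for your DCGM version with dcgmi dmon -l:

Bash
# Stream GPU utilization (203) and memory copy utilization (204)
# once per second, for 10 samples
dcgmi dmon -e 203,204 -d 1000 -c 10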

GPU Groups and Policies

DCGM allows you to create GPU groups and apply monitoring policies:

  • GPU groups — Organize GPUs by function (training pool, inference pool) for targeted monitoring
  • Field groups — Define custom sets of metrics to collect for different monitoring needs
  • Policies — Set thresholds that trigger actions (e.g., drain a node when ECC errors exceed a threshold)
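As a sketch, creating a training-pool group and a custom field group looks like this (GPU indices are examples, and the group ID printed by the create step will vary on your system):

Bash
# Create a named group, then add GPUs 0 and 1 to it
# (dcgmi prints the new group's ID on create; 2 is used here as an example)
dcgmi group -c training-pool
dcgmi group -g 2 -a 0,1

# Define a field group containing GPU and memory copy utilization
dcgmi fieldgroup -c util-fields -f 203,204

# Confirm the configuration
dcgmi group -l
dcgmi fieldgroup -l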

Production Tip: Run DCGM Level 2 diagnostics on every new GPU node before admitting it to the cluster, and again after driver updates. This catches hardware issues before they corrupt training runs.
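One way to automate that gate is a small admission script that runs the Level 2 diagnostic and only labels the node when it passes. This is a sketch, not a definitive implementation: it assumes dcgmi diag exits nonzero when a test fails, and the gpu-health=verified label is an arbitrary example.

Bash
#!/usr/bin/env bash
set -euo pipefail

NODE_NAME=$(hostname)

# Run the Level 2 diagnostic and branch on its exit status
if dcgmi diag -r 2; then
  kubectl label node "$NODE_NAME" gpu-health=verified --overwrite
else
  echo "DCGM diagnostics failed on $NODE_NAME; keeping node cordoned" >&2
  kubectl cordon "$NODE_NAME"
  exit 1
fi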

Ready to Build GPU Dashboards?

The next lesson covers creating Grafana dashboards specifically for GPU fleet monitoring using DCGM metrics.

Next: Grafana GPU →