NVIDIA DCGM Intermediate

NVIDIA Data Center GPU Manager (DCGM) is the enterprise-grade tool for managing and monitoring NVIDIA GPUs at scale. Unlike nvidia-smi, DCGM runs as a persistent daemon, supports remote monitoring, provides GPU health diagnostics, and integrates natively with Prometheus and Kubernetes via the DCGM Exporter.

DCGM Architecture

  • nv-hostengine — The DCGM daemon that runs on each GPU node, collecting metrics and managing GPU state
  • dcgmi — The CLI tool for interacting with the DCGM daemon
  • DCGM Exporter — A container that exposes DCGM metrics in Prometheus format on port 9400
  • DCGM APIs — C, Python, and Go bindings for programmatic GPU management
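Once nv-hostengine is running, you can confirm connectivity and enumerate the GPUs it manages with dcgmi. A quick sketch (the remote hostname is a placeholder, and the systemd unit may be named dcgm on older packages):

Bash
# Start the DCGM daemon if it is not already running (systemd-based install)
sudo systemctl start nvidia-dcgm

# List the GPUs visible to the local nv-hostengine
dcgmi discovery -l

# Query a remote nv-hostengine instead (default port 5555)
dcgmi discovery -l --host gpu-node-01:5555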

Deploying DCGM Exporter on Kubernetes

Bash
# Deploy DCGM Exporter as a DaemonSet
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.interval=15s
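After the DaemonSet is up, it is worth verifying that metrics are actually flowing before wiring up dashboards. A quick spot check, assuming the chart's default labels and service name:

Bash
# Confirm an exporter pod is running on each GPU node
kubectl get pods -n monitoring -l app.kubernetes.io/name=dcgm-exporter -o wide

# Port-forward the exporter service and pull the raw metrics
kubectl port-forward -n monitoring svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL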

DCGM Health Checks

DCGM provides comprehensive GPU health diagnostics that go far beyond what nvidia-smi offers:

Bash
# Enable background health watches (a = all subsystems) for group 0
dcgmi health -g 0 -s a

# Check current health status against the active watches (near-instant)
dcgmi health -g 0 -c

# Run a quick diagnostic (Level 1 - seconds)
dcgmi diag -r 1 -g 0

# Run a medium diagnostic (Level 2 - minutes)
dcgmi diag -r 2 -g 0

# Run a full diagnostic (Level 3 - 10+ minutes)
dcgmi diag -r 3 -g 0

Key DCGM Metrics for Prometheus

DCGM Metric              Prometheus Name                                              Use Case
GPU Utilization          DCGM_FI_DEV_GPU_UTIL                                         Overall GPU compute activity
Memory Copy Utilization  DCGM_FI_DEV_MEM_COPY_UTIL                                    Memory bandwidth saturation
Tensor Core Activity     DCGM_FI_PROF_PIPE_TENSOR_ACTIVE                              ML workload efficiency
NVLink Bandwidth         DCGM_FI_PROF_NVLINK_TX_BYTES / DCGM_FI_PROF_NVLINK_RX_BYTES  Multi-GPU communication health
XID Errors               DCGM_FI_DEV_XID_ERRORS                                       GPU hardware fault detection
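The same fields can be sampled straight from the command line with dcgmi dmon, which is handy for sanity-checking what the exporter will scrape. The field IDs below map to the table above (203 = GPU utilization, 204 = memory copy utilization); confirm IDs for your DCGM version with dcgmi dmon -l:

Bash
# Stream GPU utilization (203) and memory copy utilization (204)
# once per second, for 10 samples
dcgmi dmon -e 203,204 -d 1000 -c 10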

GPU Groups and Policies

DCGM allows you to create GPU groups and apply monitoring policies:

  • GPU groups — Organize GPUs by function (training pool, inference pool) for targeted monitoring
  • Field groups — Define custom sets of metrics to collect for different monitoring needs
  • Policies — Set thresholds that trigger actions (e.g., drain a node when ECC errors exceed a threshold)
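As a sketch, creating a training-pool group and a custom field group looks like this (GPU indices are examples, and the group ID printed by the create step will vary on your system):

Bash
# Create a named group, then add GPUs 0 and 1 to it
# (dcgmi prints the new group's ID on create; 2 is used here as an example)
dcgmi group -c training-pool
dcgmi group -g 2 -a 0,1

# Define a field group containing GPU and memory copy utilization
dcgmi fieldgroup -c util-fields -f 203,204

# Confirm the configuration
dcgmi group -l
dcgmi fieldgroup -l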

Production Tip: Run DCGM Level 2 diagnostics on every new GPU node before admitting it to the cluster, and again after driver updates. This catches hardware issues before they corrupt training runs.
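One way to automate that gate is a small admission script that runs the Level 2 diagnostic and only labels the node when it passes. This is a sketch, not a definitive implementation: it assumes dcgmi diag exits nonzero when a test fails, and the gpu-health=verified label is an arbitrary example.

Bash
#!/usr/bin/env bash
set -euo pipefail

NODE_NAME=$(hostname)

# Run the Level 2 diagnostic and branch on its exit status
if dcgmi diag -r 2; then
  kubectl label node "$NODE_NAME" gpu-health=verified --overwrite
else
  echo "DCGM diagnostics failed on $NODE_NAME; keeping node cordoned" >&2
  kubectl cordon "$NODE_NAME"
  exit 1
fi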

Ready to Build GPU Dashboards?

The next lesson covers creating Grafana dashboards specifically for GPU fleet monitoring using DCGM metrics.

Next: Grafana GPU →