NVIDIA DCGM Intermediate
NVIDIA Data Center GPU Manager (DCGM) is the enterprise-grade tool for managing and monitoring NVIDIA GPUs at scale. Unlike nvidia-smi, DCGM runs as a persistent daemon, supports remote monitoring, provides GPU health diagnostics, and integrates natively with Prometheus and Kubernetes via the DCGM Exporter.
DCGM Architecture
- nv-hostengine — The DCGM daemon that runs on each GPU node, collecting metrics and managing GPU state
- dcgmi — The CLI tool for interacting with the DCGM daemon
- DCGM Exporter — A container that exposes DCGM metrics in Prometheus format on port 9400
- DCGM APIs — C, Python, and Go bindings for programmatic GPU management
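As a quick sanity check after installing DCGM, you can start the daemon and confirm it sees your GPUs. `nv-hostengine` and `dcgmi discovery -l` are real commands, but the exact discovery output format varies by DCGM version — the sample line below is hypothetical, and `gpu_count` is an illustrative helper, not part of DCGM:

```shell
# Start the DCGM daemon (often managed by systemd: systemctl start nvidia-dcgm)
# nv-hostengine

# List the GPUs the daemon can see:
# dcgmi discovery -l

# Illustrative helper: extract the GPU count from discovery output,
# which typically begins with a line like "8 GPUs found."
gpu_count() {
  awk '/GPUs? found/ {print $1; exit}'
}

# Hypothetical first line of `dcgmi discovery -l` output:
sample='8 GPUs found.'
printf '%s\n' "$sample" | gpu_count   # prints 8
```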
Deploying DCGM Exporter on Kubernetes
Bash
# Deploy DCGM Exporter as a DaemonSet
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
--namespace monitoring \
--create-namespace \
--set serviceMonitor.enabled=true \
--set serviceMonitor.interval=15s
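After the install, verify the exporter is actually serving metrics. The `kubectl`/`curl` lines need a live cluster, so they are shown commented; the `metric_value` helper and the sample scrape line are illustrative assumptions about Prometheus exposition text, not DCGM tooling:

```shell
# Wait for the DaemonSet to roll out, then scrape the endpoint (requires a cluster):
# kubectl -n monitoring rollout status ds/dcgm-exporter
# kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400 &
# curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

# Illustrative helper: pull the value of a named metric out of
# Prometheus exposition-format text (metric name, optional labels, value).
metric_value() {
  awk -v m="$1" '$0 ~ ("^" m "[{ ]") {print $NF; exit}'
}

# Hypothetical sample line from a scrape:
sample='DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc"} 87'
printf '%s\n' "$sample" | metric_value DCGM_FI_DEV_GPU_UTIL   # prints 87
```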
DCGM Health Checks
DCGM provides comprehensive GPU health diagnostics that go far beyond what nvidia-smi offers:
Bash
# Run a quick health check (Level 1 - seconds)
dcgmi health -c -g 0

# Run a medium diagnostic (Level 2 - minutes)
dcgmi diag -r 2 -g 0

# Run a full diagnostic (Level 3 - 10+ minutes)
dcgmi diag -r 3 -g 0

# Watch health status continuously
dcgmi health -w -g 0
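In automation it helps to reduce the health report to a pass/fail signal. The sketch below assumes `dcgmi health -c` prints a summary line like "Overall Health: Healthy" — the exact wording may differ across DCGM versions, and the sample reports are hypothetical:

```shell
# Reduce `dcgmi health -c -g 0` output to a pass/fail exit status.
# Assumes the report contains a line like "Overall Health: Healthy".
health_ok() {
  grep -q 'Overall Health:[[:space:]]*Healthy'
}

# Hypothetical sample reports:
if printf 'Overall Health: Healthy\n' | health_ok; then echo PASS; fi   # prints PASS
if ! printf 'Overall Health: Warning\n' | health_ok; then echo FAIL; fi # prints FAIL
```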
Key DCGM Metrics for Prometheus
| DCGM Metric | Prometheus Name | Use Case |
|---|---|---|
| GPU Utilization | DCGM_FI_DEV_GPU_UTIL | Overall GPU compute activity |
| Memory Copy Utilization | DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth saturation |
| Tensor Core Activity | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | ML workload efficiency |
| NVLink Bandwidth | DCGM_FI_PROF_NVLINK_TX_BYTES / DCGM_FI_PROF_NVLINK_RX_BYTES | Multi-GPU communication health |
| XID Errors | DCGM_FI_DEV_XID_ERRORS | GPU hardware fault detection |
GPU Groups and Policies
DCGM allows you to create GPU groups and apply monitoring policies:
- GPU groups — Organize GPUs by function (training pool, inference pool) for targeted monitoring
- Field groups — Define custom sets of metrics to collect for different monitoring needs
- Policies — Set thresholds that trigger actions (e.g., drain a node when ECC errors exceed a threshold)
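A typical group workflow looks like the commented commands below. `dcgmi group -c` prints the new group's ID, which later commands need; the `group_id` parser, the sample creation message, and the specific field IDs (203 = GPU utilization, 252 = framebuffer used) are illustrative assumptions:

```shell
# Group/fieldgroup workflow (requires a running nv-hostengine):
# dcgmi group -c training-pool                 # prints the new group ID
# dcgmi group -g <ID> -a 0,1,2,3               # add GPUs 0-3 to the group
# dcgmi fieldgroup -c util-fields -f 203,252   # custom field group (example field IDs)

# Illustrative helper: extract the numeric group ID from the creation message.
group_id() {
  grep -Eo '[0-9]+' | tail -n1
}

# Hypothetical creation message:
msg='Successfully created group "training-pool" with a group ID of 2'
printf '%s\n' "$msg" | group_id   # prints 2
```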
Production Tip: Run DCGM Level 2 diagnostics on every new GPU node before admitting it to the cluster, and again after driver updates. This catches hardware issues before they corrupt training runs.
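The admission gate above can be sketched as a small script. It assumes `dcgmi diag` exits non-zero when any test fails (behavior may vary by DCGM version), and `admit_node` plus the cordon/uncordon commands in comments are hypothetical glue, not DCGM features:

```shell
# Gate node admission on a Level 2 diagnostic (sketch; assumes `dcgmi diag`
# exits non-zero when any test fails).
admit_node() {
  local diag_status=$1   # exit status of: dcgmi diag -r 2
  if [ "$diag_status" -eq 0 ]; then
    echo "admit"         # e.g. kubectl uncordon "$NODE"
  else
    echo "quarantine"    # e.g. kubectl cordon "$NODE"
  fi
}

# On a real node you would run:
#   dcgmi diag -r 2; admit_node $?
admit_node 0   # prints admit
admit_node 1   # prints quarantine
```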
Ready to Build GPU Dashboards?
The next lesson covers creating Grafana dashboards specifically for GPU fleet monitoring using DCGM metrics.
Next: Grafana GPU →
Lilly Tech Systems