Prometheus for AI Infrastructure (Intermediate)
Prometheus is the de facto standard for metrics collection in Kubernetes environments. For AI infrastructure, Prometheus collects GPU metrics via DCGM Exporter, system metrics via node-exporter, and custom ML metrics via application-level instrumentation. This lesson covers deploying and configuring Prometheus specifically for ML workloads.
Deploying Prometheus with kube-prometheus-stack
Bash
# Add the chart repository and refresh the local chart index
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack with 30 days of retention and a 100Gi volume
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi
GPU Metrics with DCGM Exporter
The NVIDIA DCGM Exporter exposes GPU metrics in Prometheus format. Key metrics include:
| Metric | Description | Alert Threshold |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization percentage | <10% for >30min (idle GPU) |
| DCGM_FI_DEV_FB_USED | GPU framebuffer memory used (MB) | >95% (OOM risk) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (Celsius) | >85°C (thermal throttling) |
| DCGM_FI_DEV_POWER_USAGE | GPU power consumption (Watts) | Near TDP (cooling concern) |
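To make the thresholds in the table concrete, here is a minimal Python sketch that applies them to sampled values. The `GpuSample` class, the function names, and the alert labels are hypothetical and exist only for illustration. The sketch also assumes that total framebuffer memory can be approximated as used plus free, which ignores any memory the driver reserves.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuSample:
    """One scrape of a single GPU (hypothetical container for illustration)."""
    fb_used_mb: float   # value of DCGM_FI_DEV_FB_USED
    fb_free_mb: float   # value of DCGM_FI_DEV_FB_FREE
    temp_c: float       # value of DCGM_FI_DEV_GPU_TEMP


def snapshot_alerts(sample: GpuSample) -> List[str]:
    """Return the point-in-time alerts from the table that this sample trips."""
    alerts = []
    # Assumption: total memory is approximated as used + free.
    total = sample.fb_used_mb + sample.fb_free_mb
    if total > 0 and sample.fb_used_mb / total > 0.95:
        alerts.append("memory_pressure")
    if sample.temp_c > 85:
        alerts.append("high_temperature")
    return alerts


def is_idle(util_history: List[float],
            scrape_interval_s: int = 30,
            window_s: int = 30 * 60,
            threshold_pct: float = 10.0) -> bool:
    """True if utilization stayed below the threshold for the whole window.

    util_history holds DCGM_FI_DEV_GPU_UTIL samples, oldest first.
    """
    needed = window_s // scrape_interval_s
    if len(util_history) < needed:
        return False  # not enough history to cover the window yet
    return all(u < threshold_pct for u in util_history[-needed:])
```

The idle check needs a full window of history because a single low reading says nothing about a 30-minute trend. In a real deployment this logic would normally live in Prometheus alerting rules rather than application code.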
Recording Rules for ML Metrics
Recording rules pre-compute frequently queried expressions, reducing dashboard load time:
YAML
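groups:
  - name: ml-infrastructure
    rules:
      # Mean GPU utilization per node (assumes DCGM series carry a `node` label)
      - record: ml:gpu_utilization:avg_by_node
        expr: avg by (node) (DCGM_FI_DEV_GPU_UTIL)
      # Fraction of framebuffer memory in use: used / (used + free)
      - record: ml:gpu_memory_used_ratio
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
      # p99 inference latency; bucket rates are summed by `le` before taking the quantile
      - record: ml:inference_latency:p99
        expr: histogram_quantile(0.99, sum by (le) (rate(inference_request_duration_seconds_bucket[5m])))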
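The p99 latency rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following Python sketch reproduces that idea for intuition only. It is a simplified approximation, not Prometheus's actual implementation, and the bucket values used in the note afterwards are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total                # how many observations lie at or below the quantile
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the open-ended bucket: report the highest finite bound.
        return buckets[-2][0]
    # Lower edge and cumulative count of the preceding bucket (0 for the first one).
    lower, prev_count = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
```

With hypothetical buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, the 0.99 quantile lands in the 0.5 to 1.0 second bucket and interpolates to roughly 0.95 seconds. This is why bucket boundaries matter: the estimate can never be more precise than the bucket it falls into.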
Custom ML Application Metrics
Instrument your ML applications to expose custom Prometheus metrics:
- Training metrics — Expose loss, accuracy, learning rate, and epoch progress as Prometheus gauges
- Serving metrics — Use histograms for inference latency and counters for request counts
- Data pipeline metrics — Track records processed, pipeline duration, and data quality scores
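Whatever client library an application uses, Prometheus ultimately scrapes plain text in its exposition format. The sketch below uses only the Python standard library to show what that payload looks like for a training gauge and a serving counter. The function `render_metric` and the metric and label names are hypothetical. In practice a Prometheus client library would generate this output and serve it over HTTP, and label values would need escaping, which this sketch omits.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]


def render_metric(name: str, mtype: str, help_text: str,
                  samples: List[Sample]) -> str:
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        if labels:
            # Labels are rendered as key="value" pairs inside braces.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A gauge can go up or down, which suits loss and learning rate.
    print(render_metric("training_loss", "gauge", "Current training loss",
                        [({"job_name": "resnet"}, 0.42)]), end="")
    # A counter only increases, which suits request counts.
    print(render_metric("inference_requests_total", "counter",
                        "Total inference requests", [({}, 1027)]), end="")
```

Latency histograms follow the same format but emit one `_bucket` series per boundary plus `_sum` and `_count` series, which is what the `inference_request_duration_seconds_bucket` metric in the recording rules refers to.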
Performance Tip: Set the scrape interval to 15-30 seconds for GPU metrics. Shorter intervals multiply the number of samples Prometheus has to store without providing more actionable insight. For training loss metrics, a 60-second interval is enough because loss changes slowly.
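The storage effect of the scrape interval is simple arithmetic: each series stores one sample per scrape for the whole retention period. The following sketch works this out for a hypothetical node with 8 GPUs and 4 metrics per GPU (32 series), using the 30-day retention configured earlier. The function name and the series count are illustrative assumptions.

```python
SECONDS_PER_DAY = 86_400


def samples_stored(num_series: int, scrape_interval_s: int,
                   retention_days: int) -> int:
    """Number of samples kept for num_series over the retention period."""
    scrapes_per_day = SECONDS_PER_DAY // scrape_interval_s
    return num_series * scrapes_per_day * retention_days


if __name__ == "__main__":
    series = 8 * 4  # hypothetical: 8 GPUs, 4 DCGM metrics each
    for interval in (5, 15, 30, 60):
        total = samples_stored(series, interval, retention_days=30)
        print(f"{interval:>2}s interval: {total:,} samples over 30 days")
```

Dropping from a 15-second to a 5-second interval triples the stored samples for the same series, while moving a slow-changing metric to 60 seconds cuts them to a quarter.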
Ready to Build Grafana Dashboards?
The next lesson covers creating Grafana dashboards to visualize the Prometheus metrics you are now collecting.