Prometheus for AI Infrastructure Intermediate

Prometheus is the de facto standard for metrics collection in Kubernetes environments. For AI infrastructure, Prometheus collects GPU metrics via DCGM Exporter, system metrics via node-exporter, and custom ML metrics via application-level instrumentation. This lesson covers deploying and configuring Prometheus specifically for ML workloads.

Deploying Prometheus with kube-prometheus-stack

Bash

# Install kube-prometheus-stack via Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi

GPU Metrics with DCGM Exporter

The NVIDIA DCGM Exporter exposes GPU metrics in Prometheus format. Key metrics include:

Metric	Description	Alert Threshold
`DCGM_FI_DEV_GPU_UTIL`	GPU compute utilization percentage	<10% for >30min (idle GPU)
`DCGM_FI_DEV_FB_USED`	GPU framebuffer memory used (MiB)	>95% of total framebuffer (OOM risk)
`DCGM_FI_DEV_GPU_TEMP`	GPU temperature (Celsius)	>85°C (thermal throttling)
`DCGM_FI_DEV_POWER_USAGE`	GPU power consumption (Watts)	Near TDP (cooling concern)
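
The thresholds in the table can be turned into Prometheus alerting rules. The following is a minimal sketch rather than a tuned rule set: the alert names, severity labels, and the 5-minute `for` windows are assumptions chosen for illustration, and the `gpu` label in the annotations is assumed to be present on the DCGM series. Total framebuffer is approximated as used plus free. A power alert is left out because the TDP depends on the GPU model.

```yaml
groups:
  - name: gpu-alerts                 # hypothetical group name
    rules:
      # Idle GPU: utilization below 10% continuously for 30 minutes
      - alert: GPUIdle
        expr: DCGM_FI_DEV_GPU_UTIL < 10
        for: 30m
        labels:
          severity: info             # assumed severity scheme
        annotations:
          summary: "GPU {{ $labels.gpu }} has been under 10% utilization for 30 minutes"
      # OOM risk: more than 95% of framebuffer memory in use
      - alert: GPUMemoryNearlyFull
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m                      # assumed window to ignore short spikes
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} framebuffer is over 95% full"
      # Thermal throttling risk: temperature above 85 degrees Celsius
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m                      # assumed window
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} is above 85°C"
```

The `for` clause keeps an alert pending until the condition has held for the whole window, which is how the "for >30min" condition in the table is expressed.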

Recording Rules for ML Metrics

Recording rules pre-compute frequently queried expressions, reducing dashboard load time:

YAML

groups:
  - name: ml-infrastructure
    rules:
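      # Assumes DCGM series carry a `node` label (typically added by relabeling).
      - record: ml:gpu_utilization:avg_by_node
        expr: avg by (node) (DCGM_FI_DEV_GPU_UTIL)
      # Used / (used + free) gives the fraction of framebuffer in use.
      - record: ml:gpu_memory_used_ratio
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
      # Sum bucket rates by `le` so the quantile is computed across all series.
      - record: ml:inference_latency:p99
        expr: histogram_quantile(0.99, sum by (le) (rate(inference_request_duration_seconds_bucket[5m])))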
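
With kube-prometheus-stack, rule groups are usually delivered to Prometheus as `PrometheusRule` resources that the Prometheus Operator picks up. A minimal sketch of that wrapper follows, showing one rule from the group above. The resource name is hypothetical, and the `release: monitoring` label assumes the chart's default rule selector, which matches on the Helm release name used in the install command.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-infrastructure-rules      # hypothetical resource name
  namespace: monitoring
  labels:
    release: monitoring              # assumed to match the chart's rule selector
spec:
  groups:
    - name: ml-infrastructure
      rules:
        - record: ml:gpu_utilization:avg_by_node
          expr: avg by (node) (DCGM_FI_DEV_GPU_UTIL)
```

Apply it with `kubectl apply -f`. If the rules do not appear in the Prometheus UI, the selector labels are the first thing to check.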

Custom ML Application Metrics

Instrument your ML applications to expose custom Prometheus metrics:

Training metrics — Expose loss, accuracy, learning rate, and epoch progress as Prometheus gauges
Serving metrics — Use histograms for inference latency and counters for request counts
Data pipeline metrics — Track records processed, pipeline duration, and data quality scores

Performance Tip: Set the scrape interval to 15-30 seconds for GPU metrics. Shorter intervals multiply the number of samples Prometheus has to store without providing actionable insights. For training loss metrics, a 60-second interval is usually enough, since loss changes slowly.
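
Under the Prometheus Operator, scrape intervals are set per endpoint on `ServiceMonitor` and `PodMonitor` resources. The sketch below shows one way to apply the intervals from the tip above. The resource names, the `ml-training` namespace, the selector labels, and the `metrics` port name are all assumptions and need to match how the exporter and training pods are actually deployed.

```yaml
# GPU metrics: scrape the DCGM Exporter Service every 30 seconds
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter                # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring              # assumed to match the chart's monitor selector
spec:
  namespaceSelector:
    any: true                        # look for the Service in any namespace
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter   # assumed Service label
  endpoints:
    - port: metrics                  # assumed port name on the Service
      interval: 30s
---
# Training metrics: scrape training pods every 60 seconds
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: training-jobs                # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring
spec:
  namespaceSelector:
    matchNames:
      - ml-training                  # hypothetical namespace for training pods
  selector:
    matchLabels:
      app: trainer                   # hypothetical pod label
  podMetricsEndpoints:
    - port: metrics                  # assumed container port name
      interval: 60s
```

A `PodMonitor` is used for the training jobs because batch pods often have no Service in front of them.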

Ready to Build Grafana Dashboards?

The next lesson covers creating Grafana dashboards to visualize the Prometheus metrics you are now collecting.
