Prometheus for AI Infrastructure (Intermediate)

Prometheus is the de facto standard for metrics collection in Kubernetes environments. For AI infrastructure, Prometheus collects GPU metrics via DCGM Exporter, system metrics via node-exporter, and custom ML metrics via application-level instrumentation. This lesson covers deploying and configuring Prometheus specifically for ML workloads.

Deploying Prometheus with kube-prometheus-stack

Bash
# Install kube-prometheus-stack via Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi
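
The same settings can also live in a values file, which is easier to review and version than a chain of --set flags. The following is a minimal sketch: the filename values-monitoring.yaml and the accessModes entry are assumptions that do not appear in the command above, and a storageClassName may also be needed depending on the cluster.

```yaml
# values-monitoring.yaml (hypothetical filename)
prometheus:
  prometheusSpec:
    retention: 30d                         # keep 30 days of samples
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # assumption: single-writer volume
          resources:
            requests:
              storage: 100Gi               # same size as the --set flag
```

It would be passed to the install command with -f values-monitoring.yaml in place of the two --set flags.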

GPU Metrics with DCGM Exporter

The NVIDIA DCGM Exporter exposes GPU metrics in Prometheus format. Key metrics include:

| Metric | Description | Alert Threshold |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) | <10% for >30 min (idle GPU) |
| DCGM_FI_DEV_FB_USED | GPU framebuffer memory used (MiB) | >95% of total framebuffer memory (OOM risk) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) | >85°C (thermal throttling) |
| DCGM_FI_DEV_POWER_USAGE | GPU power consumption (W) | Near TDP (cooling concern) |
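
The thresholds above translate fairly directly into Prometheus alerting rules. The sketch below is one possible encoding, not a definitive rule set: the group name, alert names, severity labels, and the 5m durations are assumptions, and total framebuffer memory is approximated as used plus free. No power rule is included because TDP depends on the GPU model.

```yaml
groups:
  - name: gpu-alerts                # hypothetical group name
    rules:
      - alert: GPUIdle              # hypothetical alert name
        expr: DCGM_FI_DEV_GPU_UTIL < 10
        for: 30m                    # idle for more than 30 minutes
        labels:
          severity: warning
        annotations:
          summary: "GPU below 10% utilization for 30 minutes"
      - alert: GPUMemoryNearFull
        # used / (used + free) approximates the fraction of framebuffer in use
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m                     # assumption: the table gives no duration
        labels:
          severity: critical
        annotations:
          summary: "GPU framebuffer above 95% (OOM risk)"
      - alert: GPUHot
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m                     # assumption: the table gives no duration
        labels:
          severity: critical
        annotations:
          summary: "GPU above 85°C (thermal throttling risk)"
```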

Recording Rules for ML Metrics

Recording rules pre-compute frequently queried expressions, reducing dashboard load time:

YAML
groups:
  - name: ml-infrastructure
    rules:
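      # Assumes a `node` label has been attached via relabeling;
      # DCGM Exporter's own host label is `Hostname`.
      - record: ml:gpu_utilization:avg_by_node
        expr: avg by (node) (DCGM_FI_DEV_GPU_UTIL)
      # Fraction of framebuffer memory in use: used / (used + free)
      - record: ml:gpu_memory_used_ratio
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
      # Computed per series; wrap the rate in `sum by (le) (...)` to aggregate across replicas
      - record: ml:inference_latency:p99
        expr: histogram_quantile(0.99, rate(inference_request_duration_seconds_bucket[5m]))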
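
With kube-prometheus-stack, rule groups are usually delivered to Prometheus as a PrometheusRule resource rather than as a file on disk. The following is a minimal sketch that wraps one of the rules above. It assumes the Helm release is named monitoring, as in the install command, and that the chart's default rule selector matches the release label. The resource name is hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-infrastructure-rules     # hypothetical resource name
  namespace: monitoring
  labels:
    release: monitoring             # assumption: matches the chart's default rule selector
spec:
  groups:
    - name: ml-infrastructure
      rules:
        - record: ml:gpu_utilization:avg_by_node
          expr: avg by (node) (DCGM_FI_DEV_GPU_UTIL)
```

Applying it with kubectl apply -f should cause the operator to load the group into Prometheus, provided the selector labels match.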

Custom ML Application Metrics

Instrument your ML applications to expose custom Prometheus metrics:

  • Training metrics — Expose loss, accuracy, learning rate, and epoch progress as Prometheus gauges
  • Serving metrics — Use histograms for inference latency and counters for request counts
  • Data pipeline metrics — Track records processed, pipeline duration, and data quality scores

Performance Tip: Set the scrape interval to 15-30 seconds for GPU metrics. Shorter intervals multiply the number of stored samples without providing actionable insight. For training loss metrics, a 60-second interval is enough because loss changes slowly.
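
With the Prometheus Operator that kube-prometheus-stack installs, scrape intervals are set per target on ServiceMonitor and PodMonitor resources. The sketch below shows one way to apply the two intervals. The resource names, the selector labels, and the port name metrics are all assumptions that depend on how DCGM Exporter and the training pods are deployed.

```yaml
# GPU metrics: scrape DCGM Exporter every 30 seconds
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter               # hypothetical resource name
  namespace: monitoring
  labels:
    release: monitoring             # assumption: matches the chart's default selector
spec:
  namespaceSelector:
    any: true                       # look for the Service in any namespace
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter   # assumption: label on the exporter Service
  endpoints:
    - port: metrics                 # assumption: name of the Service port
      interval: 30s
---
# Training metrics: scrape training pods every 60 seconds
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: training-jobs               # hypothetical resource name
  namespace: monitoring
  labels:
    release: monitoring
spec:
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app: training-job             # hypothetical pod label
  podMetricsEndpoints:
    - port: metrics                 # assumption: name of the container port
      interval: 60s
```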

Ready to Build Grafana Dashboards?

The next lesson covers creating Grafana dashboards to visualize the Prometheus metrics you are now collecting.
