Grafana for AI Infrastructure (Intermediate)

Grafana transforms raw Prometheus metrics into actionable dashboards. For AI infrastructure, you need dashboards that show GPU cluster health at a glance, drill down into individual training jobs, track model serving SLOs, and help with capacity planning. This lesson covers building these dashboards from scratch.

GPU Cluster Overview Dashboard

The first dashboard every ML platform needs shows the overall health of the GPU cluster:

Total GPUs vs Active GPUs — Stat panel showing cluster-wide GPU allocation
Average GPU Utilization — Gauge panel with color thresholds (red <20%, yellow 20-60%, green >60%)
GPU Memory Usage Heatmap — Heatmap showing memory usage distribution across all GPUs
GPU Temperature — Time series with alert threshold lines
Per-Node GPU Status — Table showing each node's GPU count, utilization, and health
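
The panels above could be backed by queries along these lines. This is a minimal sketch, assuming the NVIDIA DCGM exporter with its default metric set and a `Hostname` label on each series. Label names vary with exporter version and scrape configuration, and treating a GPU as "active" when its utilization is above zero is an assumption made here for illustration.

```promql
# Total GPUs: one DCGM_FI_DEV_GPU_UTIL series exists per GPU
count(DCGM_FI_DEV_GPU_UTIL)

# Active GPUs (assumption: "active" means utilization above 0%)
# "or vector(0)" keeps the stat panel at 0 instead of "No data" when nothing is active
count(DCGM_FI_DEV_GPU_UTIL > 0) or vector(0)

# Average GPU utilization for the gauge panel (percent)
avg(DCGM_FI_DEV_GPU_UTIL)

# Approximate framebuffer memory usage per GPU (percent), for the heatmap
100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

# GPU temperature in degrees Celsius, for the time series panel
DCGM_FI_DEV_GPU_TEMP

# Per-node table: GPU count and average utilization (assumes a Hostname label)
count by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
```

The color thresholds and alert lines themselves are set in each panel's options rather than in the query.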

Training Job Dashboard

Track the progress and resource usage of individual training jobs:

PromQL

# Training loss over time (filtered by job name)
training_loss{job_name=~"$job_name"}

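# GPU utilization per training job
# group_left must name the label as it appears on kube_pod_labels (label_job_name);
# matching on namespace and pod avoids collisions between same-named pods
avg by (pod) (DCGM_FI_DEV_GPU_UTIL * on(namespace, pod) group_left(label_job_name) kube_pod_labels{label_job_name=~"$job_name"})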

# Training throughput (samples per second)
rate(training_samples_total{job_name=~"$job_name"}[5m])

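# Estimated time remaining, in seconds
# deriv() is used because current_epoch is a gauge; rate() is intended for counters
(total_epochs{job_name=~"$job_name"} - current_epoch{job_name=~"$job_name"}) / deriv(current_epoch{job_name=~"$job_name"}[1h])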

Model Serving Dashboard

Monitor inference endpoints with these key panels:

Request rate — Requests per second by model and version
Latency percentiles — p50, p95, p99 inference latency with SLO target lines
Error rate — 4xx and 5xx responses as a percentage of total requests
Batch utilization — How full inference batches are (impacts GPU efficiency)
Model version distribution — Traffic split during canary deployments
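
As a rough sketch of how some of these panels might be queried: the metric names `inference_requests_total` and `inference_request_duration_seconds_bucket`, and the labels `model`, `version`, and `status`, are hypothetical and stand in for whatever your serving framework exports. The latency query assumes the duration metric is a Prometheus histogram.

```promql
# Request rate by model and version (requests per second)
sum by (model, version) (rate(inference_requests_total{model=~"$model_name"}[5m]))

# p95 latency in seconds; change 0.95 to 0.5 or 0.99 for the other percentiles
histogram_quantile(0.95, sum by (le, model) (rate(inference_request_duration_seconds_bucket{model=~"$model_name"}[5m])))

# Error rate: 4xx and 5xx responses as a percentage of all requests
100 * sum(rate(inference_requests_total{model=~"$model_name", status=~"[45].."}[5m]))
  / sum(rate(inference_requests_total{model=~"$model_name"}[5m]))

# Traffic share per model version (fraction of total), for canary rollouts
sum by (version) (rate(inference_requests_total{model=~"$model_name"}[5m]))
  / ignoring(version) group_left sum(rate(inference_requests_total{model=~"$model_name"}[5m]))
```

Aggregating by `le` before calling `histogram_quantile` matters: the function needs the bucket boundaries to interpolate a percentile, so dropping `le` in the `sum` would produce no result.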

Dashboard Variables and Templating

Use Grafana template variables to make dashboards dynamic and reusable across teams and environments:

$cluster — Select which Kubernetes cluster to view
$namespace — Filter by team or project namespace
$gpu_node — Drill down to a specific GPU node
$model_name — Filter serving metrics by model name
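
One possible way to define and use these variables with a Prometheus data source is sketched below. It assumes every series carries a `cluster` label (for example, added as an external label), that GPU nodes are identified by the DCGM exporter's `Hostname` label, and that kube-state-metrics is installed. The `inference_requests_total` metric is hypothetical.

```promql
# Variable queries (entered in the dashboard's variable settings, not in a panel):
#   $cluster     label_values(DCGM_FI_DEV_GPU_UTIL, cluster)
#   $gpu_node    label_values(DCGM_FI_DEV_GPU_UTIL{cluster="$cluster"}, Hostname)
#   $namespace   label_values(kube_pod_info{cluster="$cluster"}, namespace)
#   $model_name  label_values(inference_requests_total, model)

# Panel query that uses the variables; =~ lets multi-value and "All" selections work
avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL{cluster="$cluster", Hostname=~"$gpu_node"})
```

Chaining `$gpu_node` and `$namespace` off `$cluster` means the drop-downs only offer values that exist in the selected cluster.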

Design Tip: Place the most critical panels (GPU utilization, error rates, SLO compliance) at the top of the dashboard. Use collapsible rows to group the detailed drill-down panels below them, so on-call engineers see problems first without scrolling.

Ready to Configure Alerts?

The next lesson covers designing alerting rules that catch real problems without generating excessive noise.
