Advanced AI Dashboards

Beyond basic GPU and serving dashboards, ML platforms need advanced visualizations for capacity planning, cost tracking, experiment comparison, and executive reporting. This lesson covers designing these dashboards with Grafana using data from Prometheus, Thanos, and custom data sources.

Capacity Planning Dashboard

Predict when you will run out of GPU capacity and plan procurement accordingly:

GPU utilization trend — 30/60/90-day trend with linear regression forecast
Queue depth over time — How many training jobs are waiting for GPU resources
GPU allocation by team — Pie chart showing which teams consume the most GPU resources
Projected exhaustion date — Stat panel showing when current capacity will be fully utilized based on growth trends
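
As a rough illustration, the panels above could be backed by queries like the following. This is a sketch under stated assumptions, not a definitive implementation: it reuses the DCGM exporter metric DCGM_FI_DEV_GPU_UTIL (a percentage from 0 to 100) from elsewhere in this lesson, assumes kube-state-metrics is installed and GPUs are exposed as the nvidia.com/gpu extended resource (label resource="nvidia_com_gpu"), assumes one namespace per team, and assumes the long-term store retains at least 30 days of data.

```promql
# Fleet-wide GPU utilization averaged over the last 30 days
# (repeat with [60d] and [90d] for the longer trend panels)
avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[30d]))

# Linear-regression forecast: expected fleet utilization 30 days from now,
# fitted to hourly samples from the last 30 days
predict_linear(avg(DCGM_FI_DEV_GPU_UTIL)[30d:1h], 30 * 86400)

# Queue depth: pending pods that request at least one GPU
# (assumption: waiting training jobs show up as Pending pods)
count(
  (kube_pod_status_phase{phase="Pending"} == 1)
  and on (namespace, pod)
  (kube_pod_container_resource_requests{resource="nvidia_com_gpu"} > 0)
) or vector(0)

# GPU allocation by team: GPUs requested per namespace
# (assumption: namespace == team; includes pods in any phase)
sum by (namespace) (kube_pod_container_resource_requests{resource="nvidia_com_gpu"})

# Projected exhaustion: seconds until the fitted trend reaches 100% utilization;
# returns no data when the trend is flat or falling
(100 - predict_linear(avg(DCGM_FI_DEV_GPU_UTIL)[30d:1h], 0))
  / (deriv(avg(DCGM_FI_DEV_GPU_UTIL)[30d:1h]) > 0)
```

Utilization reaching 100% is only a proxy for exhausted capacity. If allocation rather than utilization is the limiting factor, the same predict_linear pattern can be applied to the requested-GPU query instead.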

Multi-Cluster Overview

For organizations running ML workloads across multiple clusters or regions, a unified view is essential:

PromQL

# GPU count per cluster (via Thanos); assumes one series per physical GPU
count by (cluster) (DCGM_FI_DEV_GPU_UTIL)

# Cross-cluster GPU utilization comparison
avg by (cluster) (DCGM_FI_DEV_GPU_UTIL)

# Active training jobs per cluster (kube_job_status_active reports the number of running pods per Job)
count by (cluster) (kube_job_status_active{job_name=~".*train.*"} > 0)

Executive Summary Dashboard

Create a high-level dashboard for leadership that focuses on business outcomes:

Panel	Metric	Purpose
Total GPU Fleet	Total GPUs, utilization percentage	Asset utilization visibility
Models in Production	Count of active serving endpoints	ML maturity indicator
Monthly GPU Cost	GPU-hours × cost per hour	Budget tracking
Inference Availability	Uptime percentage across all models	Reliability indicator
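
The panels in the table could be driven by queries along these lines. This is a minimal sketch with several assumptions: the job label "model-serving" is hypothetical and assumes one Prometheus scrape target per serving endpoint, and the hourly rate of 2.50 is a placeholder, not a real price. Substitute your own labels and rates.

```promql
# Total GPU Fleet: number of GPUs and average utilization (percent)
count(DCGM_FI_DEV_GPU_UTIL)
avg(DCGM_FI_DEV_GPU_UTIL)

# Models in Production: serving endpoints currently up
# (assumption: hypothetical job label, one scrape target per endpoint)
count(up{job="model-serving"} == 1)

# Monthly GPU Cost: provisioned GPU-hours over 30 days × hourly rate
# (each hourly sample of a GPU series counts as one GPU-hour; 2.50 is a placeholder)
sum(count_over_time(DCGM_FI_DEV_GPU_UTIL[30d:1h])) * 2.50

# Inference Availability: 30-day uptime percentage across all endpoints
avg(avg_over_time(up{job="model-serving"}[30d])) * 100
```

The cost query counts every hour a GPU was present, whether busy or idle, which matches how reserved or owned hardware is typically paid for. For usage-based chargeback, counting only the hours in which GPUs were allocated to pods would be a closer fit.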

Dashboard as Code

Store Grafana dashboards in Git using JSON models or Grafonnet (a Jsonnet library for Grafana). This enables version control, code review, and automated deployment of dashboard changes through your GitOps pipeline.

Pro Tip: Use Grafana dashboard links to connect related dashboards. A cluster overview panel should link to the node detail dashboard; a training job row should link to the experiment tracking dashboard. This creates a natural drill-down workflow.

Ready for Best Practices?

The final lesson covers production monitoring patterns including high availability, long-term storage, and monitoring-as-code.
