Advanced AI Dashboards

Beyond basic GPU and serving dashboards, ML platforms need advanced visualizations for capacity planning, cost tracking, experiment comparison, and executive reporting. This lesson covers designing these dashboards with Grafana using data from Prometheus, Thanos, and custom data sources.

Capacity Planning Dashboard

Predict when you will run out of GPU capacity and plan procurement accordingly:

  • GPU utilization trend — 30/60/90-day trend with linear regression forecast
  • Queue depth over time — How many training jobs are waiting for GPU resources
  • GPU allocation by team — Pie chart showing which teams consume the most GPU resources
  • Projected exhaustion date — Stat panel showing when current capacity will be fully utilized based on growth trends
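
The panels above could be backed by queries along these lines. This is a sketch under stated assumptions, not a definitive setup: it assumes kube-state-metrics is scraped alongside the DCGM exporter, that pending pods in namespaces matching the hypothetical pattern `ml-.*` approximate queued training jobs, that each namespace maps to one team, and that Prometheus or Thanos retains at least 30 days of data.

```promql
# GPU utilization forecast: fit a linear regression over the last 30 days
# (sampled hourly via a subquery) and extrapolate 30 days ahead.
predict_linear(avg(DCGM_FI_DEV_GPU_UTIL)[30d:1h], 30 * 86400)

# Queue depth: pods stuck in Pending, used as a proxy for jobs waiting on GPUs.
# The namespace pattern "ml-.*" is a hypothetical naming convention.
sum(kube_pod_status_phase{phase="Pending", namespace=~"ml-.*"})

# GPU allocation by team: GPUs requested per namespace
# (assumes one namespace per team).
sum by (namespace) (kube_pod_container_resource_requests{resource="nvidia_com_gpu"})

# Projected exhaustion: days until the fitted trend reaches 100% utilization.
# Returns no data when the trend is flat or falling.
(100 - avg(DCGM_FI_DEV_GPU_UTIL))
  / (deriv(avg(DCGM_FI_DEV_GPU_UTIL)[30d:1h]) > 0)
  / 86400
```

The last query suits a Stat panel with a days unit. Utilization is only a proxy for capacity here; if your scheduler exposes requested versus allocatable GPUs, the same `deriv` pattern can be applied to that ratio instead.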

Multi-Cluster Overview

For organizations running ML workloads across multiple clusters or regions, a unified view is essential:

PromQL
# Total GPUs per cluster (via Thanos); assumes one series per physical GPU
count by (cluster) (DCGM_FI_DEV_GPU_UTIL)

# Cross-cluster GPU utilization comparison
avg by (cluster) (DCGM_FI_DEV_GPU_UTIL)

# Active training jobs per cluster
count by (cluster) (kube_job_status_active{job_name=~".*train.*"} > 0)

Executive Summary Dashboard

Create a high-level dashboard for leadership that focuses on business outcomes:

| Panel | Metric | Purpose |
|---|---|---|
| Total GPU Fleet | Total GPUs, utilization percentage | Asset utilization visibility |
| Models in Production | Count of active serving endpoints | ML maturity indicator |
| Monthly GPU Cost | GPU-hours × cost per GPU-hour | Budget tracking |
| Inference Availability | Uptime percentage across all models | Reliability indicator |
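
As a rough sketch, the four panels might be driven by queries like the ones below. The `job="model-serving"` label and the rate of 2.50 per GPU-hour are hypothetical placeholders; substitute your own scrape job name and blended cost. The queries also assume that each serving endpoint is a separate scrape target and that 30 days of data are retained.

```promql
# Total GPU Fleet: number of GPUs reporting metrics
count(DCGM_FI_DEV_GPU_UTIL)

# Total GPU Fleet: average utilization (%)
avg(DCGM_FI_DEV_GPU_UTIL)

# Models in Production: serving endpoints currently up
count(up{job="model-serving"} == 1)

# Monthly GPU Cost: GPU-hours over 30 days x hourly rate.
# The hourly subquery yields one sample per GPU per hour it was reporting,
# so the summed count approximates fleet GPU-hours.
sum(count_over_time(DCGM_FI_DEV_GPU_UTIL[30d:1h])) * 2.50

# Inference Availability: mean fraction of time endpoints were up, as a percentage
avg(avg_over_time(up{job="model-serving"}[30d])) * 100
```

Note that `up` only reflects whether a target could be scraped. If request-level success metrics are available from your serving stack, a ratio of successful to total requests is usually a more faithful availability figure for leadership.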

Dashboard as Code

Store Grafana dashboards in Git using JSON models or Grafonnet (a Jsonnet library for Grafana). This enables version control, code review, and automated deployment of dashboard changes through your GitOps pipeline.

Pro Tip: Use Grafana dashboard links to connect related dashboards. A cluster overview panel should link to the node detail dashboard; a training job row should link to the experiment tracking dashboard. This creates a natural drill-down workflow.

