Advanced AI Dashboards
Beyond basic GPU and serving dashboards, ML platforms need advanced visualizations for capacity planning, cost tracking, experiment comparison, and executive reporting. This lesson covers designing these dashboards with Grafana using data from Prometheus, Thanos, and custom data sources.
Capacity Planning Dashboard
Predict when you will run out of GPU capacity and plan procurement accordingly:
- GPU utilization trend — 30/60/90-day trend with linear regression forecast
- Queue depth over time — How many training jobs are waiting for GPU resources
- GPU allocation by team — Pie chart showing which teams consume the most GPU resources
- Projected exhaustion date — Stat panel showing when current capacity will be fully utilized based on growth trends
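The trend and exhaustion-date panels above boil down to fitting a line through historical utilization and solving for the day it reaches the ceiling. The following is a minimal sketch of that arithmetic, assuming one average-utilization sample per day; the function names `fit_line` and `projected_exhaustion` and the sample values are hypothetical, and a real dashboard would more likely compute this in the query layer or a scheduled job.

```python
import math
from datetime import date, timedelta


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) samples of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def projected_exhaustion(start, daily_util, ceiling=100.0):
    """Date on which the fitted trend reaches `ceiling` percent.

    `start` is the date of the first sample and `daily_util` holds one
    utilization percentage per day. Returns None when the trend is flat
    or falling, since no exhaustion date exists in that case.
    """
    days = list(range(len(daily_util)))
    slope, intercept = fit_line(days, daily_util)
    if slope <= 0:
        return None
    days_until_full = (ceiling - intercept) / slope
    return start + timedelta(days=math.ceil(days_until_full))


if __name__ == "__main__":
    # Hypothetical 30 days of utilization growing by half a point per day.
    history = [60 + 0.5 * day for day in range(30)]
    print(projected_exhaustion(date(2024, 1, 1), history))
```

A straight-line fit is the simplest possible model; seasonal or step-change growth (for example, a new team onboarding) would need a different approach, so treat the projected date as a planning signal rather than a deadline.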
Multi-Cluster Overview
For organizations running ML workloads across multiple clusters or regions, a unified view is essential:
    # Total GPUs per cluster (via Thanos)
    sum by (cluster) (count by (cluster, gpu) (DCGM_FI_DEV_GPU_UTIL))

    # Cross-cluster GPU utilization comparison
    avg by (cluster) (DCGM_FI_DEV_GPU_UTIL)

    # Active training jobs per cluster
    # (kube_job_status_active counts running pods, so match any value above zero)
    count by (cluster) (kube_job_status_active{job_name=~".*train.*"} > 0)
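If the per-cluster numbers are also needed outside Grafana (for a report or an alerting script), the same queries can be sent to the Prometheus-compatible HTTP API that Thanos Query exposes. The sketch below only covers the parsing step: it turns the JSON body of an instant query into a `{cluster: value}` mapping. The function name `values_by_cluster`, the cluster names, and the sample numbers are all hypothetical.

```python
import json


def values_by_cluster(response_body):
    """Map each `cluster` label to its sample value.

    Expects the JSON body of an instant query (/api/v1/query) whose
    expression aggregates `by (cluster)`.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got %s" % data["resultType"])

    result = {}
    for sample in data["result"]:
        cluster = sample["metric"].get("cluster", "unknown")
        _timestamp, raw_value = sample["value"]  # value arrives as a string
        result[cluster] = float(raw_value)
    return result


# Hypothetical response for: avg by (cluster) (DCGM_FI_DEV_GPU_UTIL)
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster": "us-east"}, "value": [1700000000, "73.5"]},
            {"metric": {"cluster": "eu-west"}, "value": [1700000000, "41.0"]},
        ],
    },
})

if __name__ == "__main__":
    print(values_by_cluster(SAMPLE_RESPONSE))
```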
Executive Summary Dashboard
Create a high-level dashboard for leadership that focuses on business outcomes:
| Panel | Metric | Purpose |
|---|---|---|
| Total GPU Fleet | Total GPUs, utilization percentage | Asset utilization visibility |
| Models in Production | Count of active serving endpoints | ML maturity indicator |
| Monthly GPU Cost | GPU-hours × cost per GPU-hour | Budget tracking |
| Inference Availability | Uptime percentage across all models | Reliability indicator |
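Two of the panels in the table are simple arithmetic over aggregated inputs. As a rough illustration, the sketch below computes monthly GPU cost from GPU-hours and fleet-wide availability from per-model uptime. The function names, the hourly rate, and the uptime figures are hypothetical, and measuring availability as minutes up over minutes observed is an assumption; a request-success ratio would be an equally valid definition.

```python
def monthly_gpu_cost(gpu_hours, cost_per_gpu_hour):
    """Monthly GPU cost = GPU-hours consumed x cost per GPU-hour."""
    return gpu_hours * cost_per_gpu_hour


def fleet_availability_pct(uptime_by_model):
    """Uptime percentage across all models.

    `uptime_by_model` maps a model name to (minutes_up, minutes_observed).
    Minutes are summed across models before dividing, so long-running
    models weigh more than recently deployed ones.
    """
    total_up = sum(up for up, _observed in uptime_by_model.values())
    total_observed = sum(observed for _up, observed in uptime_by_model.values())
    if total_observed == 0:
        return None  # nothing observed yet, so availability is undefined
    return 100.0 * total_up / total_observed


if __name__ == "__main__":
    # Hypothetical fleet: 8 GPUs busy for a 30-day month at 2.50 per GPU-hour.
    print(monthly_gpu_cost(gpu_hours=8 * 720, cost_per_gpu_hour=2.50))
    # Hypothetical uptime for two serving endpoints.
    print(fleet_availability_pct({
        "ranker": (990, 1000),
        "embedder": (1000, 1000),
    }))
```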
Dashboard as Code
Store Grafana dashboards in Git using JSON models or Grafonnet (a Jsonnet library for Grafana). This enables version control, code review, and automated deployment of dashboard changes through your GitOps pipeline.
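As an alternative illustration of the same idea (this is plain Python, not Grafonnet), a dashboard's JSON model can be generated by a small script and committed to Git. The sketch below builds two stat panels and serializes them with sorted keys so diffs stay stable in code review. The dashboard title, `uid`, panel layout, and helper names are hypothetical, and the model is deliberately minimal: a real dashboard would also set a data source and other options.

```python
import json


def stat_panel(panel_id, title, expr, x, y, w=6, h=4):
    """Return a minimal Grafana stat panel with one Prometheus query."""
    return {
        "id": panel_id,
        "type": "stat",
        "title": title,
        "gridPos": {"x": x, "y": y, "w": w, "h": h},
        "targets": [{"refId": "A", "expr": expr}],
    }


def build_dashboard():
    """Assemble a small executive-summary dashboard model."""
    panels = [
        stat_panel(1, "Total GPU Fleet", "count(DCGM_FI_DEV_GPU_UTIL)", x=0, y=0),
        stat_panel(2, "Fleet GPU Utilization", "avg(DCGM_FI_DEV_GPU_UTIL)", x=6, y=0),
    ]
    return {
        "uid": "ml-exec-summary",  # hypothetical identifier
        "title": "ML Executive Summary",
        "tags": ["ml", "executive"],
        "time": {"from": "now-30d", "to": "now"},
        "panels": panels,
    }


def render(dashboard):
    """Serialize with sorted keys and fixed indentation for stable Git diffs."""
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    print(render(build_dashboard()), end="")
```

Writing the rendered output to a file in the repository lets the usual pull-request review apply to dashboard changes before any deployment step picks them up.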
Ready for Best Practices?
The final lesson covers production monitoring patterns including high availability, long-term storage, and monitoring-as-code.