Alerts for AI Infrastructure (Intermediate)
Effective alerting is the difference between catching a failing training job in minutes and discovering wasted GPU-hours days later. This lesson covers designing alerting rules for ML infrastructure, routing alerts to the right teams, and avoiding alert fatigue through well-chosen thresholds and grouping.
Critical GPU Alerts
YAML
groups:
  - name: gpu-alerts
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature above 85C on {{ $labels.gpu }}"
      - alert: GPUIdle
        expr: DCGM_FI_DEV_GPU_UTIL < 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} idle for 30+ minutes"
      - alert: GPUMemoryExhausted
        expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
        for: 5m
        labels:
          severity: critical
Training Job Alerts
- Training stall — Alert when loss has not decreased for N epochs, indicating a stuck training run
- Loss explosion — Alert when loss exceeds a threshold, indicating numerical instability
- Job failure — Alert when a training job enters a failed state or pods restart repeatedly
- Checkpoint failure — Alert when checkpointing fails, risking loss of training progress
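The four conditions above could be expressed as Prometheus rules along the following lines. This is a sketch under stated assumptions, not a drop-in config: `training_loss` and `training_last_checkpoint_timestamp_seconds` are hypothetical metrics that your training code would need to export, the `job_name` label on them is also assumed, and all thresholds and windows are placeholders. Prometheus has no notion of epochs, so the stall rule approximates "N epochs" with a fixed time window. The two `kube_*` metrics assume kube-state-metrics is running in the cluster.

```yaml
groups:
  - name: training-job-alerts
    rules:
      # Stall: the best loss seen in the last hour is no better than
      # the best loss seen in the hour before that.
      - alert: TrainingStalled
        expr: |
          min_over_time(training_loss[1h])
            >= min_over_time(training_loss[1h] offset 1h)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Loss for {{ $labels.job_name }} has not improved in over an hour"

      # Explosion: loss above a placeholder threshold; pick a value
      # that is clearly abnormal for your model.
      - alert: TrainingLossExplosion
        expr: training_loss > 100
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Loss for {{ $labels.job_name }} exceeded the configured threshold"

      # Failure: the Kubernetes Job reports at least one failed pod.
      - alert: TrainingJobFailed
        expr: kube_job_status_failed > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Job {{ $labels.job_name }} in {{ $labels.namespace }} has failed pods"

      # Failure: a container restarted more than 3 times in 15 minutes.
      - alert: TrainingPodRestarting
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting repeatedly"

      # Checkpoint: no successful checkpoint recorded in the last hour.
      - alert: CheckpointStale
        expr: time() - training_last_checkpoint_timestamp_seconds > 3600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No checkpoint for {{ $labels.job_name }} in over an hour"
```

In practice the kube-state-metrics rules would likely need a namespace or label filter so that they only match training workloads.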
Inference SLO Alerts
Use SLO-based alerting for model serving to focus on user-visible impact:
YAML
- alert: InferenceLatencySLOBreach
  expr: |
    histogram_quantile(0.99,
      rate(inference_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "p99 inference latency exceeds 500ms SLO"
- alert: InferenceErrorRateHigh
  expr: |
    rate(inference_errors_total[5m])
      / rate(inference_requests_total[5m]) > 0.01
  for: 5m
  labels:
    severity: critical
Alert Routing with Alertmanager
Route alerts to the right team based on severity and ownership:
- GPU hardware alerts → Infrastructure team via PagerDuty
- Training job failures → ML engineering team via Slack
- Serving SLO breaches → On-call SRE via PagerDuty
- Cost alerts → ML platform team via email
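The routing above might look roughly like this in an Alertmanager configuration. It is a minimal sketch that assumes each alerting rule carries a `team` label (the rules earlier in this lesson only set `severity`, so you would need to add it). Receiver names, the Slack channel, the email addresses, and the SMTP host are illustrative, and the values in angle brackets are placeholders for your own credentials.

```yaml
global:
  # Required for email receivers; values here are placeholders.
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"

route:
  # Fallback receiver for anything no child route matches.
  receiver: ml-platform-email
  # Group related alerts into one notification to reduce noise.
  group_by: [alertname, team]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - team="infrastructure"
      receiver: infra-pagerduty
    - matchers:
        - team="ml-engineering"
      receiver: ml-eng-slack
    - matchers:
        - team="sre"
        - severity="critical"
      receiver: sre-pagerduty
    - matchers:
        - team="ml-platform"
      receiver: ml-platform-email

receivers:
  - name: infra-pagerduty
    pagerduty_configs:
      - routing_key: "<infra-pagerduty-integration-key>"
  - name: ml-eng-slack
    slack_configs:
      - api_url: "<slack-webhook-url>"
        channel: "#ml-training-alerts"
  - name: sre-pagerduty
    pagerduty_configs:
      - routing_key: "<sre-pagerduty-integration-key>"
  - name: ml-platform-email
    email_configs:
      - to: "ml-platform@example.com"
```

The `group_by`, `group_wait`, and `repeat_interval` settings are what keep a cluster-wide incident from producing one page per GPU; the values shown are starting points, not recommendations.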
Avoiding Alert Fatigue: Start with a small set of high-signal alerts and expand gradually. Every alert should have a clear runbook describing what to do when it fires. If an alert fires frequently without requiring action, tune the threshold or remove it.
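One common way to attach a runbook to an alert is a `runbook_url` annotation, which notification templates can then surface to the responder. This is a convention rather than a required field, and the URL below is hypothetical. The rule itself is the `GPUHighTemperature` alert from earlier in this lesson.

```yaml
- alert: GPUHighTemperature
  expr: DCGM_FI_DEV_GPU_TEMP > 85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "GPU temperature above 85C on {{ $labels.gpu }}"
    # Hypothetical URL; point this at wherever your team keeps runbooks.
    runbook_url: "https://runbooks.example.com/gpu-high-temperature"
```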
Ready for Advanced Dashboards?
The next lesson covers advanced dashboard design patterns for multi-cluster views, capacity planning, and executive reporting.