Alerts for AI Infrastructure (Intermediate)

Effective alerting is the difference between catching a failing training job in minutes and discovering the wasted GPU-hours days later. This lesson covers designing alerting rules for ML infrastructure, routing alerts to the right teams, and avoiding alert fatigue through well-chosen thresholds and grouping.

Critical GPU Alerts

YAML
groups:
  - name: gpu-alerts
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature above 85C on {{ $labels.gpu }}"

      - alert: GPUIdle
        expr: DCGM_FI_DEV_GPU_UTIL < 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} idle for 30+ minutes"

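      - alert: GPUMemoryExhausted
        # Fraction of framebuffer memory in use: used / (used + free)
        expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU {{ $labels.gpu }} framebuffer memory above 95% for 5+ minutes"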

Training Job Alerts

  • Training stall — Alert when loss has not decreased for N epochs, indicating a stuck training run
  • Loss explosion — Alert when loss exceeds a threshold, indicating numerical instability
  • Job failure — Alert when a training job enters a failed state or pods restart repeatedly
  • Checkpoint failure — Alert when checkpointing fails, risking loss of training progress
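
The four conditions above could be expressed as Prometheus rules along the following lines. This is a sketch under stated assumptions, not a drop-in rule set: `training_loss` and `training_last_checkpoint_timestamp_seconds` are hypothetical metrics that the training job would have to export itself, the `training` namespace and all thresholds are placeholders, and wall-clock windows stand in for "N epochs" because Prometheus has no notion of an epoch. `kube_job_status_failed` and `kube_pod_container_status_restarts_total` come from kube-state-metrics and exist only if it is deployed.

```yaml
groups:
  - name: training-alerts
    rules:
      # Stall: the best loss in the last 30m is no better than in the 30m before.
      # training_loss is a hypothetical gauge exported by the training job.
      - alert: TrainingLossStalled
        expr: min_over_time(training_loss[30m]) >= min_over_time(training_loss[30m] offset 30m)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Training loss has stopped improving"

      # Explosion: loss above a placeholder threshold; tune per model.
      - alert: TrainingLossExploded
        expr: training_loss > 100
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Training loss exceeded 100, possible numerical instability"

      # Job failure: a Kubernetes Job reports failed pods (kube-state-metrics).
      - alert: TrainingJobFailed
        expr: kube_job_status_failed{namespace="training"} > 0
        labels:
          severity: critical
        annotations:
          summary: "Training job {{ $labels.job_name }} has failed pods"

      # Repeated restarts: more than 3 container restarts in 30 minutes.
      - alert: TrainingPodRestarting
        expr: increase(kube_pod_container_status_restarts_total{namespace="training"}[30m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting repeatedly"

      # Checkpoint failure, detected indirectly: no successful checkpoint for 1h.
      # training_last_checkpoint_timestamp_seconds is a hypothetical gauge.
      - alert: TrainingCheckpointStale
        expr: time() - training_last_checkpoint_timestamp_seconds > 3600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No successful checkpoint in the last hour"
```

The checkpoint rule watches for staleness rather than an explicit failure event, so it also catches a job that silently stops checkpointing. A file like this can be validated before deployment with `promtool check rules <file>`.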

Inference SLO Alerts

Use SLO-based alerting for model serving to focus on user-visible impact:

YAML
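- alert: InferenceLatencySLOBreach
  # Sum buckets across instances (keeping le) so the p99 is service-wide
  expr: |
    histogram_quantile(0.99, sum by (le) (rate(inference_duration_seconds_bucket[5m]))) > 0.5
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "p99 inference latency exceeds 500ms SLO"

- alert: InferenceErrorRateHigh
  # Sum both sides so mismatched label sets cannot silently empty the ratio
  expr: |
    sum(rate(inference_errors_total[5m])) / sum(rate(inference_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Inference error rate above 1% over the last 5 minutes"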

Alert Routing with Alertmanager

Route alerts to the right team based on severity and ownership:

  • GPU hardware alerts → Infrastructure team via PagerDuty
  • Training job failures → ML engineering team via Slack
  • Serving SLO breaches → On-call SRE via PagerDuty
  • Cost alerts → ML platform team via email

Avoiding Alert Fatigue: Start with a small set of high-signal alerts and expand gradually. Every alert should have a clear runbook describing what to do when it fires. If an alert fires frequently without requiring action, tune the threshold or remove it.
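
The routing list and the fatigue guidance above might translate into an Alertmanager configuration like the sketch below. It assumes every rule carries a `team` label; the rules in this lesson only set `severity`, so that label would need to be added. All receiver names, the Slack channel, addresses, URLs, and keys are placeholders, and the inhibition rule presumes that some alerts exist in both a warning and a critical tier under the same name.

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'    # placeholder mail relay
  smtp_from: 'alertmanager@example.com'     # placeholder sender

route:
  receiver: ml-platform-email               # default route; cost alerts land here
  group_by: ['alertname', 'team']           # one notification per alert type and team
  group_wait: 30s                           # collect related alerts before first send
  group_interval: 5m                        # batch new alerts joining an open group
  repeat_interval: 4h                       # re-notify for unresolved alerts
  routes:
    - matchers: ['team="infrastructure"']   # GPU hardware alerts
      receiver: infra-pagerduty
    - matchers: ['team="ml-engineering"']   # training job failures
      receiver: ml-eng-slack
    - matchers: ['team="sre"']              # serving SLO breaches
      receiver: sre-pagerduty

receivers:
  - name: infra-pagerduty
    pagerduty_configs:
      - routing_key: '<infra-integration-key>'
  - name: sre-pagerduty
    pagerduty_configs:
      - routing_key: '<sre-integration-key>'
  - name: ml-eng-slack
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#ml-training-alerts'
  - name: ml-platform-email
    email_configs:
      - to: 'ml-platform@example.com'

inhibit_rules:
  # While the critical tier of an alert fires, mute its warning tier
  # for the same alert name and instance.
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['alertname', 'instance']
```

Grouping and inhibition reduce notification volume without hiding alerts: grouped alerts still appear in the notification body, and inhibited ones remain visible in the Alertmanager UI. The file can be checked with `amtool check-config <file>` before it is loaded.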

Ready for Advanced Dashboards?

The next lesson covers advanced dashboard design patterns for multi-cluster views, capacity planning, and executive reporting.
