Alerting (Intermediate)

Intelligent alerting is one of the most impactful ways to apply AI to network monitoring. This lesson covers how to design alerting systems that use dynamic baselines, composite conditions, SLO-based targets, and ML-based prioritization.

Alert Design Principles

  • Alert on symptoms, not causes — Users care about latency, not CPU. Alert on user-facing metrics first
  • Every alert must be actionable — If there is no action to take, it should not be an alert
  • Include context — Every alert should include what happened, what is affected, and suggested next steps
  • Severity must reflect business impact — Not all problems are equal. Prioritize by service impact

Dynamic Baseline Alerts

Type | How It Works | Example
Seasonal Baseline | ML learns hourly/daily/weekly patterns | Alert if traffic is 3 sigma below normal for this time of day
Peer Comparison | Compare device against similar devices | Alert if one core router has 2x the error rate of its peers
Trend-Based | Detect changing trends | Alert if memory usage growth rate doubled this week
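
The first two rows of the table can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the function names, the hour-of-week keying of `history`, and the use of the peer median as the comparison point are all assumptions made for the example.

```python
from statistics import mean, median, stdev


def seasonal_anomaly(history, hour_of_week, value, sigmas=3.0):
    """Return True if `value` is more than `sigmas` standard deviations
    below the historical mean for the same hour-of-week slot.

    `history` maps an hour-of-week index (0..167) to past samples
    (assumed layout for this sketch).
    """
    samples = history.get(hour_of_week, [])
    if len(samples) < 2:
        return False  # not enough data to estimate spread
    mu = mean(samples)
    sd = stdev(samples)
    if sd == 0:
        return value < mu  # flat history: any drop is a deviation
    return (mu - value) / sd > sigmas


def peer_outlier(error_rates, device, factor=2.0):
    """Return True if `device` has more than `factor` times the median
    error rate of its peers (median is an assumed choice of 'peer norm')."""
    peers = [rate for name, rate in error_rates.items() if name != device]
    if not peers:
        return False
    return error_rates[device] > factor * median(peers)
```

For instance, with past traffic of `[100, 102, 98, 101, 99]` for a given slot, a reading of 90 would be flagged while a reading of 99 would not.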

Composite Alerts

Combine multiple conditions into a single, high-confidence alert:

  • AND conditions — High latency AND packet loss AND error rate increase = link degradation
  • Correlation — Alert only when anomaly is detected on both the metric and related log events
  • Absence detection — Alert when expected data stops arriving (device unreachable)
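
As a rough sketch of the AND and absence conditions above, the following Python evaluates one link sample against three thresholds and checks for silence. The `LinkSample` fields and every threshold value are hypothetical and would need tuning per link.

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    latency_ms: float
    packet_loss_pct: float
    error_rate_delta: float  # change in errors/sec versus the previous window


# Hypothetical thresholds chosen only for illustration.
LATENCY_MS = 150.0
LOSS_PCT = 1.0
ERROR_DELTA = 0.0


def link_degraded(sample: LinkSample) -> bool:
    """Composite AND: fire only when all three symptoms are present."""
    return (
        sample.latency_ms > LATENCY_MS
        and sample.packet_loss_pct > LOSS_PCT
        and sample.error_rate_delta > ERROR_DELTA
    )


def data_absent(last_seen_ts: float, now: float, max_silence_s: float = 300.0) -> bool:
    """Absence detection: fire when no data has arrived for `max_silence_s`."""
    return now - last_seen_ts > max_silence_s
```

Requiring all three conditions trades some sensitivity for confidence: high latency alone will not page anyone, but the combination is much less likely to be a false positive.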

SLO-Based Alerting

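Instead of alerting on individual metrics, define Service Level Objectives and alert when error budgets are being consumed too quickly. For example: "WAN service must maintain 99.95% availability. Alert when error budget burn rate exceeds 2x normal." Over a 30-day window, 99.95% availability leaves an error budget of about 21.6 minutes of downtime. A burn rate of 1x is usually taken to mean the pace that would use up that budget exactly at the end of the window, so 2x would exhaust it in half the time.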
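
A burn-rate check can be sketched as follows. It assumes availability is measured as the fraction of successful probes or requests in a recent window, and that a burn rate of 1.0 means the budget is being consumed at exactly the sustainable pace; the function names are illustrative.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic in the window, nothing burned
    error_budget = 1.0 - slo  # e.g. 0.0005 for a 99.95% SLO
    return (failed / total) / error_budget


def should_alert(failed: int, total: int, slo: float = 0.9995, threshold: float = 2.0) -> bool:
    """Fire when the budget is burning faster than `threshold` times the sustainable pace."""
    return burn_rate(failed, total, slo) > threshold
```

With a 99.95% SLO, 20 failures out of 10,000 probes is an error rate of 0.2%, four times the 0.05% the budget allows, so the alert would fire.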

On-Call Optimization

AI can optimize the on-call experience by:

  • Routing alerts to the engineer most likely to resolve them quickly
  • Auto-resolving alerts when the condition clears within a grace period
  • Bundling non-urgent alerts into a digest for the next business day
  • Providing predicted time-to-resolution based on historical incidents
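
The auto-resolve and digest ideas can be sketched without any ML at all. In this illustration the `Alert` fields, the severity labels, and the 120-second grace period are assumptions, and holding even critical alerts until the grace period passes is a design choice you may not want in practice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    severity: str                      # "critical" or anything else (assumed labels)
    fired_at: float                    # seconds since epoch
    cleared_at: Optional[float] = None


def triage(alerts, now, grace_s=120.0):
    """Sort alert names into page / digest / auto_resolved / pending buckets."""
    buckets = {"page": [], "digest": [], "auto_resolved": [], "pending": []}
    for a in alerts:
        if a.cleared_at is not None and a.cleared_at - a.fired_at <= grace_s:
            buckets["auto_resolved"].append(a.name)  # cleared within the grace period
        elif now - a.fired_at < grace_s:
            buckets["pending"].append(a.name)        # still inside the grace period
        elif a.severity == "critical":
            buckets["page"].append(a.name)           # page the on-call engineer
        else:
            buckets["digest"].append(a.name)         # hold for the next-business-day digest
    return buckets
```
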

The Alert Audit: Review every alert that fired in the past month. If operators consistently ignore or immediately close an alert type, it should be tuned, downgraded, or eliminated. Aim for an actionability rate of 90% or higher.
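
One way to run such an audit is to compute an actionability rate per alert type from last month's alert log. The sketch below assumes the log can be reduced to `(alert_type, was_acted_on)` pairs and applies the 90% target per type, which is an interpretation rather than something the lesson prescribes.

```python
from collections import defaultdict


def actionability_by_type(alert_log):
    """Return {alert_type: fraction of firings that led to an action}."""
    fired = defaultdict(int)
    acted = defaultdict(int)
    for alert_type, was_acted_on in alert_log:
        fired[alert_type] += 1
        if was_acted_on:
            acted[alert_type] += 1
    return {t: acted[t] / fired[t] for t in fired}


def needs_review(alert_log, target=0.90):
    """Alert types below the target rate: candidates to tune, downgrade, or remove."""
    rates = actionability_by_type(alert_log)
    return sorted(t for t, rate in rates.items() if rate < target)
```

An alert type that was acted on 9 times out of 10 would meet the target, while one acted on once in 5 firings would be listed for review.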

Next Step

Learn the best practices for designing your overall AI monitoring strategy.