Alerting (Intermediate)

Intelligent alerting is one of the most impactful ways to apply AI to network monitoring. This lesson covers how to design alerting systems that use dynamic baselines, composite conditions, SLO-based targets, and ML-based prioritization.

Alert Design Principles

Alert on symptoms, not causes: Users care about latency, not CPU. Alert on user-facing metrics first.
Every alert must be actionable: If there is no action to take, it should not be an alert.
Include context: Every alert should state what happened, what is affected, and suggested next steps.
Severity must reflect business impact: Not all problems are equal. Prioritize by service impact.
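As a minimal sketch of the "actionable" and "include context" principles, an alert can be modeled as a record that carries the three pieces of context and is rejected when it has no next step. The `Alert` class, its field names, and the `render` helper are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    what_happened: str   # the user-facing symptom
    affected: list       # services or sites impacted
    next_steps: list     # suggested actions for the responder
    severity: str = "warning"

    def is_actionable(self):
        # An alert with no suggested action should not be sent as an alert.
        return len(self.next_steps) > 0


def render(alert):
    """Format an alert as plain text, refusing alerts that lack an action."""
    if not alert.is_actionable():
        raise ValueError("alert has no next steps; log it instead of alerting")
    return (
        f"[{alert.severity.upper()}] {alert.what_happened}\n"
        f"Affected: {', '.join(alert.affected)}\n"
        f"Next steps: {'; '.join(alert.next_steps)}"
    )


example = Alert(
    what_happened="WAN latency to branch sites above 200 ms",
    affected=["branch-vpn"],
    next_steps=["Check the primary uplink", "Fail over to the backup circuit"],
    severity="critical",
)
message = render(example)
```

Rejecting alerts without next steps at construction or render time is one way to enforce the rule early; a real system might route such events to a log or dashboard instead.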

Dynamic Baseline Alerts

Type	How It Works	Example
Seasonal Baseline	ML learns hourly, daily, and weekly patterns	Alert if traffic is 3 standard deviations (3 sigma) below normal for this time of day
Peer Comparison	Compares a device against similar devices	Alert if one core router has 2x the error rate of its peers
Trend-Based	Detects changes in a metric's trend	Alert if the memory usage growth rate has doubled over the past week
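The first two rows of the table can be sketched with plain statistics. This is a simplified stand-in for a learned model, not an ML implementation: the baseline is just the mean and standard deviation per time slot, and peers are compared against their median. The function names, the hour-of-week bucketing, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def build_baseline(samples):
    """Group (hour_of_week, value) samples and return {hour: (mean, stdev)}."""
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    # stdev needs at least two points, so sparse buckets are skipped.
    return {h: (mean(v), stdev(v)) for h, v in buckets.items() if len(v) >= 2}


def below_baseline(baseline, hour, value, sigmas=3.0):
    """True if value is more than `sigmas` standard deviations below the mean."""
    if hour not in baseline:
        return False  # no history for this time slot, so do not alert
    mu, sd = baseline[hour]
    if sd == 0:
        return value < mu
    return (mu - value) / sd > sigmas


def peer_outliers(error_rates, ratio=2.0):
    """Return devices whose error rate is at least `ratio` times the peer median."""
    flagged = []
    for device, rate in error_rates.items():
        peers = [r for d, r in error_rates.items() if d != device]
        if peers and median(peers) > 0 and rate >= ratio * median(peers):
            flagged.append(device)
    return sorted(flagged)


# Hypothetical traffic samples for hour 9 of the week.
history = [(9, v) for v in (100, 102, 98, 101, 99)]
baseline = build_baseline(history)
```

With this history, `below_baseline(baseline, 9, 90)` flags the sample while `below_baseline(baseline, 9, 97)` does not, because 97 is within three standard deviations of the mean for that hour.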

Composite Alerts

Combine multiple conditions into a single, high-confidence alert:

AND conditions: High latency AND packet loss AND an error rate increase together indicate link degradation
Correlation: Alert only when an anomaly is detected in both the metric and its related log events
Absence detection: Alert when expected data stops arriving (for example, because a device is unreachable)
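The AND condition and absence detection described above could look roughly like this. The `LinkSample` type, the threshold values, and the five-minute silence window are illustrative assumptions, not recommended settings.

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    latency_ms: float
    packet_loss_pct: float
    error_rate_delta: float  # change in error rate versus the previous interval


def link_degraded(sample, max_latency_ms=150.0, max_loss_pct=1.0):
    """Fire only when all three symptoms are present at the same time."""
    return (
        sample.latency_ms > max_latency_ms
        and sample.packet_loss_pct > max_loss_pct
        and sample.error_rate_delta > 0
    )


def is_silent(last_seen_ts, now_ts, max_gap_s=300.0):
    """Absence detection: True when no data has arrived for longer than max_gap_s."""
    return now_ts - last_seen_ts > max_gap_s


bad = LinkSample(latency_ms=220.0, packet_loss_pct=2.5, error_rate_delta=0.4)
slow_only = LinkSample(latency_ms=220.0, packet_loss_pct=0.0, error_rate_delta=0.0)
```

Here `link_degraded(bad)` is true, while `link_degraded(slow_only)` is false because high latency alone does not satisfy the composite condition.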

SLO-Based Alerting

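Instead of alerting on individual metrics, define Service Level Objectives (SLOs) and alert when the error budget is being consumed too quickly. The error budget is the amount of unreliability the SLO allows: a 99.95% availability target leaves a budget of 0.05%. For example: "WAN service must maintain 99.95% availability. Alert when the error budget burn rate exceeds 2x normal."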
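The burn-rate arithmetic can be worked through in a few lines. This sketch assumes the common convention that a burn rate of 1 means the budget would be used up exactly at the end of the SLO period, and it reads "2x normal" as a burn rate above 2; both are interpretations rather than something the example statement spells out. The function names and the minute-based inputs are hypothetical.

```python
def burn_rate(bad_minutes, total_minutes, slo=0.9995):
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    budget = 1.0 - slo                      # 0.05% for a 99.95% target
    observed = bad_minutes / total_minutes  # fraction of the window that was bad
    return observed / budget


def should_alert(bad_minutes, total_minutes, slo=0.9995, threshold=2.0):
    """True when the budget is burning faster than `threshold` times the allowed rate."""
    return burn_rate(bad_minutes, total_minutes, slo) > threshold


# Three bad minutes in one day (1440 minutes) against a 99.95% target.
rate = burn_rate(3, 1440)
```

Three bad minutes in a day is an error ratio of about 0.21%, a little over four times the 0.05% budget, so `should_alert(3, 1440)` fires.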

On-Call Optimization

AI can optimize the on-call experience by:

Routing alerts to the engineer most likely to resolve them quickly
Auto-resolving alerts when the condition clears within a grace period
Bundling non-urgent alerts into a digest for the next business day
Providing predicted time-to-resolution based on historical incidents
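Two of the items above, auto-resolution within a grace period and digest bundling, reduce to a simple routing decision. The sketch below is rule-based rather than ML-driven, and the `route_alert` function, the severity labels, and the 120-second grace period are assumptions for illustration.

```python
def route_alert(severity, seconds_until_cleared, grace_period_s=120):
    """Return 'auto_resolve', 'digest', or 'page' for one alert.

    seconds_until_cleared is None when the condition has not cleared.
    """
    # The condition cleared by itself inside the grace period: nobody is paged.
    if seconds_until_cleared is not None and seconds_until_cleared <= grace_period_s:
        return "auto_resolve"
    # Non-urgent alerts are held for the next business day's digest.
    if severity in ("info", "low"):
        return "digest"
    return "page"
```

For example, a critical alert that clears after 30 seconds is auto-resolved, while a low-severity alert that stays open goes to the digest.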

The Alert Audit: Review every alert that fired in the past month. If operators consistently ignore or immediately close an alert type, tune it, downgrade it, or eliminate it. Aim for an actionability rate of 90% or higher, meaning at least nine in ten alerts lead to an operator taking action.
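An audit like this can start from a simple tally per alert type. The sketch assumes each fired alert has been exported as an `(alert_type, was_actioned)` pair; that record shape, the function names, and the sample data are hypothetical.

```python
from collections import Counter


def actionability_by_type(events):
    """Return {alert_type: fraction of fired alerts that an operator acted on}."""
    fired = Counter()
    acted = Counter()
    for alert_type, was_actioned in events:
        fired[alert_type] += 1
        if was_actioned:
            acted[alert_type] += 1
    return {t: acted[t] / fired[t] for t in fired}


def needs_tuning(rates, target=0.9):
    """Alert types below the target rate are candidates to tune, downgrade, or remove."""
    return sorted(t for t, r in rates.items() if r < target)


# Hypothetical month of alerts: (type, did an operator take action?)
events = [
    ("cpu_high", False), ("cpu_high", False), ("cpu_high", True), ("cpu_high", False),
    ("link_down", True), ("link_down", True),
]
rates = actionability_by_type(events)
```

In this sample, `cpu_high` was acted on once in four firings, so `needs_tuning(rates)` lists it for review, while `link_down` passes the 90% target.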

Next Step

Learn the best practices for designing your overall AI monitoring strategy.
