Alerting (Intermediate)
Intelligent alerting is one of the most impactful ways to apply AI to network monitoring. This lesson covers how to design alerting systems that use dynamic baselines, composite conditions, SLO-based targets, and ML-based prioritization.
Alert Design Principles
- Alert on symptoms, not causes — Users care about latency, not CPU. Alert on user-facing metrics first
- Every alert must be actionable — If there is no action to take, it should not be an alert
- Include context — Every alert should include what happened, what is affected, and suggested next steps
- Severity must reflect business impact — Not all problems are equal. Prioritize by service impact
Dynamic Baseline Alerts
| Type | How It Works | Example |
|---|---|---|
| Seasonal Baseline | ML learns hourly/daily/weekly patterns | Alert if traffic is more than 3 standard deviations (3 sigma) below the learned norm for this time of day |
| Peer Comparison | Compare device against similar devices | Alert if one core router has 2x the error rate of its peers |
| Trend-Based | Detects changes in how fast a metric is rising or falling | Alert if the memory usage growth rate has doubled this week |
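
The first two rows of the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain per-hour mean and standard deviation stands in for the learned seasonal model, the peer check compares each device against the median of the others, and the names `build_baseline`, `is_below_baseline`, and `peer_outliers` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def build_baseline(samples):
    """Group (hour_of_week, value) samples and return {hour: (mean, stdev)}."""
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    # stdev needs at least two points, so skip sparse buckets.
    return {h: (mean(v), stdev(v)) for h, v in buckets.items() if len(v) >= 2}


def is_below_baseline(baseline, hour, value, sigmas=3.0):
    """True if value is more than `sigmas` standard deviations below the norm."""
    if hour not in baseline:
        return False  # no history for this time slot, so do not alert
    mu, sd = baseline[hour]
    if sd == 0:
        return value < mu
    return (mu - value) / sd > sigmas


def peer_outliers(error_rates, factor=2.0):
    """Return devices whose error rate exceeds `factor` times the peer median."""
    outliers = []
    for device, rate in error_rates.items():
        peers = [r for d, r in error_rates.items() if d != device]
        if peers and rate > factor * median(peers):
            outliers.append(device)
    return outliers


# Hypothetical traffic history for hour 9 of the week (arbitrary units).
history = [(9, v) for v in (100, 102, 98, 101, 99)]
baseline = build_baseline(history)
print(is_below_baseline(baseline, 9, 90))
print(peer_outliers({"core-1": 0.010, "core-2": 0.012, "core-3": 0.050}))
```

A production system would normally replace the mean and standard deviation with a model that handles trend and holidays, but the alert condition keeps the same shape: compare the current value with what is expected for this time slot or this peer group.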
Composite Alerts
Combine multiple conditions into a single, high-confidence alert:
- AND conditions: high latency AND packet loss AND a rising error rate together indicate link degradation
- Correlation: alert only when an anomaly is detected in both the metric and the related log events
- Absence detection: alert when expected data stops arriving (for example, a device becomes unreachable)
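As a rough sketch of the AND and absence conditions above, the following Python combines three signals into one link-degradation check and flags a device that has gone quiet. The thresholds, the `LinkSample` fields, and the function names are illustrative assumptions, not values from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class LinkSample:
    latency_ms: float
    packet_loss_pct: float
    error_rate_delta: float  # change in errors/sec versus the previous interval


def link_degraded(sample, max_latency_ms=150.0, max_loss_pct=1.0):
    """Fire only when all three symptoms are present at the same time."""
    return (
        sample.latency_ms > max_latency_ms
        and sample.packet_loss_pct > max_loss_pct
        and sample.error_rate_delta > 0
    )


def is_silent(last_seen_ts, now_ts, expected_interval_s=60.0, missed_intervals=3):
    """Absence detection: True after several expected reports have been missed."""
    return now_ts - last_seen_ts > expected_interval_s * missed_intervals


print(link_degraded(LinkSample(latency_ms=200, packet_loss_pct=2.0, error_rate_delta=5)))
print(is_silent(last_seen_ts=0, now_ts=200))
```

Requiring several missed intervals before declaring a device silent is a design choice that trades a few minutes of detection delay for fewer false alarms from a single dropped poll.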
SLO-Based Alerting
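Instead of alerting on individual metrics, define Service Level Objectives (SLOs) and alert when the error budget is being consumed too quickly. For example: "WAN service must maintain 99.95% availability. Alert when error budget burn rate exceeds 2x normal." Here "normal" is most naturally read as a burn rate of 1, meaning the rate at which the budget would be used up exactly at the end of the SLO window.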
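To make the arithmetic concrete: a 99.95% availability target over a 30-day window leaves 0.05% of 43,200 minutes, or about 21.6 minutes of allowed downtime. The sketch below computes that budget and a burn rate, assuming a 30-day window and treating a burn rate of 1 as the baseline; the function names and the minute-level granularity are assumptions made for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed bad minutes in the window for a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(bad_minutes, observed_minutes, slo):
    """Observed error ratio divided by the ratio the SLO allows.

    A value of 1 means the budget runs out exactly at the end of the window;
    a value of 2 means it runs out in half the window.
    """
    observed_error_ratio = bad_minutes / observed_minutes
    return observed_error_ratio / (1 - slo)


SLO = 0.9995
print(round(error_budget_minutes(SLO), 1))  # about 21.6 minutes per 30 days

# Hypothetical last 24 hours (1,440 minutes) with 3 bad minutes.
rate = burn_rate(bad_minutes=3, observed_minutes=1440, slo=SLO)
print(rate > 2)  # True: the budget is burning more than twice as fast as allowed
```

In practice the burn rate is often evaluated over both a short and a long window so that a brief spike does not page anyone, while a sustained burn still does.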
On-Call Optimization
AI can optimize the on-call experience by:
- Routing alerts to the engineer most likely to resolve them quickly
- Auto-resolving alerts when the condition clears within a grace period
- Bundling non-urgent alerts into a digest for the next business day
- Providing predicted time-to-resolution based on historical incidents
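Before any ML is involved, the auto-resolve and digest behaviors can be expressed as simple rules, which also gives a baseline to compare a learned router against. The sketch below is rule-based, not AI, and assumes a hypothetical alert dictionary with `severity`, `fired_ts`, and `cleared_ts` keys and a 5-minute grace period.

```python
def triage(alert, grace_s=300):
    """Return 'auto_resolve', 'digest', or 'page' for one alert dict."""
    cleared = alert.get("cleared_ts")
    # Condition cleared on its own within the grace period: nobody is paged.
    if cleared is not None and cleared - alert["fired_ts"] <= grace_s:
        return "auto_resolve"
    # Non-urgent alerts are bundled for the next business day.
    if alert["severity"] == "low":
        return "digest"
    return "page"


alerts = [
    {"severity": "high", "fired_ts": 0, "cleared_ts": 120},
    {"severity": "low", "fired_ts": 0, "cleared_ts": None},
    {"severity": "high", "fired_ts": 0, "cleared_ts": None},
]
print([triage(a) for a in alerts])
```

A learned model would typically replace the fixed severity check and grace period with predictions drawn from historical incidents, while keeping the same three outcomes.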
The Alert Audit: Review every alert that fired in the past month. If operators consistently ignore or immediately close an alert type, it should be tuned, downgraded, or eliminated. Aim for an actionability rate of 90% or higher, meaning at least 9 in 10 alerts lead to a real action.
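One way to run such an audit is to compute the actionability rate per alert type from a month of alert history. The sketch below assumes the history is available as `(alert_type, was_actioned)` pairs; that record shape and the function names are hypothetical.

```python
from collections import Counter


def actionability_by_type(history):
    """Return {alert_type: fraction of firings that led to an action}."""
    fired = Counter()
    acted = Counter()
    for alert_type, was_actioned in history:
        fired[alert_type] += 1
        if was_actioned:
            acted[alert_type] += 1
    return {t: acted[t] / fired[t] for t in fired}


def needs_tuning(rates, target=0.9):
    """Alert types below the target are candidates to tune, downgrade, or remove."""
    return sorted(t for t, r in rates.items() if r < target)


# Hypothetical month of history: 10 link_down alerts and 5 high_cpu alerts.
history = (
    [("link_down", True)] * 9 + [("link_down", False)]
    + [("high_cpu", True)] + [("high_cpu", False)] * 4
)
rates = actionability_by_type(history)
print(needs_tuning(rates))
```

Breaking the rate down by type matters because a healthy overall average can hide one noisy alert that operators have learned to ignore.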
Next Step
Learn the best practices for designing your overall AI monitoring strategy.