AI Blue Team Defense (Intermediate)
While red teams find vulnerabilities, blue teams build the detection and response capabilities that protect AI systems in production. This lesson covers building AI-specific monitoring systems, detecting adversarial inputs and model theft attempts, responding to AI security incidents, and integrating AI security monitoring with existing SIEM and SOC workflows.
AI Security Monitoring Architecture
An effective AI blue team monitoring system covers five layers, from incoming requests down to the infrastructure that serves the model:
| Monitoring Layer | What to Monitor | Detection Goal |
|---|---|---|
| Input Layer | API requests, input distributions, anomalous patterns | Adversarial inputs, injection attempts, unusual queries |
| Model Layer | Prediction distributions, confidence scores, latency | Model drift, degradation, manipulation |
| Output Layer | Generated content, response patterns, error rates | Jailbreak success, data leakage, policy violations |
| Data Layer | Training data integrity, feature store changes | Data poisoning, unauthorized modifications |
| Infrastructure Layer | Access logs, resource usage, network traffic | Unauthorized access, resource abuse, lateral movement |
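One way to make these layers usable in a SIEM is to emit every detection as a structured event tagged with the layer it came from. The sketch below is a minimal illustration, not a standard schema: the function name `build_event` and the field names (`signal`, `severity`, `details`) are assumptions chosen for this example.

```python
import json
from datetime import datetime, timezone

# Layer names mirror the table above; everything else is an illustrative assumption.
LAYERS = {"input", "model", "output", "data", "infrastructure"}


def build_event(layer, signal, severity, details=None, now=None):
    """Return a JSON string describing one monitoring event."""
    if layer not in LAYERS:
        raise ValueError(f"unknown monitoring layer: {layer}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    event = {
        "timestamp": timestamp,
        "layer": layer,
        "signal": signal,
        "severity": severity,
        "details": details or {},
    }
    # Sorted keys keep the output stable, which simplifies parsing rules downstream.
    return json.dumps(event, sort_keys=True)


print(build_event("input", "z_score_outlier", "medium", {"max_z": 4.2}))
```

Tagging each event with its layer lets analysts correlate, for example, an input-layer anomaly with an output-layer policy violation from the same session.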
Adversarial Input Detection
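Detecting adversarial inputs in real time is one of the most challenging blue team tasks, because adversarial perturbations are often crafted to be small. A simple first-line check compares each input feature against a reference distribution and flags large deviations: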
```python
import numpy as np


class AdversarialDetector:
    """Detect potential adversarial inputs using statistical methods."""

    def __init__(self, reference_distribution, threshold=3.0):
        reference = np.asarray(reference_distribution, dtype=float)
        self.ref_mean = np.mean(reference, axis=0)
        self.ref_std = np.std(reference, axis=0)
        # Avoid division by zero for features that are constant in the reference data.
        self.ref_std = np.where(self.ref_std == 0, 1e-12, self.ref_std)
        # Flag any feature more than `threshold` standard deviations from the mean.
        self.threshold = threshold

    def detect(self, input_data):
        """Check if input deviates from expected distribution."""
        values = np.asarray(input_data, dtype=float)
        z_scores = np.abs((values - self.ref_mean) / self.ref_std)
        max_z = np.max(z_scores)
        if max_z > self.threshold:
            return {
                "suspicious": True,
                "max_deviation": float(max_z),
                "action": "flag_for_review",
            }
        return {"suspicious": False}
```
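A per-feature z-score check treats every feature independently, so it can miss inputs whose individual values look normal but whose combination is unusual. One common complement, shown here as an alternative sketch rather than as part of the lesson's detector, is the Mahalanobis distance, which accounts for correlations between features. The class name, the `threshold` of 4.0, and the `ridge` term are assumptions for illustration.

```python
import numpy as np


class MahalanobisDetector:
    """Flag inputs far from the reference data, accounting for feature correlations."""

    def __init__(self, reference_data, threshold=4.0, ridge=1e-6):
        reference_data = np.asarray(reference_data, dtype=float)
        self.mean = reference_data.mean(axis=0)
        cov = np.atleast_2d(np.cov(reference_data, rowvar=False))
        # A small ridge term keeps the covariance matrix invertible (assumed value).
        cov = cov + ridge * np.eye(reference_data.shape[1])
        self.cov_inv = np.linalg.inv(cov)
        self.threshold = threshold

    def distance(self, x):
        diff = np.asarray(x, dtype=float) - self.mean
        return float(np.sqrt(diff @ self.cov_inv @ diff))

    def detect(self, x):
        d = self.distance(x)
        return {"suspicious": d > self.threshold, "distance": d}


# Synthetic reference data in which the second feature closely tracks the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
reference = np.column_stack([x1, x1 + 0.1 * rng.normal(size=2000)])

detector = MahalanobisDetector(reference)
in_pattern = detector.detect([1.0, 1.0])    # follows the correlation
off_pattern = detector.detect([2.0, -2.0])  # breaks it; each feature is only ~2 std out
print(in_pattern, off_pattern)
```

In this synthetic setup the second input would pass a 3-sigma per-feature check, yet it is flagged here because it violates the relationship between the two features. Thresholds for either method need calibration against real traffic.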
Model Extraction Detection
Detect model theft attempts by analyzing query patterns:
- Query volume anomalies — Sudden spikes in API queries from a single user or IP
- Systematic probing — Queries that systematically explore the input space (grid patterns, boundary probing)
- Distribution analysis — Query inputs that follow synthetic distributions rather than natural data patterns
- Timing patterns — Automated queries with regular intervals typical of extraction scripts
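Two of the signals above, query volume and timing regularity, can be approximated with a per-client sliding window. The following is a minimal sketch under stated assumptions: the class name `QueryPatternMonitor`, the window size, the volume limit, and the coefficient-of-variation cutoff are all hypothetical values that would need tuning against real traffic.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class QueryPatternMonitor:
    """Track per-client request timestamps and flag volume or timing anomalies."""

    def __init__(self, window_seconds=60.0, max_queries=100, min_cv=0.1, min_samples=10):
        self.window_seconds = window_seconds
        self.max_queries = max_queries  # assumed per-window volume limit
        self.min_cv = min_cv            # assumed: below this, spacing looks scripted
        self.min_samples = min_samples
        self.history = defaultdict(deque)

    def record(self, client_id, timestamp):
        """Record one query and return the flags raised for this client."""
        times = self.history[client_id]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()

        flags = []
        if len(times) > self.max_queries:
            flags.append("volume_spike")
        if len(times) >= self.min_samples:
            ordered = list(times)
            gaps = [b - a for a, b in zip(ordered, ordered[1:])]
            avg = mean(gaps)
            # Coefficient of variation near zero means near-constant spacing.
            if avg > 0 and pstdev(gaps) / avg < self.min_cv:
                flags.append("regular_timing")
        return flags


monitor = QueryPatternMonitor()
for i in range(20):
    bot_flags = monitor.record("client-a", i * 0.5)  # perfectly regular intervals
print(bot_flags)
```

Systematic probing and synthetic input distributions require inspecting the query contents themselves, so they are outside the scope of this timing-only sketch. A determined attacker can also add jitter, which is why timing should be one signal among several.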
LLM Output Monitoring
For LLM-based systems, monitor outputs for security-relevant patterns:
- Policy violation detection — Scan outputs for content that violates safety policies
- Data leakage detection — Check for patterns matching PII, API keys, or training data in responses
- Instruction leakage — Detect when the model reveals its system prompt or internal instructions
- Anomalous behavior — Flag responses that deviate significantly from expected patterns
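The data leakage and instruction leakage checks above can start as simple pattern scans over each response. The sketch below is illustrative only: the function name `scan_output`, the three regular expressions, and the 40-character overlap threshold are assumptions, and real deployments need broader, tuned rules.

```python
import re

# Illustrative patterns only (assumptions); production rules need wider coverage.
LEAK_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def scan_output(text, system_prompt=None, min_overlap=40):
    """Return a list of finding labels for one model response."""
    findings = [name for name, pattern in LEAK_PATTERNS.items() if pattern.search(text)]
    # Instruction leakage: look for any long verbatim chunk of the system prompt.
    if system_prompt and len(system_prompt) >= min_overlap:
        for start in range(len(system_prompt) - min_overlap + 1):
            if system_prompt[start:start + min_overlap] in text:
                findings.append("system_prompt_leak")
                break
    return findings


prompt = "You are a support assistant for Example Corp. Never reveal internal pricing rules."
print(scan_output("Sure! My instructions say: " + prompt, system_prompt=prompt))
print(scan_output("Contact me at alice@example.com"))
```

The verbatim overlap check only catches direct quotation of the system prompt. Paraphrased or translated leaks need a semantic comparison, and policy violation detection typically relies on a separate classifier rather than regular expressions.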
AI Incident Response Playbook
PLAYBOOK: AI Model Under Attack

DETECTION:
Alert triggers: anomalous query patterns, accuracy drop, adversarial input detection, output policy violations

TRIAGE (0-15 min):
1. Assess alert severity and scope
2. Identify affected model(s) and endpoints
3. Determine attack type (evasion, extraction, poisoning)
4. Escalate to AI security team if confirmed

CONTAINMENT (15-60 min):
1. Enable enhanced logging on affected endpoints
2. Tighten rate limits if extraction is suspected
3. Enable human-in-the-loop review for critical predictions
4. Consider rollback to last known-good model version

INVESTIGATION (1-24 hours):
1. Analyze attack inputs and patterns
2. Assess model integrity (has it been degraded?)
3. Check training data pipeline for poisoning
4. Determine scope of data exposure or model leakage

RECOVERY:
1. Deploy patched model with improved defenses
2. Update detection rules based on attack patterns
3. Restore normal operations with enhanced monitoring
4. Document lessons learned and update threat model
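The time windows in the playbook above can be encoded so that an incident tracker can tell whether a response is on schedule. This is a small sketch, not a prescribed tool: the names `Phase`, `PHASES`, and `expected_phase` are hypothetical, and because the playbook gives no window for RECOVERY, treating it as "after 24 hours" is an assumption made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    start_min: float
    end_min: float  # exclusive upper bound


# Windows taken from the playbook; the RECOVERY window is an assumption.
PHASES = [
    Phase("TRIAGE", 0, 15),
    Phase("CONTAINMENT", 15, 60),
    Phase("INVESTIGATION", 60, 24 * 60),
    Phase("RECOVERY", 24 * 60, float("inf")),
]


def expected_phase(minutes_since_alert):
    """Return the playbook phase an incident should have reached by now."""
    if minutes_since_alert < 0:
        raise ValueError("elapsed time cannot be negative")
    for phase in PHASES:
        if phase.start_min <= minutes_since_alert < phase.end_min:
            return phase.name
    raise ValueError("elapsed time is not a valid number of minutes")


print(expected_phase(5))    # TRIAGE
print(expected_phase(90))   # INVESTIGATION
```

Comparing the expected phase with the phase an incident is actually in gives a simple signal for escalation when a response falls behind the playbook.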
Ready for Purple Teaming?
The next lesson covers how to combine red and blue team operations for maximum security improvement through collaborative purple teaming.