AI Blue Team Defense (Intermediate)

While red teams find vulnerabilities, blue teams build the detection and response capabilities that protect AI systems in production. This lesson covers building AI-specific monitoring systems, detecting adversarial inputs and model theft attempts, responding to AI security incidents, and integrating AI security monitoring with existing SIEM and SOC workflows.

AI Security Monitoring Architecture

An effective AI blue team monitoring system should cover multiple layers:

Monitoring Layer | What to Monitor | Detection Goal
Input Layer | API requests, input distributions, anomalous patterns | Adversarial inputs, injection attempts, unusual queries
Model Layer | Prediction distributions, confidence scores, latency | Model drift, degradation, manipulation
Output Layer | Generated content, response patterns, error rates | Jailbreak success, data leakage, policy violations
Data Layer | Training data integrity, feature store changes | Data poisoning, unauthorized modifications
Infrastructure Layer | Access logs, resource usage, network traffic | Unauthorized access, resource abuse, lateral movement
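
As one illustration of the model-layer row, the sketch below compares a recent window of confidence scores against a baseline window using the population stability index (PSI). The function name `population_stability_index`, the ten equal-width bins over [0, 1], the sample scores, and the 0.2 alert level are all assumptions chosen for illustration; 0.2 is a commonly used rule of thumb, not a value from this lesson.

```python
import math

def population_stability_index(baseline, recent, bins=10, eps=1e-6):
    """PSI between two non-empty samples of scores in [0, 1]; higher means more drift."""
    def bin_fractions(scores):
        counts = [0] * bins
        for s in scores:
            # Clamp the index so out-of-range scores and exactly 1.0 land in an edge bin.
            idx = max(0, min(int(s * bins), bins - 1))
            counts[idx] += 1
        total = len(scores)
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    p = bin_fractions(baseline)
    q = bin_fractions(recent)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Hypothetical confidence scores: a stable baseline and a degraded recent window.
baseline_conf = [0.91, 0.88, 0.95, 0.93, 0.87, 0.92, 0.90, 0.94]
recent_conf = [0.55, 0.61, 0.58, 0.92, 0.49, 0.63, 0.57, 0.60]

psi = population_stability_index(baseline_conf, recent_conf)
drift_alert = psi > 0.2  # assumed alert level, tune per model
```

In practice the windows would hold far more samples than shown here, and the alert level would be tuned against historical data for each model.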

Adversarial Input Detection

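Detecting adversarial inputs in real time is one of the most challenging blue team tasks. A simple baseline is a per-feature z-score check against a reference sample of normal inputs. It catches clearly out-of-distribution inputs, although small, carefully crafted perturbations can stay within normal ranges: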

Python
import numpy as np

class AdversarialDetector:
    """Detect potential adversarial inputs using statistical methods."""

    def __init__(self, reference_distribution, threshold=3.0, eps=1e-8):
        reference = np.asarray(reference_distribution, dtype=float)
        self.ref_mean = np.mean(reference, axis=0)
        # Floor the standard deviation so constant features do not divide by zero.
        self.ref_std = np.maximum(np.std(reference, axis=0), eps)
        self.threshold = threshold  # z-score above which an input is flagged

    def detect(self, input_data):
        """Check if input deviates from expected distribution."""
        values = np.asarray(input_data, dtype=float)
        z_scores = np.abs((values - self.ref_mean) / self.ref_std)
        max_z = np.max(z_scores)

        if max_z > self.threshold:
            return {
                "suspicious": True,
                "max_deviation": float(max_z),
                "action": "flag_for_review"
            }
        return {"suspicious": False}

Model Extraction Detection

Detect model theft attempts by analyzing query patterns:

  • Query volume anomalies — Sudden spikes in API queries from a single user or IP
  • Systematic probing — Queries that systematically explore the input space (grid patterns, boundary probing)
  • Distribution analysis — Query inputs that follow synthetic distributions rather than natural data patterns
  • Timing patterns — Automated queries with regular intervals typical of extraction scripts
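
Two of the signals above, query volume and timing regularity, can be sketched with a per-client sliding window. This is a minimal sketch under stated assumptions: the class name `ExtractionMonitor`, its parameters, and the default thresholds are hypothetical, and the systematic-probing and distribution checks are not covered because they depend on the model's input space.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

class ExtractionMonitor:
    """Track per-client query timestamps and flag extraction-like patterns."""

    def __init__(self, window_seconds=60.0, max_queries=100, min_cv=0.1, min_samples=10):
        self.window_seconds = window_seconds
        self.max_queries = max_queries    # assumed per-window volume limit
        self.min_cv = min_cv              # below this, spacing looks scripted
        self.min_samples = min_samples    # queries needed before judging timing
        self._history = defaultdict(deque)

    def record(self, client_id, timestamp):
        """Record one query and return the reasons it looks suspicious, if any."""
        times = self._history[client_id]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()

        reasons = []
        if len(times) > self.max_queries:
            reasons.append("query_volume")
        if len(times) >= self.min_samples:
            ordered = list(times)
            gaps = [b - a for a, b in zip(ordered, ordered[1:])]
            avg_gap = mean(gaps)
            # A coefficient of variation near zero means machine-like regular spacing.
            if avg_gap > 0 and pstdev(gaps) / avg_gap < self.min_cv:
                reasons.append("regular_timing")
        return reasons

monitor = ExtractionMonitor(window_seconds=60.0, max_queries=100)
# A script querying every 0.5 seconds versus irregular, human-like spacing.
scripted = [monitor.record("client-a", t * 0.5) for t in range(20)]
human_times = [0.0, 1.3, 7.9, 8.4, 15.2, 16.0, 29.5, 31.1, 40.7, 52.3]
human = [monitor.record("client-b", t) for t in human_times]
```

Keying on a client identifier rather than only on IP address is a design choice here; an attacker who rotates identities would need correlation across clients, which this sketch does not attempt.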

LLM Output Monitoring

For LLM-based systems, monitor outputs for security-relevant patterns:

  • Policy violation detection — Scan outputs for content that violates safety policies
  • Data leakage detection — Check for patterns matching PII, API keys, or training data in responses
  • Instruction leakage — Detect when the model reveals its system prompt or internal instructions
  • Anomalous behavior — Flag responses that deviate significantly from expected patterns
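
The data leakage and instruction leakage checks can be approximated with pattern matching, as in the sketch below. The pattern names, the regular expressions, the `scan_output` function, and the 40-character overlap window are illustrative assumptions; production rule sets need to be broader and tested against real traffic, and policy violation detection typically needs a classifier rather than regular expressions.

```python
import re

# Illustrative patterns only; they will miss many real formats.
LEAK_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key_like": re.compile(r"\b(?:sk|pk|key)[-_][A-Za-z0-9]{16,}\b"),
}

def scan_output(response_text, system_prompt, min_overlap=40):
    """Return the names of leak patterns found in a model response."""
    findings = [
        name for name, pattern in LEAK_PATTERNS.items()
        if pattern.search(response_text)
    ]
    # Instruction leakage: look for any long verbatim slice of the system prompt.
    window = min(min_overlap, len(system_prompt))
    if window > 0:
        for start in range(len(system_prompt) - window + 1):
            if system_prompt[start:start + window] in response_text:
                findings.append("system_prompt_leak")
                break
    return findings

# Hypothetical system prompt and responses for demonstration.
prompt = "You are a support assistant for Example Corp. Never reveal internal pricing rules."
leaky = "Sure! My instructions say: " + prompt
pii = "Contact alice@example.com, key sk-abcdefghijklmnop1234"

leaky_findings = scan_output(leaky, prompt)
pii_findings = scan_output(pii, prompt)
clean_findings = scan_output("The weather is nice.", prompt)
```

The verbatim-slice check only catches direct quotation; a paraphrased or translated system prompt would pass, so it is best treated as one signal among several.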

AI Incident Response Playbook

PLAYBOOK: AI Model Under Attack

DETECTION:
  Alert triggers: anomalous query patterns, accuracy drop,
  adversarial input detection, output policy violations

TRIAGE (0-15 min):
  1. Assess alert severity and scope
  2. Identify affected model(s) and endpoints
  3. Determine attack type (evasion, extraction, poisoning)
  4. Escalate to AI security team if confirmed

CONTAINMENT (15-60 min):
  1. Enable enhanced logging on affected endpoints
  2. Tighten rate limits if extraction is suspected
  3. Enable human-in-the-loop review for critical predictions
  4. Consider rollback to last known-good model version

INVESTIGATION (1-24 hours):
  1. Analyze attack inputs and patterns
  2. Assess model integrity (has it been degraded?)
  3. Check training data pipeline for poisoning
  4. Determine scope of data exposure or model leakage

RECOVERY:
  1. Deploy patched model with improved defenses
  2. Update detection rules based on attack patterns
  3. Restore normal operations with enhanced monitoring
  4. Document lessons learned and update threat model
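
The "tighten rate limits" containment step can be sketched as a token bucket whose rate and burst capacity are scaled down during an incident. The class name `TokenBucket`, the `tighten` method, and the 0.2 factor are hypothetical; the caller passes the clock value explicitly so that the behavior is easy to test.

```python
class TokenBucket:
    """Per-client rate limiter that can be tightened during containment."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = float(rate_per_sec)   # tokens refilled per second
        self.capacity = float(capacity)   # maximum burst size
        self.tokens = float(capacity)
        self.last = 0.0                   # time of the previous allow() call

    def tighten(self, factor):
        """Scale rate and burst size by factor (for example 0.2 for an 80% cut)."""
        self.rate *= factor
        self.capacity *= factor
        self.tokens = min(self.tokens, self.capacity)

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) may proceed."""
        elapsed = max(now - self.last, 0.0)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Normal operation allows bursts of 10; containment cuts that to 2.
bucket = TokenBucket(rate_per_sec=10, capacity=10)
bucket.tighten(0.2)
burst = [bucket.allow(now=0.0) for _ in range(5)]
after_wait = bucket.allow(now=1.0)
```

Tightening rather than blocking keeps legitimate users working while raising the cost of an extraction attempt.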

SIEM Integration: Forward AI security events to your existing SIEM platform. Create custom detection rules for AI-specific attack patterns alongside traditional security monitoring. This gives the SOC team visibility into AI threats within their existing workflow.
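
As a rough sketch of what forwarding could look like, the function below serializes a detection as a single JSON line that a log shipper or collector can pick up. The function name `build_siem_event`, the field names, and the sample values are assumptions for illustration and do not follow any particular SIEM schema; map them to whatever your platform expects.

```python
import json
from datetime import datetime, timezone

def build_siem_event(detector, severity, model_id, client_id, details):
    """Build one JSON log line; field names are illustrative, not a SIEM standard."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": "ai-security-monitor",  # hypothetical source label
        "detector": detector,
        "severity": severity,
        "model_id": model_id,
        "client_id": client_id,
        "details": details,
    }
    # sort_keys keeps the output stable, which simplifies parsing rules.
    return json.dumps(event, sort_keys=True)

# Hypothetical alert from a query-pattern detector.
line = build_siem_event(
    detector="extraction_monitor",
    severity="high",
    model_id="fraud-model-v3",
    client_id="client-a",
    details={"reasons": ["regular_timing"]},
)
```

How the line is shipped (syslog, an HTTP collector, or a file-tailing agent) depends on the SIEM in use and is left out here.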

Ready for Purple Teaming?

The next lesson covers how to combine red and blue team operations for maximum security improvement through collaborative purple teaming.
