AI Red Teaming
Master AI red teaming as a defensive discipline. 60 deep dives across 360 lessons covering foundations (RT vs blue / pen-test / audit, history, ethics, deliverables), program design (charter, recruitment, scoping, legal authorization & safe harbor, bug bounty, external vs internal), AI threat modeling (attack surface, threat actors, attack trees, kill chain, abuse cases, prioritisation), prompt-based attacks (direct & indirect injection, jailbreak taxonomy, encoding / obfuscation, multi-turn, universal suffixes, eval), model & system attacks (extraction, training-data extraction, inversion, membership inference, backdoors, supply chain), agent & tool-use attacks (agent hijacking, tool abuse, multi-agent, computer use, MCP, agent eval), multimodal attacks (vision, audio, document, deepfake, OCR / typography, multimodal eval), capability & safety evaluations (dangerous capability, CBRN uplift, cyber, persuasion, autonomy, deception, frontier suites), reporting & operations (finding lifecycle, severity, repro, responsible disclosure, vendor coordination, public disclosure), and tools, industry & future (Garak / PyRIT / Inspect, AISI & frontier patterns, NIST AI RMF / MITRE ATLAS / OWASP LLM Top 10, future of AI RT).
AI red teaming is the defensive discipline of probing AI systems for safety, security, and policy failures — and turning those findings into fixes, evals, and disclosures that make the next release better. It sits at the intersection of classical security red-teaming (rules of engagement, kill chains, attack trees, responsible disclosure), adversarial-ML research (model extraction, training-data extraction, membership inference, adversarial perturbations), prompt-engineering tradecraft (direct and indirect injection, jailbreak taxonomy, multi-turn priming), agent-tool evaluation (Computer Use, MCP, multi-agent settings), and the operational machinery that runs in production (severity rubrics, repro packets, fix tracking, vendor coordination, public disclosure). Over the last three years it has stopped being an academic side topic and become an operating commitment for every serious AI deployment: frontier labs publish red-team findings in system cards, governments stand up AI Safety Institutes that run pre-deployment evaluations, regulators write red-teaming duties into law, and enterprises require evidence in procurement.
This track is written for the practitioners doing this work day to day: AI red teamers, security researchers extending into AI, ML engineers building eval harnesses, T&S detection engineers, RAI leads writing safety evaluations, frontier-lab safety teams, AISI evaluators, and program leaders standing up red-team functions. Every topic explains the underlying discipline (drawing on the canonical literature — adversarial-ML research, MITRE ATLAS, OWASP LLM Top 10, NIST AI RMF, AISI publications, frontier-lab system cards), the practical methodology that operationalises it, the defensive implications, and the failure modes where red-team work quietly fails to change the product. Content is conceptual and methodological — it covers attack categories at the level of taxonomy and defence implications, not as step-by-step exploit recipes. The aim is that a reader can stand up a credible AI red-team function, integrate it with engineering and governance, and defend it to boards, regulators, and customers.
All Topics
60 AI red teaming topics organized into 10 categories. Each has 6 detailed lessons with frameworks, methodologies, and operational patterns.
Red Team Foundations
AI Red Teaming Overview
Master what AI red teaming actually is. Learn the scope, the lineage from security and intelligence red-teaming, the deliverables, and the operating model used by mature programs.
Red Team vs Blue Team vs Purple
Distinguish red team from blue team and purple-team collaboration. Learn the boundary, the handoff between teams, the eval-vs-RT split, and the patterns that prevent finger-pointing.
Red Team vs Pen Test vs Audit vs Evals
Disentangle red team from related disciplines. Learn how AI red teaming relates to penetration testing, third-party audits, internal evaluations, and bug bounty programs.
History & Evolution of AI Red Teaming
Trace AI red teaming from adversarial-ML research to a regulated discipline. Learn the milestones (Goodfellow et al., Microsoft Tay, the GPT-4 system card, the AISI launch) and the lessons each cemented.
Red Team Ethics & Authorization
Operate red teams ethically. Learn authorization rigour, scope respect, dual-use research ethics, the do-no-harm rule, target consent, and the boundary you do not cross.
Deliverables & Operating Cadence
Run red team work as a delivery practice. Learn the deliverable set (findings, repros, severity, fix tracking), the operating cadence, the briefing rhythm, and the year-end retrospective.
Program Design
Red Team Program Design
Stand up an AI red team program. Learn the program charter, headcount and skills mix, reporting line, budget, KPIs, the link to safety / security / RAI, and the maturity ladder.
Recruitment & Skills Mix
Recruit a red team that covers the surface. Learn the skill archetypes (ML, security, social, domain), the diversity-of-perspective requirement, and the link from recruitment to coverage.
Engagement Scoping
Scope an engagement that produces real findings. Learn the rules-of-engagement document, target system definition, in-scope / out-of-scope behaviours, time-boxing, and acceptance criteria.
Legal Authorization & Safe Harbor
Get the legal foundation right. Learn the authorization letter, indemnity / safe-harbor clauses, CFAA and equivalents, third-party data, and cross-border legal considerations.
AI Bug Bounty Programs
Run an AI bug bounty program. Learn scope authoring, payout structure, severity rubrics, the duplicate-finding workflow, eligibility, and the link from external research to internal fixes.
External vs Internal Red Teams
Mix external and internal red-team work. Learn the strengths and limits of each, the rotation pattern, the credibility argument for external probes, and the embedded-vs-engagement decision.
AI Threat Modeling for Red Teams
AI System Attack Surface Mapping
Map the AI attack surface end to end. Learn the canonical layers (input, model, output, agent, infrastructure, supply chain), per-surface entry points, and the coverage matrix.
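A coverage matrix like the one named above can be sketched as a mapping from surface layer to the probe categories already exercised. The six layer names come from this topic; the probe categories and the `coverage_gaps` helper are illustrative, not a canonical schema.

```python
# Hypothetical coverage matrix: the six surface layers are from the topic
# above; the probe-category names are illustrative.
SURFACES = ["input", "model", "output", "agent", "infrastructure", "supply-chain"]

def coverage_gaps(matrix: dict[str, set[str]], required: set[str]) -> dict[str, set[str]]:
    """Per surface, the required probe categories not yet exercised."""
    return {
        s: required - matrix.get(s, set())
        for s in SURFACES
        if required - matrix.get(s, set())
    }

runs = {
    "input": {"injection", "encoding"},
    "agent": {"injection"},
}
print(coverage_gaps(runs, {"injection", "encoding"}))
```

Surfaces with no gaps drop out of the report, so the output is exactly the backlog of probes still owed per layer.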
Threat Actor Modeling
Model adversaries the way red-team work needs. Learn the actor taxonomy, capability tiers, motivation profiles, the kill-chain mapping, and the link to engagement scoping.
Attack Trees for AI
Build attack trees for AI systems. Learn the tree-building methodology, AND/OR nodes, leaf-cost annotation, the path-cost calculation, and the link to fix prioritisation.
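The AND/OR and path-cost mechanics above can be sketched in a few lines: AND nodes sum child costs (the attacker needs every step), OR nodes take the cheapest child (the attacker picks one route). The node names and cost annotations are illustrative.

```python
from dataclasses import dataclass, field

# Minimal attack-tree sketch: leaf costs are analyst annotations; the tree
# below and its numbers are illustrative, not a real engagement's model.
@dataclass
class Node:
    name: str
    kind: str = "leaf"            # "leaf", "and", or "or"
    cost: float = 0.0             # annotated cost for leaves
    children: list["Node"] = field(default_factory=list)

def path_cost(n: Node) -> float:
    if n.kind == "leaf":
        return n.cost
    child_costs = [path_cost(c) for c in n.children]
    # AND: attacker must do all children; OR: attacker takes the cheapest.
    return sum(child_costs) if n.kind == "and" else min(child_costs)

tree = Node("exfiltrate data", "or", children=[
    Node("direct injection", cost=2.0),
    Node("indirect path", "and", children=[
        Node("plant payload in document", cost=1.0),
        Node("get document retrieved", cost=3.0),
    ]),
])
print(path_cost(tree))  # cheapest attacker route through the tree
```

The cheapest-path cost is the number that feeds fix prioritisation: hardening the lowest-cost leaf raises the whole tree's minimum.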
AI Kill Chain
Apply kill-chain reasoning to AI attacks. Learn the canonical phases (recon, gain-foothold, escalate, monetise/spread, persist), the AI-specific adaptations, and the detection-by-phase pattern.
Abuse Case Design
Write abuse cases the way you write user stories. Learn the abuse-case template, harm-anchored framing, success criteria, the abuse-case-to-probe pipeline, and the review ritual.
Threat Prioritization for Red Teams
Prioritise red team work credibly. Learn risk scoring, capability * exposure * severity, regulator-attention weights, public-attention weights, and the quarterly prioritisation ritual.
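The capability * exposure * severity score with attention weights can be sketched as below. The 1-5 scales, the weight multipliers, and the example threat names are illustrative choices, not a standard rubric.

```python
# Risk-scoring sketch for the multiplicative model above. Factor scales
# (1-5) and the attention-weight multipliers are illustrative assumptions.
def risk_score(capability: int, exposure: int, severity: int,
               regulator_weight: float = 1.0, public_weight: float = 1.0) -> float:
    """Weights above 1.0 boost threats with regulator or public attention."""
    return capability * exposure * severity * regulator_weight * public_weight

threats = {
    "indirect injection via RAG": risk_score(4, 5, 4, regulator_weight=1.2),
    "model extraction":           risk_score(3, 2, 3),
}
ranked = sorted(threats, key=threats.get, reverse=True)
print(ranked)  # highest-scoring threat first
```

A quarterly ritual then re-scores the same list, so rank movements (not absolute numbers) drive the re-planning conversation.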
Prompt-Based Attacks
Direct Prompt Injection
Reason about direct prompt injection as a defender. Learn the attack family conceptually, why it persists, the OWASP LLM01 framing, eval patterns, and the layered defences that actually help.
Indirect Prompt Injection
Reason about indirect prompt injection. Learn the attack family (instructions in retrieved or fetched content), the agent-trust problem, the canonical scenarios, and defence patterns.
Jailbreak Taxonomy
Learn the jailbreak taxonomy. Persona / role-play, hypothetical / fictional, encoding, multi-turn priming, indirect-context, latent-space — the categories defenders need to evaluate against.
Encoding & Obfuscation
Reason about encoding and obfuscation as attack categories. Learn the conceptual classes (base64, leetspeak, homoglyph, low-resource language, cipher), and the input-canonicalisation defence pattern.
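The input-canonicalisation defence pattern can be sketched with the standard library: NFKC normalisation folds fullwidth and compatibility homoglyph variants, then zero-width characters are stripped, and a crude heuristic flags long base64-like runs for separate handling. The zero-width list and the regex threshold are illustrative, not exhaustive.

```python
import re
import unicodedata

# Canonicalisation sketch: NFKC folds compatibility variants (e.g. fullwidth
# letters), then zero-width characters are removed. Both lists below are
# illustrative assumptions, not a complete defence.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
B64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def canonicalize(text: str) -> str:
    return unicodedata.normalize("NFKC", text).translate(ZERO_WIDTH)

def looks_like_base64(text: str) -> bool:
    """Flag long base64-like runs for decode-and-rescan handling."""
    return bool(B64_RUN.search(text))

print(canonicalize("ig\u200bnore previous"))  # "ignore previous"
```

Canonicalise before any classifier or policy check runs; filters applied to the raw string are exactly what these obfuscation classes evade.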
Multi-Turn Attack Patterns
Reason about multi-turn attack patterns. Learn the priming / commitment-escalation pattern, context-window exploitation, persona-drift exploitation, and the conversation-monitoring defence.
Universal Adversarial Suffixes
Reason about universal adversarial suffixes (Zou et al. style). Learn the concept, how researchers find them, why models are vulnerable, transferability claims, and defensive eval discipline.
Prompt Attack Evaluation
Evaluate prompt-attack robustness credibly. Learn benchmarks (HarmBench, AdvBench, JailbreakBench), eval set rotation, scoring rubrics, regression discipline, and slice eval.
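The slice-eval idea above reduces to computing attack success rate (ASR) per attack category rather than one aggregate number. In this sketch each record is a (slice, succeeded) pair; the slice names and data are illustrative, and a real harness would score judge or classifier outputs instead.

```python
from collections import defaultdict

# Per-slice attack-success-rate sketch. Slice names and results below are
# illustrative; real runs come from a scored eval harness.
def asr_by_slice(results: list[tuple[str, bool]]) -> dict[str, float]:
    hits, totals = defaultdict(int), defaultdict(int)
    for slc, success in results:
        totals[slc] += 1
        hits[slc] += int(success)
    return {slc: hits[slc] / totals[slc] for slc in totals}

runs = [
    ("encoding", True), ("encoding", False),
    ("multi-turn", True), ("multi-turn", True),
]
print(asr_by_slice(runs))
```

An aggregate ASR hides exactly the regressions slice eval exists to catch: a model can improve overall while one category (here, multi-turn) stays fully broken.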
Model & System Attacks
Model Extraction Attacks
Reason about model extraction. Learn the conceptual attack (query-and-imitate), the IP and safety implications, query-budget defences, watermarking research, and the legal landscape.
Training Data Extraction
Reason about training-data extraction. Learn the canonical research (Carlini et al.), memorisation drivers, evaluation methodology, and the defence stack (deduplication, DP, output filters).
Model Inversion
Reason about model inversion. Learn the conceptual attack against prediction APIs, the privacy implications, the evaluation methodology, and the defence stack.
Membership Inference (Red Team Lens)
Apply membership-inference attacks for evaluation. Learn shadow-model methodology, LiRA, the link to overfitting and duplication, and the defence pattern (DP, regularisation).
Backdoor & Trojan Detection
Hunt for backdoors and trojans in third-party models. Learn the conceptual threat, neural-cleanse-style detection, weight inspection, dataset inspection, and the supply-chain framing.
Model Supply Chain Red Teaming
Red-team the model supply chain. Learn provenance tracking, model-card review, weights integrity, third-party fine-tunes, the SBOM-equivalent for AI, and the procurement linkage.
Agent & Tool-Use Attacks
Agent Hijacking
Reason about agent hijacking. Learn the conceptual attack family, the agent-trust boundary, real-world precedents from frontier-lab disclosures, and the containment-defence pattern.
Tool Abuse Patterns
Map tool-abuse patterns conceptually. Learn the family (over-permissioned tools, unsanitised tool outputs, missing rate-limits, ambiguous tool semantics), and the tool-design hardening pattern.
Multi-Agent Attack Patterns
Red-team multi-agent systems. Learn the conceptual attack classes (compromised peer, role-confusion, supervisor-injection, broker exploitation), and the multi-agent containment pattern.
Computer Use & Browser Agent Attacks
Red-team computer-use and browser agents. Learn the conceptual attack surface (on-screen injection, OS dialogs, hostile pages, file system, credentials), and the strict-sandbox defence.
MCP Attack Surface
Red-team Model Context Protocol (MCP) integrations. Learn the conceptual surface (server impersonation, tool-description injection, resource exfiltration), and the defensive review pattern.
Agent Red-Team Evaluations
Evaluate agentic AI under adversarial pressure. Learn agent benchmarks under attack (METR, SWE-bench, AgentBench), red-team-specific harnesses, scoring, and operational guard rails.
Multimodal Attacks
Vision-Based Jailbreaks
Reason about vision-based jailbreaks. Learn the conceptual classes (typography in image, OCR-driven, adversarial perturbation, visual obfuscation), and the multimodal defence layers.
Audio Attack Patterns
Reason about audio attack patterns. Learn ultrasonic / inaudible commands, ASR misdirection, voice-clone-as-input, the unique surface for voice agents, and the defence stack.
Document & PDF Attacks
Red-team document and PDF surfaces. Learn the conceptual surface (hidden text, layered images, malicious metadata, hostile attachments), and the document-handling hardening pattern.
Deepfake-as-Input Attacks
Red-team systems that take audio / video input where deepfakes are the threat. Learn the eval set, the C2PA / provenance defence, liveness checks, and the verification-tier pattern.
OCR & Typography Attacks
Reason about OCR and typography as an attack surface. Learn the conceptual cases (homoglyphs, RTL embedding, zero-width chars, similar-looking glyphs), and canonicalisation defences.
Multimodal Red-Team Evaluation
Evaluate multimodal systems under adversarial pressure. Learn cross-modal eval design, slice eval per modality, the modality-bridging attack class, and reporting standards.
Capability & Safety Evaluations
Dangerous Capability Evaluations
Run dangerous-capability evaluations. Learn the canonical categories (CBRN, cyber, autonomy, persuasion), eval design rules, elicitation discipline, and result-disclosure ethics.
CBRN Uplift Evaluation
Evaluate CBRN (chem / bio / radiological / nuclear) uplift conceptually. Learn the eval framing, expert review, the uplift-vs-baseline measurement, and the strict disclosure-control practice.
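The uplift-vs-baseline measurement reduces to comparing task scores between participants with model access and a matched baseline group. This sketch shows only the arithmetic core; the scores are illustrative, and real studies layer expert review, matched cohorts, and confidence intervals on top.

```python
from statistics import mean

# Uplift sketch: difference in mean task score between the model-assisted
# group and a matched baseline group. Scores below are illustrative only.
def uplift(with_model: list[float], baseline: list[float]) -> float:
    return mean(with_model) - mean(baseline)

print(uplift([0.6, 0.7, 0.65], [0.5, 0.55, 0.45]))
```

The point of the baseline is that raw assisted-group performance is meaningless for uplift claims; only the delta against what the same population achieves without the model supports a disclosure decision.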
Cyber Capability Evaluation
Evaluate cyber-offensive capabilities. Learn the eval categories (vulnerability discovery, exploit dev, social engineering, autonomous operations), CTF-style harnesses, and disclosure ethics.
Persuasion & Influence Evaluation
Evaluate persuasion and influence capability. Learn opinion-shift studies, IRB-style ethics, the eval design constraints, the AI-vs-human baseline, and policy-facing reporting.
Autonomy & Self-Replication Evaluation
Evaluate autonomy and self-replication capability. Learn the canonical task families (METR-style), success-rate measurement, sandbox containment, and threshold-tied safety commitments.
Deception & Scheming Evaluation
Evaluate deception and scheming behaviour. Learn the conceptual taxonomy, eval methodologies (sandboxed setups, behavioural probes), interpretability probes, and disclosure norms.
Frontier Lab Evaluation Suites
Read and replicate frontier-lab eval suites. Learn the canonical suites (Anthropic, OpenAI, GDM, US AISI, UK AISI), comparability, eval reproducibility, and the public-record use case.
Reporting & Operations
Finding Lifecycle (Discover → Fix)
Run findings as a delivery pipeline. Learn the finding lifecycle, intake, triage, repro, severity, fix tracking, regression-test creation, and the close-out criterion that holds.
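The pipeline above can be encoded as a small state machine so that illegal jumps (for example, closing a finding before a regression test exists) fail loudly. The stage names follow this topic; the allowed-transition map is one plausible encoding, not a mandated workflow.

```python
from enum import Enum, auto

# Finding-lifecycle sketch. The transition map is an illustrative choice;
# teams will tune it (e.g. allowing severity re-triage after a failed fix).
class Stage(Enum):
    INTAKE = auto()
    TRIAGE = auto()
    REPRO = auto()
    SEVERITY = auto()
    FIX = auto()
    REGRESSION = auto()
    CLOSED = auto()

ALLOWED = {
    Stage.INTAKE: {Stage.TRIAGE},
    Stage.TRIAGE: {Stage.REPRO, Stage.CLOSED},   # closed here = rejected/duplicate
    Stage.REPRO: {Stage.SEVERITY},
    Stage.SEVERITY: {Stage.FIX},
    Stage.FIX: {Stage.REGRESSION},
    Stage.REGRESSION: {Stage.CLOSED},            # close-out requires a regression test
}

def advance(current: Stage, nxt: Stage) -> Stage:
    if nxt not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

Making REGRESSION the only path to CLOSED is the "close-out criterion that holds": a finding cannot be retired without the test that would catch its recurrence.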
Severity Rubrics for AI Findings
Score AI findings consistently. Learn severity-rubric design (impact * exposure * exploitability), AI-specific dimensions, the OWASP / CVSS adaptation, and the calibration ritual.
Reproducible Repro Packages
Build repro packets engineering can act on. Learn the packet contents (prompt, expected vs actual, environment, model version, harness), the privacy redaction step, and the audit trail.
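The packet contents listed above map naturally onto a small serialisable record with a redaction pass before it leaves the red team. The field names are one plausible schema, and the redaction here is a trivial placeholder substitution, not a complete privacy step.

```python
import json
from dataclasses import asdict, dataclass

# Repro-packet sketch with the fields named above. Field names, the example
# values, and the string-substitution redaction are illustrative assumptions.
@dataclass
class ReproPacket:
    finding_id: str
    prompt: str
    expected: str
    actual: str
    model_version: str
    environment: str
    harness: str

def redact(packet: ReproPacket, secrets: list[str]) -> ReproPacket:
    fields = asdict(packet)
    for key, value in fields.items():
        for secret in secrets:
            value = value.replace(secret, "[REDACTED]")
        fields[key] = value
    return ReproPacket(**fields)

pkt = ReproPacket("RT-042", "summarise this attached document", "refusal",
                  "leaked key sk-test-123", "model-2025-01", "staging", "harness-v2")
print(json.dumps(asdict(redact(pkt, ["sk-test-123"])), indent=2))
```

Pinning the model version and harness in the packet is what makes "works on my machine" disputes with engineering resolvable; the redaction pass is what makes the packet safe to attach to a ticket.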
Responsible Disclosure for AI
Disclose AI findings responsibly. Learn the AI-specific adaptation of CVD, embargo windows, multi-vendor coordination, the ethics of deferring publication, and the public-disclosure decision.
Vendor Coordination
Coordinate with vendors and frontier labs. Learn intake channels, expected response SLAs, escalation paths, the abuse-of-disclosure failure mode, and the long-term relationship management.
Public Disclosure & System Cards
Contribute to public disclosure (system cards, transparency reports, advisories). Learn the audience-specific framing, the language discipline, the verification ritual, and the legal review pattern.
Tools, Industry & Future
Red Team Tools & Frameworks
Navigate the red-team tools landscape. Learn Garak, PyRIT, Inspect (UK AISI), Promptfoo, custom harnesses, the build-vs-adopt decision, and the integration into CI / pre-release gates.
AISI & Frontier-Lab Patterns
Read AISI and frontier-lab patterns. Learn the US AISI, the UK AISI, the EU AI Office, the pre-deployment evaluation pattern, contractual access, and the public-record approach.
Red Team Standards & Frameworks
Adopt red-team standards and frameworks. Learn NIST AI RMF, MITRE ATLAS, OWASP LLM Top 10, ISO/IEC 23894, AISI patterns, and the standards-mapping discipline.
Future of AI Red Teaming
Reason about where AI red teaming is heading. Learn the autonomous-agent threat curve, frontier-capability eval consolidation, the regulator-driven floor, and the strategic-posture template.
Lilly Tech Systems