AI Red Teaming

Master AI red teaming as a defensive discipline. 60 deep dives across 360 lessons covering foundations (RT vs blue / pen-test / audit, history, ethics, deliverables), program design (charter, recruitment, scoping, legal authorization & safe harbor, bug bounty, external vs internal), AI threat modeling (attack surface, threat actors, attack trees, kill chain, abuse cases, prioritisation), prompt-based attacks (direct & indirect injection, jailbreak taxonomy, encoding / obfuscation, multi-turn, universal suffixes, eval), model & system attacks (extraction, training-data extraction, inversion, membership inference, backdoors, supply chain), agent & tool-use attacks (agent hijacking, tool abuse, multi-agent, computer use, MCP, agent eval), multimodal attacks (vision, audio, document, deepfake, OCR / typography, multimodal eval), capability & safety evaluations (dangerous capability, CBRN uplift, cyber, persuasion, autonomy, deception, frontier suites), reporting & operations (finding lifecycle, severity, repro, responsible disclosure, vendor coordination, public disclosure), and tools, industry & future (Garak / PyRIT / Inspect, AISI & frontier patterns, NIST AI RMF / MITRE ATLAS / OWASP LLM Top 10, future of AI RT).

60Topics
360Lessons
10Categories
100%Free

AI red teaming is the defensive discipline of probing AI systems for safety, security, and policy failures — and turning those findings into fixes, evals, and disclosures that make the next release better. It sits at the intersection of classical security red-teaming (rules of engagement, kill chains, attack trees, responsible disclosure), adversarial-ML research (model extraction, training-data extraction, membership inference, adversarial perturbations), prompt-engineering tradecraft (direct and indirect injection, jailbreak taxonomy, multi-turn priming), agent-tool evaluation (Computer Use, MCP, multi-agent settings), and the operational machinery that runs in production (severity rubrics, repro packets, fix tracking, vendor coordination, public disclosure). Over the last three years it has stopped being an academic side topic and become an operating commitment for every serious AI deployment: frontier labs publish red-team findings in system cards, governments stand up AI Safety Institutes that run pre-deployment evaluations, regulators write red-teaming duties into law, and enterprises require evidence in procurement.

This track is written for the practitioners doing this work day to day: AI red teamers, security researchers extending into AI, ML engineers building eval harnesses, T&S detection engineers, RAI leads writing safety evaluations, frontier-lab safety teams, AISI evaluators, and program leaders standing up red-team functions. Every topic explains the underlying discipline (drawing on the canonical literature — adversarial-ML research, MITRE ATLAS, OWASP LLM Top 10, NIST AI RMF, AISI publications, frontier-lab system cards), the practical methodology that operationalises it, the defensive implications, and the failure modes where red-team work quietly fails to change the product. Content is conceptual and methodological — it covers attack categories at the level of taxonomy and defence implications, not as step-by-step exploit recipes. The aim is that a reader can stand up a credible AI red-team function, integrate it with engineering and governance, and defend it to boards, regulators, and customers.

All Topics

60 AI red teaming topics organized into 10 categories. Each has 6 detailed lessons with frameworks, methodologies, and operational patterns.

Red Team Foundations

Program Design

AI Threat Modeling for Red Teams

Prompt-Based Attacks

📋

Direct Prompt Injection

Reason about direct prompt injection as a defender. Learn the attack family conceptually, why it persists, the OWASP LLM01 framing, eval patterns, and the layered defences that actually help.

6 Lessons
📋

Indirect Prompt Injection

Reason about indirect prompt injection. Learn the attack family (instructions in retrieved or fetched content), the agent-trust problem, the canonical scenarios, and defence patterns.

6 Lessons
📚

Jailbreak Taxonomy

Learn the jailbreak taxonomy. Persona / role-play, hypothetical / fictional, encoding, multi-turn priming, indirect-context, latent-space — the categories defenders need to evaluate against.

6 Lessons
🔐

Encoding & Obfuscation

Reason about encoding and obfuscation as attack categories. Learn the conceptual classes (base64, leetspeak, homoglyph, low-resource language, cipher), and the input-canonicalisation defence pattern.

6 Lessons
💬

Multi-Turn Attack Patterns

Reason about multi-turn attack patterns. Learn the priming / commitment-escalation pattern, context-window exploitation, persona-drift exploitation, and the conversation-monitoring defence.

6 Lessons
📝

Universal Adversarial Suffixes

Reason about universal adversarial suffixes (Zou et al. style). Learn the concept, how researchers find them, why models are vulnerable, transferability claims, and defensive eval discipline.

6 Lessons
📊

Prompt Attack Evaluation

Evaluate prompt-attack robustness credibly. Learn benchmarks (HarmBench, AdvBench, JailbreakBench), eval set rotation, scoring rubrics, regression discipline, and slice eval.

6 Lessons

Model & System Attacks

Agent & Tool-Use Attacks

Multimodal Attacks

Capability & Safety Evaluations

Dangerous Capability Evaluations

Run dangerous-capability evaluations. Learn the canonical categories (CBRN, cyber, autonomy, persuasion), eval design rules, elicitation discipline, and result-disclosure ethics.

6 Lessons

CBRN Uplift Evaluation

Evaluate CBRN (chem / bio / radiological / nuclear) uplift conceptually. Learn the eval framing, expert review, the uplift-vs-baseline measurement, and the strict disclosure-control practice.

6 Lessons
🔐

Cyber Capability Evaluation

Evaluate cyber-offensive capabilities. Learn the eval categories (vulnerability discovery, exploit dev, social engineering, autonomous operations), CTF-style harnesses, and disclosure ethics.

6 Lessons
📢

Persuasion & Influence Evaluation

Evaluate persuasion and influence capability. Learn opinion-shift studies, IRB-style ethics, the eval design constraints, the AI-vs-human baseline, and policy-facing reporting.

6 Lessons
💻

Autonomy & Self-Replication Evaluation

Evaluate autonomy and self-replication capability. Learn the canonical task families (METR-style), success-rate measurement, sandbox containment, and threshold-tied safety commitments.

6 Lessons
🕵

Deception & Scheming Evaluation

Evaluate deception and scheming behaviour. Learn the conceptual taxonomy, eval methodologies (sandboxed setups, behavioural probes), interpretability probes, and disclosure norms.

6 Lessons
🌟

Frontier Lab Evaluation Suites

Read and replicate frontier-lab eval suites. Learn the canonical suites (Anthropic, OpenAI, GDM, US AISI, UK AISI), comparability, eval reproducibility, and the public-record use case.

6 Lessons

Reporting & Operations

Tools, Industry & Future