Prompt Attack Evaluation

Evaluate prompt-attack robustness credibly. Learn the canonical benchmarks (HarmBench, AdvBench, JailbreakBench, MLCommons AILuminate, AIR-Bench), eval-set rotation discipline (never let the model train on your benchmark), scoring rubrics (refusal / partial-refusal / unsafe), regression discipline across releases, slice evaluation per language and category, and the failure mode of single-benchmark thinking.
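
As a rough sketch of the rubric-and-slices idea above: the Python below tallies refusal / partial-refusal / unsafe labels per language-and-category slice, then flags per-slice regressions between two releases. The record schema, label strings, function names, and tolerance are illustrative assumptions, not any benchmark's actual format.

```python
from collections import defaultdict

# Assumed three-way rubric labels; real graders (e.g. a HarmBench-style
# classifier) differ, but the split mirrors the rubric named above.
REFUSAL, PARTIAL_REFUSAL, UNSAFE = "refusal", "partial-refusal", "unsafe"

def slice_unsafe_rates(records):
    """Unsafe rate per (language, category) slice.

    `records` is an iterable of dicts with keys "language", "category",
    and "label" -- an assumed schema for illustration only.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for r in records:
        key = (r["language"], r["category"])
        totals[key] += 1
        if r["label"] == UNSAFE:
            unsafe[key] += 1
    return {k: unsafe[k] / totals[k] for k in totals}

def regressions(prev_rates, curr_rates, tolerance=0.01):
    """Slices whose unsafe rate worsened beyond `tolerance` since the
    previous release; a slice absent from the old run counts as 0.0."""
    return {
        k: (prev_rates.get(k, 0.0), rate)
        for k, rate in curr_rates.items()
        if rate > prev_rates.get(k, 0.0) + tolerance
    }
```

Comparing per-slice rates across releases, rather than one aggregate score, is what catches the regression that a single headline number would hide: a model can improve overall while getting worse in one language or harm category.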

6 lessons · 📋 Templates · Practitioner-ready · 100% free

Lessons in This Topic

Work through these 6 lessons in order, or jump to whichever is most relevant.