Constitutional AI
Explore how Constitutional AI (CAI), RLHF, and value alignment training build resistance to jailbreaks into the model itself, moving safety from the prompt layer into the model's weights.
From Prompt-Level to Model-Level Safety
System prompt hardening is essential but insufficient on its own. The most robust defense comes from training models that inherently understand and respect safety boundaries, not just follow instructions about them.
RLHF (Reinforcement Learning from Human Feedback)
RLHF is the foundational technique that transformed raw language models into aligned assistants:
    # Step 1: Supervised Fine-Tuning (SFT)
    Base Model + Human-written examples → SFT Model

    # Step 2: Reward Model Training
    SFT Model generates multiple responses
    Human raters rank responses by quality and safety
    Rankings train a Reward Model

    # Step 3: PPO Optimization
    SFT Model generates responses
    Reward Model scores each response
    PPO updates model to maximize reward score

    Result: RLHF-Aligned Model
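The pipeline above does not say which loss trains the reward model in Step 2. A common choice is a pairwise ranking loss that pushes the reward of the preferred response above the reward of the rejected one. The sketch below assumes that loss; the function names and the plain-float rewards are illustrative, not taken from any particular library.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)) for one ranked pair."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the loss over (chosen_reward, rejected_reward) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss is small when the reward model already scores the human-preferred response higher and grows roughly linearly when it ranks the pair the wrong way round.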
RLHF Limitations for Jailbreak Prevention
- Reward hacking: Models may learn to appear safe without genuinely understanding safety
- Distribution shift: Jailbreak prompts are often out-of-distribution from training examples
- Scaling costs: Human feedback is expensive and difficult to scale to cover all attack vectors
- Annotator disagreement: Raters may disagree on what constitutes a harmful response
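Annotator disagreement can at least be measured before the rankings are used for training. The following is a minimal sketch, assuming each comparison has been labeled by several raters with a string such as "A" or "B". The helper names and the 0.75 threshold are hypothetical choices for illustration.

```python
from collections import Counter


def agreement_rate(labels: list[str]) -> float:
    """Fraction of raters who chose the majority label for one comparison."""
    if not labels:
        raise ValueError("need at least one label")
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)


def low_agreement_items(ratings: dict[str, list[str]], threshold: float = 0.75) -> list[str]:
    """Return ids of comparisons whose agreement falls below the threshold."""
    return [item for item, labels in ratings.items() if agreement_rate(labels) < threshold]
```

Comparisons flagged this way could be re-annotated, down-weighted, or reviewed against clearer guidelines rather than fed directly into reward model training.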
Constitutional AI (CAI)
Developed by Anthropic, Constitutional AI addresses RLHF limitations by using a set of principles (a "constitution") to guide the model's self-improvement:
    # Phase 1: Supervised Learning from Principles
    Step 1: Model generates initial response (may be harmful)
    Step 2: Model critiques its own response using constitutional principles (e.g., "Is this response harmful?")
    Step 3: Model revises response based on self-critique
    Step 4: Revised responses become training data

    # Phase 2: RLAIF (RL from AI Feedback)
    Step 1: Model generates multiple responses
    Step 2: AI evaluates responses against constitution
    Step 3: AI-generated preferences train a reward model
    Step 4: RL optimizes the model using this reward model
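Phase 1 can be sketched as three model calls per prompt. The code below is an illustration under stated assumptions: `generate` is a hypothetical callable that takes a prompt string and returns text, and the prompt templates are invented for this example rather than taken from any published implementation.

```python
from typing import Callable

# Hypothetical signature: takes a prompt string, returns the model's text.
Generate = Callable[[str], str]


def critique_and_revise(prompt: str, principle: str, generate: Generate) -> dict[str, str]:
    """Run one critique-revision pass and return a supervised training pair."""
    # Step 1: initial response, which may be harmful.
    initial = generate(prompt)
    # Step 2: the model critiques its own response against a principle.
    critique = generate(
        f"Request: {prompt}\nResponse: {initial}\n"
        f"Critique the response using this principle: {principle}"
    )
    # Step 3: the model revises the response based on the critique.
    revised = generate(
        f"Request: {prompt}\nResponse: {initial}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    # Step 4: only the original prompt and the revision become training data.
    return {"prompt": prompt, "completion": revised}
```

Because `generate` is passed in, the loop can be exercised with a stub function before connecting it to a real model.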
Example Constitutional Principles
| Principle | Purpose |
|---|---|
| "Choose the response that is least likely to be used for harmful purposes" | Reduces harmful content generation |
| "Choose the response that is most respectful of everyone's autonomy" | Prevents manipulation and coercion |
| "Choose the response that is most honest and transparent" | Reduces deception and hallucination |
| "Choose the response that would be judged most suitable by a thoughtful senior employee" | General safety and appropriateness |
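Principles phrased as "Choose the response that..." map naturally onto the pairwise comparisons used in the RLAIF phase. The sketch below assumes a hypothetical `judge` callable that returns "A" or "B" for a comparison prompt. The prompt layout and the record fields are illustrative assumptions.

```python
import random
from typing import Callable

# Hypothetical judge: takes a comparison prompt, returns "A" or "B".
Judge = Callable[[str], str]

PRINCIPLES = [
    "Choose the response that is least likely to be used for harmful purposes",
    "Choose the response that is most honest and transparent",
]


def label_preference(
    prompt: str, response_a: str, response_b: str, judge: Judge, rng: random.Random
) -> dict[str, str]:
    """Ask the judge to pick a response under one randomly sampled principle."""
    principle = rng.choice(PRINCIPLES)
    verdict = judge(
        f"{principle}\n\nRequest: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        "Answer with A or B."
    ).strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    chosen, rejected = (response_a, response_b) if verdict == "A" else (response_b, response_a)
    # The (chosen, rejected) pair is what a reward model would be trained on.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "principle": principle}
```

Sampling one principle per comparison is a design choice made here for simplicity; a pipeline could equally evaluate each pair against several principles and aggregate the verdicts.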
Value Alignment and Robustness
Models trained with CAI and RLHF tend to develop several properties that make jailbreaking harder, although none of them makes a model immune:
Internalized Values
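Safety behaviors are embedded in the model's weights rather than only in prompt instructions, so they do not depend on a system prompt remaining intact. In this sense the model has learned why certain content is harmful instead of simply following a rule about it.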
Generalized Refusal
The model can recognize novel harmful requests even if they do not match any specific pattern it was trained on.
Consistent Behavior
CAI-trained models tend to maintain safety behaviors across long conversations and context switches, where prompt-based defenses often weaken.
Nuanced Judgment
Rather than rigid keyword blocking, the model can assess the intent and context of requests with more nuance.
Combining Model-Level and Application-Level Defenses
The strongest jailbreak prevention combines both approaches:
- Model-level: CAI/RLHF training provides a strong baseline of safety
- System prompt: Hardened prompts add application-specific rules
- Input filtering: Catches known attack patterns before they reach the model
- Output monitoring: Validates model responses for safety compliance
- Continuous red teaming: Identifies new vulnerabilities as they emerge
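The layers listed above can be composed in a thin wrapper around the model call. This is a minimal sketch, not a production design: the regular expressions, the refusal text, and the `model` and `output_is_safe` callables are all hypothetical, and `model` is assumed to already include the hardened system prompt and the CAI/RLHF-trained model behind it.

```python
import re
from typing import Callable

# Illustrative patterns only; a real deployment would maintain a larger, tested set.
KNOWN_ATTACK_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"\bDAN mode\b", re.IGNORECASE),
]

REFUSAL = "Sorry, I can't help with that."


def input_filter(user_text: str) -> bool:
    """Return True when the text matches a known attack pattern."""
    return any(pattern.search(user_text) for pattern in KNOWN_ATTACK_PATTERNS)


def guarded_reply(
    user_text: str,
    model: Callable[[str], str],
    output_is_safe: Callable[[str], bool],
) -> str:
    """Apply input filtering, call the model, then validate the output."""
    if input_filter(user_text):
        return REFUSAL  # blocked before reaching the model
    reply = model(user_text)
    # Output monitoring: replace the reply if the checker rejects it.
    return reply if output_is_safe(reply) else REFUSAL
```

Pattern matching of this kind only catches attacks that are already known, which is why the list pairs it with model-level training and continuous red teaming.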