Advanced

Constitutional AI

Explore how Constitutional AI (CAI), RLHF, and value alignment training create models with built-in resistance to jailbreaks — moving safety from the prompt layer into the model's weights.

From Prompt-Level to Model-Level Safety

System prompt hardening is essential but insufficient on its own. The most robust defense comes from training models that inherently understand and respect safety boundaries, not just follow instructions about them.

💡
Key insight: A model that "wants" to be safe is harder to jailbreak than a model that is merely "told" to be safe. Constitutional AI and RLHF embed safety values directly into the model's behavior patterns.

RLHF (Reinforcement Learning from Human Feedback)

RLHF is the technique most widely used to turn pretrained ("raw") language models into instruction-following, aligned assistants:

RLHF Training Pipeline
# Step 1: Supervised Fine-Tuning (SFT)
Base Model + Human-written examples → SFT Model

# Step 2: Reward Model Training
SFT Model generates multiple responses
Human raters rank responses by quality and safety
Rankings train a Reward Model

# Step 3: PPO Optimization
SFT Model generates responses
Reward Model scores each response
PPO updates the model to maximize the reward score, typically with a KL penalty that keeps it close to the SFT model
Result: RLHF-Aligned Model
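
Step 2 is commonly implemented with a pairwise (Bradley-Terry style) ranking loss: the reward model is trained so that the response human raters preferred scores higher than the one they rejected. The sketch below is a minimal illustration, assuming scalar reward scores have already been computed; the function name `pairwise_ranking_loss` and the example scores are hypothetical, not part of any specific library.

```python
import math

def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when the order is wrong.
    """
    margin = score_chosen - score_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical scores: the reward model agrees with the rater in the first
# case and disagrees in the second, so the second loss is larger.
agrees = pairwise_ranking_loss(score_chosen=2.0, score_rejected=0.0)
disagrees = pairwise_ranking_loss(score_chosen=0.0, score_rejected=2.0)
```

In practice this loss would be averaged over many ranked pairs and minimized with gradient descent on the reward model's parameters.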

RLHF Limitations for Jailbreak Prevention

  • Reward hacking: The model can exploit flaws in the reward model, learning to look safe to the scorer without behaving safely in general
  • Distribution shift: Jailbreak prompts are often out-of-distribution from training examples
  • Scaling costs: Human feedback is expensive and difficult to scale to cover all attack vectors
  • Annotator disagreement: Raters may disagree on what constitutes a harmful response

Constitutional AI (CAI)

Developed by Anthropic, Constitutional AI reduces RLHF's dependence on human-labeled harm data by using a written set of principles (a "constitution") to guide the model's self-critique and to generate AI preference labels:

Constitutional AI Process
# Phase 1: Supervised Learning from Principles
Step 1: Model generates initial response to a potentially harmful
        prompt (the response may be harmful)
Step 2: Model critiques its own response using constitutional
        principles (e.g., "Is this response harmful?")
Step 3: Model revises response based on self-critique
Step 4: Model is fine-tuned on the revised responses

# Phase 2: RLAIF (RL from AI Feedback)
Step 1: Model generates multiple responses
Step 2: AI evaluates responses against constitution
Step 3: AI-generated preferences train a reward model
Step 4: RL optimizes the model using this reward model
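
The two phases above can be sketched in a few lines. This is a simplified illustration, not Anthropic's implementation: `generate` and `judge` are hypothetical callables standing in for model calls, the prompt wording is invented for the example, and the loop applies every principle in order as a simplifying assumption.

```python
from typing import Callable, Dict, List

def critique_and_revise(
    generate: Callable[[str], str],
    user_prompt: str,
    principles: List[str],
) -> Dict[str, str]:
    """Phase 1: draft, self-critique, revise; return one fine-tuning example."""
    response = generate(user_prompt)
    for principle in principles:
        critique = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # Only the prompt and the final revision are kept as training data.
    return {"prompt": user_prompt, "completion": response}

def label_preference(
    judge: Callable[[str], str],
    principle: str,
    user_prompt: str,
    response_a: str,
    response_b: str,
) -> Dict[str, str]:
    """Phase 2: ask a judge model which response better fits the principle."""
    question = (
        f"User: {user_prompt}\n\n{principle}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Answer with A or B."
    )
    answer = judge(question).strip().upper()
    if answer == "A":
        chosen, rejected = response_a, response_b
    elif answer == "B":
        chosen, rejected = response_b, response_a
    else:
        # Deliberately strict parsing: anything else is treated as an error.
        raise ValueError(f"Unparseable judge answer: {answer!r}")
    # Chosen/rejected pairs like this one are what the reward model trains on.
    return {"prompt": user_prompt, "chosen": chosen, "rejected": rejected}
```

The output of `critique_and_revise` feeds supervised fine-tuning, while the pairs from `label_preference` play the role that human rankings play in RLHF.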

Example Constitutional Principles

Principle (illustrative) | Purpose
"Choose the response that is least likely to be used for harmful purposes" | Reduces harmful content generation
"Choose the response that is most respectful of everyone's autonomy" | Discourages manipulation and coercion
"Choose the response that is most honest and transparent" | Discourages deception and unfounded claims
"Choose the response that would be judged most suitable by a thoughtful senior employee" | General safety and appropriateness

Value Alignment and Robustness

Models trained with CAI and RLHF develop several properties that make jailbreaking harder:

Internalized Values

Safety behaviors are embedded in model weights, not just in prompt instructions. The model behaves as if it "understands" why certain content is harmful, rather than matching requests against a list of banned instructions.

Generalized Refusal

The model can often recognize novel harmful requests even if they do not match any specific pattern it was trained on, although this generalization is not guaranteed.

Consistent Behavior

CAI-trained models tend to maintain safety behaviors across long conversations and context switches, where prompt-based defenses often weaken.

Nuanced Judgment

Rather than rigid keyword blocking, the model can assess the intent and context of requests with more nuance.

Combining Model-Level and Application-Level Defenses

The strongest jailbreak prevention combines both approaches:

  • Model-level: CAI/RLHF training provides a strong baseline of safety
  • System prompt: Hardened prompts add application-specific rules
  • Input filtering: Catches known attack patterns before they reach the model
  • Output monitoring: Validates model responses for safety compliance
  • Continuous red teaming: Identifies new vulnerabilities as they emerge

Practical takeaway: When choosing an LLM for a safety-critical application, prioritize models with strong alignment training (such as Claude, GPT-4, or Gemini). Then add your own defense layers on top. No single layer is sufficient on its own.
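
One way to picture the layering is a thin wrapper around the model call. The sketch below is illustrative only: `call_model` and `is_output_safe` are hypothetical callables you would supply, the two regex patterns are toy examples, and real input filters and output monitors need far broader coverage.

```python
import re
from typing import Callable, List

# Hypothetical patterns for illustration; not a complete attack list.
KNOWN_ATTACK_PATTERNS: List[re.Pattern] = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"\bDAN mode\b", re.IGNORECASE),
]

REFUSAL = "Sorry, I can't help with that."

def guarded_reply(
    call_model: Callable[[str, str], str],
    is_output_safe: Callable[[str], bool],
    system_prompt: str,
    user_input: str,
) -> str:
    """Apply input filtering, then the model, then output monitoring."""
    # Input filtering: block known attack phrasings before the model sees them.
    if any(p.search(user_input) for p in KNOWN_ATTACK_PATTERNS):
        return REFUSAL
    # Model-level safety plus the hardened system prompt.
    reply = call_model(system_prompt, user_input)
    # Output monitoring: validate the response before returning it.
    if not is_output_safe(reply):
        return REFUSAL
    return reply
```

Each layer can fail independently, which is the reason for stacking them: a prompt that slips past the regex filter still has to get past the aligned model and the output check. Continuous red teaming then supplies new patterns and test cases for each layer.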