Intermediate

Reinforcement Learning from Human Feedback

RLHF is the primary technique used to align large language models with human intentions. Learn how it works, why it is effective, where it falls short, and which alternatives are emerging.

What is RLHF?

Reinforcement Learning from Human Feedback (RLHF) is a training technique that uses human preferences to guide model behavior. Instead of optimizing a hand-crafted reward function, RLHF learns what humans consider good output by asking them to compare and rank model responses.

The Three Stages of RLHF

  1. Stage 1: Supervised Fine-Tuning (SFT)

    Start with a pre-trained language model and fine-tune it on high-quality demonstrations of desired behavior. Human annotators write ideal responses to a variety of prompts, creating a supervised dataset.

  2. Stage 2: Reward Model Training

    Generate multiple responses from the SFT model for each prompt. Human annotators rank the responses from best to worst. Train a separate "reward model" to predict human preferences based on these rankings.

  3. Stage 3: RL Optimization (PPO)

    Use the reward model as a scoring function and optimize the language model using Proximal Policy Optimization (PPO). The model learns to generate responses that the reward model scores highly, while a KL-divergence penalty prevents it from drifting too far from the SFT model.
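The rankings collected in Stage 2 are usually converted into pairwise comparisons and the reward model is trained with a Bradley-Terry objective: the loss is low when the model scores the preferred ("chosen") response above the rejected one. A minimal numeric sketch (the scores are illustrative, not from a real model):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Shrinks as the margin between the chosen and rejected scores grows."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correct ordering with a clear margin: small loss
print(round(reward_model_loss(2.0, 0.0), 4))
# No margin: loss is exactly log 2
print(round(reward_model_loss(0.0, 0.0), 4))
# Wrong ordering (rejected scored higher): large loss
print(round(reward_model_loss(-1.0, 1.0), 4))
```

In practice this loss is summed over many preference pairs and backpropagated through a full transformer; the scalar version above just shows the shape of the training signal.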

RLHF Pipeline Overview
# Simplified RLHF Pipeline

# Stage 1: Supervised Fine-Tuning
sft_model = pretrained_model.fine_tune(demonstration_data)

# Stage 2: Reward Model
# For each prompt, generate multiple responses
# Human ranks: Response A > Response C > Response B
reward_model = train_reward_model(human_preferences)

# Stage 3: RL Optimization
current_model = sft_model.copy()  # policy starts from the SFT model
for batch in training_data:
    # Sample from the current policy, not the frozen SFT model
    responses = current_model.generate(batch.prompts)
    rewards = reward_model.score(batch.prompts, responses)
    # KL(policy || SFT reference) keeps the policy near its starting point
    kl_penalty = kl_divergence(current_model, sft_model)
    total_reward = rewards - beta * kl_penalty
    current_model.update(total_reward)  # PPO update
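In practice the KL penalty in the loop above is estimated per token from log-probabilities of the sampled response under the current policy and the frozen SFT reference. A minimal sketch with made-up log-prob values (illustrative only, not from a real model):

```python
def kl_penalty(logprobs_current, logprobs_ref):
    """Monte-Carlo estimate of KL(current || ref) over one sampled response:
    the mean per-token gap between the two models' log-probs."""
    gaps = [lp_cur - lp_ref for lp_cur, lp_ref in zip(logprobs_current, logprobs_ref)]
    return sum(gaps) / len(gaps)

# Per-token log-probs for the same sampled tokens under each model
logp_current = [-1.2, -0.8, -2.1]  # current policy
logp_ref     = [-1.5, -1.4, -2.0]  # frozen SFT reference

beta = 0.1
penalty = kl_penalty(logp_current, logp_ref)
# Reward-model score (here 1.0, illustrative) minus the scaled KL term
shaped_reward = 1.0 - beta * penalty
print(round(penalty, 4), round(shaped_reward, 4))
```

A positive penalty means the policy has drifted toward tokens the reference considers less likely; the subtraction pushes back against that drift.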

Strengths of RLHF

  • Captures nuance: Human preferences encode complex, context-dependent values that are hard to specify explicitly
  • Reduces harmful outputs: Models trained with RLHF produce significantly fewer toxic, biased, or dangerous responses
  • Improves helpfulness: RLHF models are better at following instructions and providing useful, relevant answers
  • Scalable: Once trained on an initial set of human labels, the reward model can supply training signal for billions of examples

Limitations of RLHF

  • Reward hacking: The model may learn to exploit flaws in the reward model rather than genuinely improving quality
  • Annotator disagreement: Different humans have different preferences; the model may learn an inconsistent average
  • Sycophancy: Models learn to tell users what they want to hear rather than providing accurate information
  • Costly labeling: High-quality human preference data is expensive and time-consuming to collect
  • Mode collapse: Excessive RLHF training can reduce the diversity and creativity of model outputs

Alternatives to RLHF

DPO (Direct Preference Optimization)

Eliminates the need for a separate reward model by directly optimizing the language model using preference pairs. Simpler, more stable, and computationally cheaper than PPO-based RLHF.
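The DPO objective scores each preference pair directly: it compares the policy-vs-reference log-probability ratio on the chosen response against the same ratio on the rejected one, and applies a sigmoid loss to the margin. A minimal scalar sketch (the log-prob values are illustrative, not from a real model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin), where
    the margin is the policy-vs-reference log-ratio on the chosen response (w)
    minus the same ratio on the rejected response (l)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy raised the chosen response's log-prob relative to the reference and
# lowered the rejected one's: positive margin, loss below log 2 (~0.693)
print(dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0))
```

Because the margin is computed from log-probs the policy already produces, no separate reward model or RL sampling loop is needed; training reduces to ordinary gradient descent on preference pairs.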

RLAIF (RL from AI Feedback)

Uses a more capable AI model to generate preference labels instead of humans. Dramatically reduces cost while maintaining quality for many tasks.

Constitutional AI

Defines a set of principles and trains the model to self-critique and revise responses. Combines RLAIF with explicit value specification.

ORPO / SimPO

Newer techniques that further simplify preference optimization, combining SFT and preference learning into a single training stage.

💡 Key Takeaway: RLHF is not a complete solution to alignment, but it is currently the most effective practical technique. The field is rapidly evolving, and newer methods like DPO and Constitutional AI are addressing many of RLHF's limitations.