Reinforcement Learning from Human Feedback
RLHF is the primary technique used to align large language models with human intentions. Learn how it works, why it is effective, its limitations, and the emerging alternatives.
What is RLHF?
Reinforcement Learning from Human Feedback (RLHF) is a training technique that uses human preferences to guide model behavior. Instead of optimizing a hand-crafted reward function, RLHF learns what humans consider good output by asking them to compare and rank model responses.
The Three Stages of RLHF
Stage 1: Supervised Fine-Tuning (SFT)
Start with a pre-trained language model and fine-tune it on high-quality demonstrations of desired behavior. Human annotators write ideal responses to a variety of prompts, creating a supervised dataset.
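Conceptually, this stage minimizes token-level cross-entropy on the annotator-written responses. A toy sketch in plain Python (the probabilities are hypothetical model outputs, not from any real model):

```python
import math

def sft_loss(token_probs):
    """Average negative log-likelihood of the demonstration tokens.

    token_probs: probability the model assigns to each correct
    next token in the human-written response.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A model that assigns high probability to the demonstration tokens
# has low loss; fine-tuning pushes this loss down.
confident = sft_loss([0.9, 0.8, 0.95])
uncertain = sft_loss([0.2, 0.1, 0.3])
```

In practice this is ordinary supervised fine-tuning; only the data (curated demonstrations of desired behavior) distinguishes it from pre-training.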
Stage 2: Reward Model Training
Generate multiple responses from the SFT model for each prompt. Human annotators rank the responses from best to worst. Train a separate "reward model" to predict human preferences based on these rankings.
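The rankings are typically decomposed into pairwise comparisons, and the reward model is trained with a Bradley-Terry-style loss on each pair. A minimal sketch in plain Python (the scalar scores stand in for reward-model outputs):

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when the ordering
    is reversed.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

agree = bradley_terry_loss(2.0, -1.0)     # matches the human ranking: small loss
disagree = bradley_terry_loss(-1.0, 2.0)  # contradicts it: large loss
```

Minimizing this loss over many pairs teaches the reward model to assign higher scalar scores to responses humans prefer.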
Stage 3: RL Optimization (PPO)
Use the reward model as a scoring function and optimize the language model using Proximal Policy Optimization (PPO). The model learns to generate responses that the reward model scores highly, while a KL-divergence penalty prevents it from drifting too far from the SFT model.
```python
# Simplified RLHF pipeline (pseudocode)

# Stage 1: Supervised Fine-Tuning
sft_model = pretrained_model.fine_tune(demonstration_data)

# Stage 2: Reward Model
# For each prompt, generate multiple responses.
# Humans rank them, e.g. Response A > Response C > Response B.
reward_model = train_reward_model(human_preferences)

# Stage 3: RL Optimization
policy = sft_model.copy()  # the model being optimized
for batch in training_data:
    responses = policy.generate(batch.prompts)
    rewards = reward_model.score(batch.prompts, responses)
    kl_penalty = kl_divergence(policy, sft_model, responses)
    total_reward = rewards - beta * kl_penalty
    policy.update(total_reward)  # PPO update
```
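The KL penalty in Stage 3 is commonly estimated from the log-probabilities that the policy and the frozen SFT reference assign to the sampled tokens. A minimal sketch in plain Python (the log-probability lists and the `beta` value are illustrative assumptions):

```python
import math

def kl_shaped_reward(reward, logprobs_policy, logprobs_ref, beta=0.1):
    """Reward-model score minus a per-sample KL penalty.

    A sampled-token estimate of KL(policy || reference) is the mean of
    log pi(token) - log pi_ref(token) over the generated tokens.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy,
                                                logprobs_ref))
    kl_estimate /= len(logprobs_policy)
    return reward - beta * kl_estimate

# The further the policy drifts from the SFT reference on its own
# samples, the larger the penalty subtracted from the reward.
close = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.1, -2.1])
drifted = kl_shaped_reward(1.0, [-0.1, -0.2], [-3.0, -4.0])
```

Tuning `beta` trades off reward maximization against staying close to the SFT model's distribution.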
Strengths of RLHF
- Captures nuance: Human preferences encode complex, context-dependent values that are hard to specify explicitly
- Reduces harmful outputs: Models trained with RLHF produce significantly fewer toxic, biased, or dangerous responses
- Improves helpfulness: RLHF models are better at following instructions and providing useful, relevant answers
- Scalable: Once trained on an initial batch of human labels, the reward model can supply training signal for billions of examples without further annotation
Limitations of RLHF
| Limitation | Description |
|---|---|
| Reward hacking | The model may learn to exploit flaws in the reward model rather than genuinely improving quality |
| Annotator disagreement | Different humans have different preferences; the model may learn an inconsistent average |
| Sycophancy | Models learn to tell users what they want to hear rather than providing accurate information |
| Costly labeling | High-quality human preference data is expensive and time-consuming to collect |
| Mode collapse | Excessive RLHF training can reduce the diversity and creativity of model outputs |
Alternatives to RLHF
DPO (Direct Preference Optimization)
Eliminates the need for a separate reward model by directly optimizing the language model using preference pairs. Simpler, more stable, and computationally cheaper than PPO-based RLHF.
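The DPO loss operates directly on log-probabilities from the policy and a frozen reference model, with no reward model in the loop. A minimal per-pair sketch in plain Python (the log-probability arguments are hypothetical summed log-probs of whole responses):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective for one preference pair.

    Pushes the policy's log-probability margin on the chosen response
    (relative to the reference model) above its margin on the rejected
    response; no reward model and no RL loop are needed.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization, when the policy equals the reference, the margin is zero and the loss is log 2; gradient descent then increases the relative preference for chosen responses.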
RLAIF (RL from AI Feedback)
Uses a more capable AI model to generate preference labels instead of humans. Dramatically reduces cost while maintaining quality for many tasks.
Constitutional AI
Defines a set of principles and trains the model to self-critique and revise responses. Combines RLAIF with explicit value specification.
ORPO / SimPO
Newer techniques that further simplify preference optimization, combining SFT and preference learning into a single training stage.
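As one example of this simplification, SimPO drops the reference model entirely and uses length-normalized log-probability as an implicit reward with a target margin. A hedged sketch in plain Python (the `beta` and `gamma` defaults are illustrative, not the paper's tuned values):

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO-style loss: length-normalized and reference-free.

    Uses average per-token log-probability as the implicit reward and
    requires the chosen response to beat the rejected one by at least
    a margin gamma; no reference model is kept in memory.
    """
    margin = (beta * logp_chosen / len_chosen
              - beta * logp_rejected / len_rejected
              - gamma)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Dropping the reference model roughly halves the memory footprint of preference training compared with DPO.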