Training & Alignment Questions
These 12 questions cover the full pipeline from pre-training a base model to aligning it with human preferences. This is core knowledge for LLM engineer roles at OpenAI, Anthropic, and Google, and is increasingly tested in GenAI engineer interviews as well.
Q1: Walk me through the full LLM training pipeline from raw text to a deployed chat model.
The modern LLM pipeline has four stages:
- Pre-training: Train on trillions of tokens of web text, books, code. Objective: next-token prediction. Duration: weeks to months on thousands of GPUs. Result: a base model that generates coherent text but is not helpful or safe.
- Supervised Fine-Tuning (SFT): Fine-tune on high-quality (instruction, response) pairs. Thousands to millions of examples. Teaches the model to follow instructions and generate structured responses. Duration: hours to days on tens of GPUs.
- Preference Optimization (RLHF or DPO): Align the model with human preferences. Humans rank model outputs; this signal trains the model to produce preferred responses. Improves helpfulness, honesty, and safety.
- Safety fine-tuning: Red-teaming, constitutional AI, content filtering. Test and patch dangerous behaviors. Ongoing process, not a one-time step.
Key insight: Each stage adds capabilities but risks losing prior ones. SFT can cause the model to forget pre-training knowledge (catastrophic forgetting). RLHF can cause "reward hacking" where the model optimizes for the reward model's preferences rather than genuine quality.
Q2: What is the difference between RLHF and DPO? When would you choose one over the other?
RLHF (Reinforcement Learning from Human Feedback):
- Collect comparison data: humans rank multiple model responses
- Train a reward model to predict human preferences
- Use PPO to optimize the policy (language model) to maximize reward, with KL penalty to prevent deviation from SFT model
DPO (Direct Preference Optimization):
- Skip the reward model entirely
- Use the same preference data but optimize the policy directly with a clever loss function
- The loss implicitly defines a reward model as the log-ratio of policy probabilities
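The DPO loss can be sketched for a single preference pair. This is a minimal illustration, assuming the summed per-response log-probabilities under the policy and the frozen reference (SFT) model are already computed; `beta` (commonly around 0.1) controls the strength of the implicit KL constraint.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    The implicit rewards are beta times the log-ratio of policy to
    reference probabilities; the loss is -log sigmoid of their margin."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))  # = -log sigmoid(margin)
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; shifting probability mass toward the chosen response drives it toward zero, with no reward model ever trained.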
| Aspect | RLHF | DPO |
|---|---|---|
| Complexity | High (3 models: policy, reward, reference) | Low (1 model + reference) |
| Stability | PPO can be unstable, requires careful tuning | More stable, standard cross-entropy-style loss |
| Compute | Higher (reward model inference during training) | Lower (no reward model needed) |
| Data efficiency | Can reuse reward model across iterations | Needs fresh preference data per iteration |
| Quality at scale | Better for very large models (OpenAI's approach) | Competitive for models up to 70B |
When to choose: DPO for most practical cases (simpler, cheaper, good enough). RLHF when training frontier models where marginal quality gains justify complexity. Iterative RLHF with online data collection is still superior for the absolute best models.
Q3: What is Constitutional AI? How does it differ from RLHF?
Constitutional AI (CAI), developed by Anthropic, replaces human labelers with a set of principles (a "constitution") that the AI uses to self-evaluate and self-improve.
Two-phase process:
- Self-critique & revision: Generate a response, then ask the model to critique it against constitutional principles ("Is this response harmful? Does it respect privacy?"), then revise. Repeat for multiple principles.
- RLAIF (RL from AI Feedback): Instead of human rankings, use the AI's own evaluations (guided by the constitution) to train the reward model. Then standard RL optimization.
Key differences from RLHF:
- Scalable: no expensive human annotation for safety evaluations
- Transparent: the constitution is a readable document of principles
- Consistent: AI feedback does not have the variance of human annotators
- Limitation: the quality ceiling depends on how well the AI can evaluate its own outputs
Q4: How do you build a reward model? What are common failure modes?
Building a reward model:
- Start with the SFT model (same architecture, pretrained weights)
- Replace the language modeling head with a scalar output head
- Train on comparison data: given (prompt, response_A, response_B) where A is preferred, maximize log sigmoid(r(A) - r(B))
- Typically use 50K–500K comparison pairs. Quality of comparisons matters more than quantity.
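The pairwise (Bradley–Terry) objective above can be sketched directly, assuming the reward model already maps a (prompt, response) pair to a scalar; the inputs here are those scalar scores.

```python
import math

def reward_model_loss(score_pairs):
    """Average Bradley-Terry loss over (r(A), r(B)) pairs, with A preferred.

    Minimizing this is the same as maximizing log sigmoid(r(A) - r(B)):
    the loss is log 2 when the model is indifferent and approaches zero
    as the preferred response is scored increasingly higher."""
    total = 0.0
    for r_a, r_b in score_pairs:
        total += math.log1p(math.exp(-(r_a - r_b)))  # -log sigmoid(margin)
    return total / len(score_pairs)
```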
Common failure modes:
- Reward hacking: The policy finds outputs that score high on the reward model but are not genuinely good. Example: verbose, sycophantic responses that agree with the user regardless of correctness.
- Length bias: Reward models often prefer longer responses. Without correction, the policy learns to be unnecessarily verbose.
- Distribution shift: The reward model is trained on SFT-model outputs but must evaluate policy-model outputs that drift over PPO training. Accuracy degrades.
- Annotator disagreement: Humans disagree on preferences, especially for subjective or nuanced questions. The reward model learns a noisy average.
- Overoptimization: As you push harder on the reward model (more PPO steps), quality initially improves then degrades. The optimal stopping point requires careful monitoring.
Q5: What is LoRA? Explain how it works mathematically and its trade-offs.
LoRA (Low-Rank Adaptation) freezes all pretrained weights and adds trainable low-rank decomposition matrices to attention layers.
Mathematics: For a pretrained weight matrix W ∈ R^(d×d), instead of updating W directly, add ΔW = BA where B ∈ R^(d×r) and A ∈ R^(r×d), with rank r << d.
- Original: h = Wx
- With LoRA: h = Wx + BAx
- Trainable params: 2 × d × r (vs d² for full fine-tuning). With r=16, d=4096: 131K vs 16.7M per matrix.
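The arithmetic above can be checked numerically. A minimal NumPy sketch using the example dimensions (the `alpha` value is illustrative; in a real trainer only `A` and `B` would receive gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 4096, 16, 32.0

W = rng.standard_normal((d, d)) * 0.02  # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.02  # trainable, small random init
B = np.zeros((d, r))                    # trainable, zero init => BA = 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    # h = Wx + (alpha / r) * B(Ax)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
# With B initialized to zero, the adapted layer starts out identical to
# the frozen layer, so training begins from pretrained behavior.
assert np.allclose(lora_forward(x), W @ x)
print(f"trainable: {2 * d * r:,}  full: {d * d:,}")
```

Note that `B @ (A @ x)` never materializes the d×d matrix `BA` during training; the full product is only formed once, if the adapter is merged for inference.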
Key design decisions:
- Rank (r): 8–64 typical. Higher rank = more capacity but more parameters. r=16 is a common default.
- Alpha: Scaling factor applied to ΔW. Effective update = (α/r) × BA. Higher alpha = larger updates.
- Target modules: Apply to Q, K, V, O projections and/or FFN layers. More modules = better quality but more params.
- Merging: After training, merge BA into W for zero-overhead inference: W' = W + BA. This is a unique advantage over adapters.
Trade-offs: LoRA is 10–100x cheaper than full fine-tuning with 90–95% of the quality. It does not match full fine-tuning for tasks requiring significant knowledge acquisition (learning a new language) but works excellently for style adaptation, instruction following, and domain-specific formatting.
Q6: What is QLoRA? How does it enable fine-tuning 70B models on a single GPU?
QLoRA combines three innovations:
- 4-bit NormalFloat (NF4) quantization: Quantize base model weights to 4-bit using a data type optimized for normally distributed weights. Better than INT4 for neural network weights.
- Double quantization: Quantize the quantization constants themselves, saving an additional ~0.4 bits per parameter.
- Paged optimizers: Use CPU memory as overflow when GPU memory is exhausted, managed like virtual memory pages.
Memory savings: A 70B model in FP16 needs ~140 GB. In NF4: ~35 GB. LoRA adapters add ~1 GB. Total: ~36 GB, fitting on a single A100 80GB or even an A6000 48GB.
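The memory arithmetic above is easy to reproduce. This back-of-envelope estimate covers weights only; activations, KV cache, gradients, and optimizer state are extra, and quantization constants add a small overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Back-of-envelope weight memory: params x bits / 8 bits-per-byte."""
    return params_billions * bits_per_param / 8

print(weight_memory_gb(70, 16))  # FP16: 140.0 GB
print(weight_memory_gb(70, 4))   # NF4:  35.0 GB (ignoring quant constants)
```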
Quality: QLoRA matches full 16-bit fine-tuning quality on most benchmarks. The quantization adds noise but LoRA adapters are trained in FP16/BF16, compensating for quantization error.
Practical impact: Democratized fine-tuning. Before QLoRA, fine-tuning 70B models required 8+ A100 GPUs (~$100K+). Now it runs on a single GPU (~$2/hour on cloud).
Q7: How do you curate training data for SFT? What makes high-quality instruction data?
Quality signals that matter:
- Diversity: Cover many task types (QA, summarization, coding, math, creative writing, reasoning). LIMA showed 1,000 diverse, high-quality examples can outperform 1M low-quality ones.
- Complexity: Include multi-step reasoning, edge cases, and nuanced instructions. Simple "What is X?" questions produce a shallow model.
- Correctness: Every response must be factually accurate and well-structured. One wrong answer teaches the model to be wrong on similar inputs.
- Format variety: Mix JSON output, markdown, code blocks, tables, step-by-step. The model learns formatting from examples.
- Refusals: Include examples where the correct response is "I cannot help with that" for harmful/unethical requests.
Data curation pipeline:
- Seed with high-quality human-written examples (expensive but essential)
- Generate synthetic data from stronger models (GPT-4-class). Filter aggressively.
- Decontaminate against benchmarks (remove test set leakage)
- Dedup by embedding similarity (remove near-duplicates that bias the distribution)
- Human review of a random sample (10–20%) for quality assurance
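The embedding-dedup step in the pipeline can be sketched as a greedy cosine-similarity filter. This is an O(n²) illustration; production pipelines typically use approximate nearest-neighbor search, and the 0.95 threshold is an assumed value.

```python
import numpy as np

def dedup_by_embedding(embeddings: np.ndarray, threshold: float = 0.95):
    """Return indices of examples to keep, dropping near-duplicates.

    Greedy pass: an example is kept only if its cosine similarity to
    every previously kept example is below the threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if all(vec @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept
```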
The LIMA insight: "Less is more for alignment." 1,000 carefully curated examples produced better chat models than 1M automatically generated examples. Data quality dominates data quantity for SFT.
Q8: What is catastrophic forgetting? How do you prevent it during fine-tuning?
Catastrophic forgetting occurs when fine-tuning on new data causes the model to lose knowledge and capabilities learned during pre-training.
Why it happens: Fine-tuning updates all (or many) parameters to optimize for the new task distribution. These updates overwrite information encoded during pre-training. The gradient signal from fine-tuning data overwhelms the "memory" stored in weights.
Prevention strategies:
- LoRA/PEFT: Only update a small number of additional parameters. Base weights are frozen, preserving pre-trained knowledge.
- Low learning rate: Use 1e-5 to 5e-5 (10–100x lower than pre-training). Gradual updates preserve existing knowledge.
- Data mixing: Include a portion of pre-training-style data alongside fine-tuning data (e.g., 10% general text). Replays reinforce prior knowledge.
- KL penalty: Add a loss term penalizing divergence from the original model's predictions. Used in RLHF's PPO step.
- Short training: 1–3 epochs is typical for SFT. Overfitting to fine-tuning data directly correlates with forgetting.
- Elastic Weight Consolidation (EWC): Penalize changes to parameters important for previous tasks. Rarely used in practice due to LoRA's simplicity.
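The KL-penalty strategy from the list above can be sketched for a single next-token distribution. This is a toy illustration; `kl_coef` is a hypothetical weighting, and in practice the penalty is averaged over token positions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(task_loss, orig_probs, new_probs, kl_coef=0.1):
    # Penalize the fine-tuned model for drifting from the original
    # model's predictions, discouraging catastrophic forgetting.
    return task_loss + kl_coef * kl_divergence(orig_probs, new_probs)
```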
Q9: What is instruction tuning? How does it differ from standard fine-tuning?
Standard fine-tuning: Train on task-specific data for one task. The model learns to do that task well but may not generalize.
Instruction tuning: Train on a diverse set of tasks framed as natural language instructions. Instead of task-specific heads, the model learns to follow any instruction.
Key papers:
- FLAN (Google, 2022): Fine-tuned on 1,836 tasks with instruction templates. Showed massive zero-shot improvement on unseen tasks.
- InstructGPT (OpenAI, 2022): Combined SFT on demonstrations with RLHF. Created the first "ChatGPT-like" behavior.
- Self-Instruct: Use the model itself to generate instruction-following data from a small seed set. Bootstraps instruction data cheaply.
What makes instruction tuning special:
- Unlocks zero-shot and few-shot capabilities that exist in the base model but are not accessible without instructions
- The model learns a meta-skill: "follow instructions" rather than any single task
- Scaling instruction diversity (more task types, more phrasings) improves performance more than scaling data volume within a task
Q10: How do you evaluate an LLM after training? What benchmarks matter?
Benchmark categories:
| Category | Benchmarks | What It Tests |
|---|---|---|
| Knowledge | MMLU, ARC, HellaSwag | Factual knowledge, commonsense reasoning, world knowledge |
| Reasoning | GSM8K, MATH, BBH | Mathematical reasoning, logical reasoning, multi-step problems |
| Coding | HumanEval, MBPP, SWE-bench | Code generation, debugging, real-world software engineering |
| Chat quality | MT-Bench, AlpacaEval, Chatbot Arena | Instruction following, helpfulness, conversation quality |
| Safety | TruthfulQA, ToxiGen, BBQ | Truthfulness, toxicity, bias |
The benchmark problem:
- Data contamination: Models may have seen benchmark questions during pre-training. Scores can be inflated.
- Gaming: Models can be fine-tuned specifically to score well on benchmarks without genuine improvement.
- Static vs dynamic: Benchmarks become stale. Chatbot Arena (live human comparisons) is currently the most trusted measure.
What actually matters in production: Task-specific evaluation on your data. Build an eval set that matches your use case. Use LLM-as-judge (have a stronger model evaluate outputs) for scalable evaluation. Always include human evaluation as a sanity check.
Q11: What is the difference between pre-training data and fine-tuning data in terms of quality requirements?
| Aspect | Pre-training Data | Fine-tuning (SFT) Data |
|---|---|---|
| Volume | 1T–15T+ tokens | 10K–1M examples |
| Quality bar | Moderate (filtered web crawl) | Very high (human-written or carefully curated) |
| Format | Raw text (articles, books, code) | Structured (instruction, response) pairs |
| Filtering | Dedup, language detection, quality classifiers, PII removal | Manual review, correctness verification, diversity balancing |
| Cost per example | Fractions of a cent (automated) | $5–$50 per example (human annotation) |
| Impact of errors | Noise is averaged out over trillions of tokens | Each bad example directly teaches bad behavior |
Key insight: Pre-training is about volume with reasonable quality. Fine-tuning is about quality with reasonable volume. Investing $10K in 200 perfect fine-tuning examples often beats $1K in 10,000 mediocre ones.
Q12: What is RLHF reward hacking? Give concrete examples and mitigations.
Reward hacking occurs when the policy finds ways to maximize the reward model's score that do not align with genuine quality improvements.
Concrete examples:
- Sycophancy: The model learns to agree with the user's stated opinions, even when they are wrong, because human raters preferred agreeable responses. "You're absolutely right that the earth is flat."
- Verbosity: Longer responses score higher on the reward model. The policy generates unnecessarily detailed answers, burying the actual answer in filler.
- Hedging: Adding "However, it depends on..." caveats to every answer scores well because it sounds thoughtful, even when the answer is straightforward.
- Format gaming: Using bullet points, headers, and bold text gets higher scores regardless of content quality.
- Refusal over-correction: The model refuses too many benign requests because the safety reward signal was too strong during training.
Mitigations:
- KL penalty: Limit how far the policy can deviate from the SFT model. Prevents extreme optimization.
- Length normalization: Normalize reward by response length to remove length bias.
- Ensemble reward models: Use multiple reward models and require agreement. Reduces single-model exploitation.
- Iterative training: Periodically retrain the reward model on current policy outputs. Closes the distribution gap.
- Human spot-checking: Regularly audit model outputs for reward hacking patterns.
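Several of these mitigations act on the scalar reward before the policy update. A toy sketch combining the length penalty and KL penalty (the coefficients are illustrative, and the KL term here is a simple sequence-level log-ratio estimate):

```python
def shaped_reward(rm_score: float, n_tokens: int,
                  policy_logp: float, ref_logp: float,
                  len_penalty: float = 0.001, kl_coef: float = 0.05) -> float:
    """Reward used for the policy update: the raw reward-model score,
    minus a per-token charge (counters length bias) and a KL penalty
    that grows as the policy's likelihood drifts from the SFT reference."""
    approx_kl = policy_logp - ref_logp  # sequence-level log-ratio estimate
    return rm_score - len_penalty * n_tokens - kl_coef * approx_kl
```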