Intermediate

Language Models Interview Questions

These 15 questions cover the transformer-based language models that dominate modern NLP. Expect 3–5 of them in any NLP interview. Know the architectures, training objectives, and when to use each model family.

Q1: Explain the transformer architecture. What are the key components?

💡
Model Answer:

The transformer (Vaswani et al., 2017) replaced RNNs/LSTMs with a fully attention-based architecture. Key components:

  1. Multi-Head Self-Attention: Each token attends to every other token. Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V. Multiple heads (e.g., 12 in BERT-base, 96 in GPT-3) let the model attend to different relationship types simultaneously.
  2. Feed-Forward Network (FFN): Two linear layers with a nonlinearity (ReLU or GELU) in between, applied independently to each position. The inner dimension is typically 4x the hidden dimension (768 → 3072 in BERT-base).
  3. Layer Normalization: Applied before (pre-norm, modern) or after (post-norm, original) each sublayer. Stabilizes training.
  4. Residual Connections: output = LayerNorm(x + Sublayer(x)). Enables gradient flow through deep networks.
  5. Positional Encoding: Sinusoidal (original) or learned (BERT). Without this, the model has no notion of token order since attention is permutation-invariant.

Why it works: Self-attention has O(1) path length between any two tokens (vs O(n) for RNNs), enabling better long-range dependency learning. The tradeoff is O(n^2) memory for the attention matrix.
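
The FFN, layer-norm, and residual components above (items 2–4) can be sketched in a few lines of NumPy. Dimensions, weights, and the choice of ReLU are toy assumptions for illustration, not taken from any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256          # toy dimensions (BERT-base would be 768/3072)

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, w1, w2):
    # Two linear layers with a nonlinearity (ReLU here for brevity)
    return np.maximum(x @ w1, 0.0) @ w2

w1 = rng.normal(0, 0.02, (d_model, d_ff))
w2 = rng.normal(0, 0.02, (d_ff, d_model))
x = rng.normal(size=(10, d_model))   # 10 token positions

# Pre-norm residual: x + Sublayer(LayerNorm(x))
out = x + ffn(layer_norm(x), w1, w2)
print(out.shape)  # (10, 64)
```

Note the pre-norm placement (LayerNorm inside the residual branch), which is the modern convention mentioned in item 3.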

Q2: What is the difference between encoder-only, decoder-only, and encoder-decoder models?

💡
Model Answer:
| Architecture | Examples | Attention Type | Best For |
| --- | --- | --- | --- |
| Encoder-only | BERT, RoBERTa, DeBERTa | Bidirectional (sees all tokens) | Classification, NER, extractive QA, sentence embeddings |
| Decoder-only | GPT-2/3/4, LLaMA, Mistral | Causal (sees only previous tokens) | Text generation, completion, chat, general-purpose LLMs |
| Encoder-decoder | T5, BART, mBART | Bidirectional encoder + causal decoder with cross-attention | Translation, summarization, seq2seq tasks |

Key insight: Decoder-only models have dominated since GPT-3 because they scale better and can be prompted for any task. Encoder-only models are still preferred when you need bidirectional representations (e.g., semantic search, classification) because they see the full context in both directions.

Why decoder-only won for LLMs: Training is simpler (next token prediction), scales predictably with compute (scaling laws), and a single model can do many tasks via prompting without task-specific heads.

Q3: Explain masked language modeling (MLM) vs causal language modeling (CLM).

💡
Model Answer:

Masked Language Modeling (BERT):

  • Randomly mask 15% of input tokens. Of masked tokens: 80% replaced with [MASK], 10% with a random token, 10% kept unchanged.
  • Model predicts the original token at each masked position using bidirectional context.
  • Loss: cross-entropy only on masked positions (not all tokens).
  • Advantage: Bidirectional context produces better representations for understanding tasks.
  • Disadvantage: Cannot generate text autoregressively. The [MASK] token creates a train/test mismatch.
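
The 80/10/10 masking rule above can be sketched as follows (toy vocabulary and token list; the helper name and seed are assumptions for illustration):

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)  # None = no loss here
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                    # loss computed only at this position
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
            # else: 10% keep the original token unchanged
    return inputs, labels

tokens = "the cat sat on the mat".split()
inputs, labels = mlm_mask(tokens, vocab=["dog", "ran", "blue"])
```

The 10% random / 10% unchanged cases exist so the model cannot rely on [MASK] always marking the prediction targets, softening the train/test mismatch noted above.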

Causal Language Modeling (GPT):

  • Predict the next token given all previous tokens. No masking — just left-to-right prediction.
  • Loss: cross-entropy on every token position (more training signal per example).
  • Advantage: Natural fit for generation. No train/test mismatch. Scales better.
  • Disadvantage: Unidirectional — cannot see future tokens, which hurts understanding tasks.

T5's approach: Uses a "span corruption" objective — randomly masks contiguous spans and the decoder generates the missing spans. Combines benefits of both approaches for seq2seq tasks.

Q4: How does BERT work? Explain pre-training and fine-tuning.

💡
Model Answer:

BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2019):

Pre-training (unsupervised, on large corpus):

  1. MLM: Mask 15% of tokens, predict originals using bidirectional context
  2. NSP (Next Sentence Prediction): Given two sentences, predict if sentence B follows sentence A. (Later shown to be unnecessary — RoBERTa removes it.)

Architecture: BERT-base: 12 layers, 768 hidden, 12 heads, 110M params. BERT-large: 24 layers, 1024 hidden, 16 heads, 340M params.

Fine-tuning (supervised, on task-specific data):

  • Classification: Add a linear layer on top of the [CLS] token embedding. Fine-tune all parameters with a small learning rate (2e-5) for 3–5 epochs.
  • NER: Add a linear layer on each token's output. Predict entity tags per token.
  • QA: Add two linear layers to predict start and end positions of the answer span.
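
The classification setup above amounts to a single linear layer plus softmax over the [CLS] embedding. A minimal sketch, with a random stand-in for the 768-dim BERT-base encoder output and an assumed 3-label task:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, num_labels = 768, 3

cls_embedding = rng.normal(size=(hidden,))      # encoder output at [CLS] (stand-in)
W = rng.normal(0, 0.02, (hidden, num_labels))   # new task-specific head
b = np.zeros(num_labels)

logits = cls_embedding @ W + b
probs = np.exp(logits - logits.max())           # stable softmax
probs /= probs.sum()
pred = int(np.argmax(probs))                    # predicted class index
```

During fine-tuning, gradients flow through this head into all of BERT's parameters; only the head is newly initialized.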

Why BERT matters: It proved that pre-training on unlabeled text + fine-tuning on small labeled datasets outperforms training from scratch. This transfer learning paradigm now dominates NLP.

Q5: What improvements does RoBERTa make over BERT?

💡
Model Answer:

RoBERTa (Liu et al., 2019) showed that BERT was significantly undertrained. Key changes:

  1. Remove NSP: Next Sentence Prediction hurt performance. Use single sentences or full documents instead of sentence pairs.
  2. Dynamic masking: BERT used static masks (same masks every epoch). RoBERTa generates new random masks each time a sequence is fed to the model.
  3. More data: Trained on 160GB of text (vs BERT's 16GB). Added CC-News, OpenWebText, Stories datasets.
  4. Larger batches: 8K batch size (vs BERT's 256). Larger batches improve optimization for MLM.
  5. Longer training: 500K steps at the larger batch size — far more total tokens than BERT's 1M steps at batch size 256.
  6. BPE tokenizer: Uses byte-level BPE (50K vocab) instead of WordPiece (30K vocab).

Result: Same architecture as BERT, but consistently outperforms it on all benchmarks. The lesson: training methodology and data matter as much as architecture.

Q6: How does GPT differ from BERT architecturally?

💡
Model Answer:
| Aspect | BERT | GPT |
| --- | --- | --- |
| Architecture | Encoder-only (bidirectional) | Decoder-only (causal/autoregressive) |
| Attention mask | Full attention (every token sees every other) | Causal mask (each token only sees previous tokens) |
| Pre-training | Masked LM + NSP | Next token prediction (causal LM) |
| Generation | Cannot generate text natively | Generates text autoregressively |
| Fine-tuning | Add task-specific head, fine-tune all params | Few-shot prompting or instruction tuning |
| Best for | Understanding (classification, NER, search) | Generation (chat, completion, reasoning) |

Scaling trajectory: GPT-1 (117M) to GPT-2 (1.5B) to GPT-3 (175B) to GPT-4 (estimated ~1.8T MoE). Each jump showed emergent capabilities that smaller models lacked (in-context learning, chain-of-thought reasoning).

Q7: What is T5 and how does it unify NLP tasks?

💡
Model Answer:

T5 (Text-to-Text Transfer Transformer, Raffel et al., 2020) frames every NLP task as a text-to-text problem:

  • Classification: Input: "classify: This movie is great" → Output: "positive"
  • Translation: Input: "translate English to French: Hello" → Output: "Bonjour"
  • Summarization: Input: "summarize: [long text]" → Output: "[summary]"
  • QA: Input: "question: Who won? context: [text]" → Output: "Team A"

Architecture: Encoder-decoder transformer. The encoder processes the input bidirectionally, and the decoder generates the output autoregressively with cross-attention to the encoder.

Pre-training: Span corruption objective — randomly mask contiguous spans (average length 3 tokens) and train the decoder to generate the missing spans with sentinel tokens.
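
The span-corruption objective can be illustrated with the sentence from the T5 paper. The sentinel naming (`<extra_id_0>`, …) follows T5's convention; the span positions are hard-coded here rather than sampled, purely for illustration:

```python
def span_corrupt(tokens, spans):
    """spans: non-overlapping (start, end) index ranges to mask."""
    inp, tgt, sid = [], [], 0
    i = 0
    while i < len(tokens):
        span = next((s for s in spans if s[0] == i), None)
        if span:
            inp.append(f"<extra_id_{sid}>")          # sentinel replaces the span
            tgt.append(f"<extra_id_{sid}>")          # target echoes sentinel...
            tgt.extend(tokens[span[0]:span[1]])      # ...then the hidden tokens
            sid += 1
            i = span[1]
        else:
            inp.append(tokens[i])
            i += 1
    tgt.append(f"<extra_id_{sid}>")                  # final sentinel ends the target
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, spans=[(2, 4), (7, 9)])
# inp: "Thank you <extra_id_0> me to your <extra_id_1> week"
# tgt: "<extra_id_0> for inviting <extra_id_1> party last <extra_id_2>"
```

The encoder sees `inp`; the decoder is trained to generate `tgt` autoregressively.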

Why it matters: A single model architecture, loss function, and hyperparameter set works for all tasks. This simplification influenced later models and made multi-task learning straightforward. Flan-T5 added instruction tuning on 1,800+ tasks, making it one of the best open-source models for its size.

Q8: Explain the self-attention mechanism step by step. Why scale by sqrt(d_k)?

💡
Model Answer:

Step-by-step computation for a single head:

  1. Project input X into three matrices: Q = X * W_Q, K = X * W_K, V = X * W_V (each projection is a learned linear layer)
  2. Compute attention scores: scores = Q * K^T (dot product of each query with all keys, shape: [seq_len, seq_len])
  3. Scale: scores = scores / sqrt(d_k), where d_k is the dimension of keys
  4. Apply mask (if causal): set future positions to -infinity
  5. Apply softmax: attention_weights = softmax(scores) (each row sums to 1)
  6. Compute output: output = attention_weights * V (weighted sum of value vectors)
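
The six steps above can be sketched for a single head. Shapes and weights are toy assumptions, not from any real model:

```python
import numpy as np

def attention(X, W_q, W_k, W_v, causal=False):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # step 1: projections
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # steps 2-3: scores + scale
    if causal:                                      # step 4: mask future positions
        n = scores.shape[0]
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # step 5: softmax (rows sum to 1)
    return weights @ V, weights                     # step 6: weighted sum of values

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_k)) for _ in range(3))
out, w = attention(X, W_q, W_k, W_v, causal=True)   # out: (5, 8), w: (5, 5)
```

With `causal=True`, the upper triangle of `w` is exactly zero — each token's output depends only on itself and earlier positions.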

Why scale by sqrt(d_k)? Without scaling, dot products grow proportionally with d_k (if Q and K elements are unit variance, their dot product has variance d_k). Large dot products push softmax into regions with extremely small gradients (saturation). Dividing by sqrt(d_k) keeps the variance at 1, ensuring softmax produces useful gradients.
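
The variance argument can be checked empirically (sample sizes and d_k here are arbitrary choices):

```python
import numpy as np

# For unit-variance q, k in R^d_k, var(q . k) grows like d_k;
# dividing by sqrt(d_k) restores variance ~ 1.
rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(100_000, d_k))
k = rng.normal(size=(100_000, d_k))

dots = (q * k).sum(axis=1)            # 100k sample dot products
print(dots.var())                     # ~ 64  (= d_k)
print((dots / np.sqrt(d_k)).var())    # ~ 1.0
```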

Multi-head: Run h parallel attention heads with smaller d_k = d_model / h, concatenate outputs, and project through W_O. This lets each head specialize in different attention patterns (syntactic, semantic, positional).

Q9: What are positional encodings? Compare sinusoidal, learned, and RoPE.

💡
Model Answer:

Transformers have no built-in notion of token order (attention is permutation-invariant). Positional encodings inject position information.

| Type | How It Works | Used By | Pros/Cons |
| --- | --- | --- | --- |
| Sinusoidal | Fixed sine/cosine functions of different frequencies added to embeddings | Original Transformer | No learned params; can theoretically extrapolate to longer sequences, but rarely does in practice |
| Learned | Trainable embedding table (one vector per position) | BERT, GPT-2 | Learns task-specific patterns, but hard-limited to training length (512 for BERT) |
| RoPE | Rotary Position Embedding: rotates Q and K vectors by position-dependent angles | LLaMA, Mistral, Qwen | Encodes relative positions naturally; works well with context-extension techniques (NTK-aware scaling, YaRN) |
| ALiBi | Adds a linear bias to attention scores based on distance between tokens | BLOOM, MPT | No extra parameters; good length extrapolation; simple to implement |

Key trend: Modern LLMs predominantly use RoPE because it enables effective context length extension via interpolation techniques, allowing models trained on 4K contexts to work at 128K+ with minimal fine-tuning.
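
The rotation at the heart of RoPE can be sketched as follows. The base of 10000 follows the common convention; the pairing of dimensions and all values here are illustrative assumptions:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive (even, odd) dimension pairs by position-dependent angles."""
    d = x.shape[-1]                              # must be even
    half = d // 2
    freqs = base ** (-np.arange(half) / half)    # one frequency per pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin         # standard 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)
q0, q3 = rope(q, 0), rope(q, 3)   # same vector, rotated for positions 0 and 3
```

Because rotations are applied to Q and K before their dot product, the attention score between positions m and n depends only on the offset n - m — this relative-position property is why RoPE extends to longer contexts so gracefully.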

Q10: What is fine-tuning? Compare full fine-tuning, feature extraction, and PEFT methods.

💡
Model Answer:

Fine-tuning adapts a pre-trained model to a specific downstream task. Three approaches:

Full Fine-Tuning

Update all model parameters on task-specific data. Best accuracy but requires storing a separate copy of the full model per task. For a 7B model: ~14 GB per task checkpoint in fp16 (~28 GB in fp32).

Feature Extraction (Frozen)

Freeze all pretrained weights. Only train a new classification head on the extracted features. Fast and cheap but lower accuracy because representations are not adapted to the task.

PEFT (Parameter-Efficient)

Freeze most parameters and train a small number of additional or modified parameters. Achieves 90–99% of full fine-tuning performance at a fraction of the cost. LoRA, QLoRA, adapters, prefix tuning.

Decision framework: Use full fine-tuning for small models (<1B params) or when you need maximum accuracy. Use PEFT for large models (7B+) or when you need multiple task-specific models efficiently. Use feature extraction for quick prototyping or when compute is extremely limited.

Q11: Explain LoRA. How does it work and why is it so popular?

💡
Model Answer:

LoRA (Low-Rank Adaptation, Hu et al., 2021) is the most widely used PEFT method. Core idea:

Instead of updating the full weight matrix W (d x d), LoRA freezes W and adds a low-rank decomposition: W' = W + BA, where B is (d x r) and A is (r x d), and r << d (typically r = 8, 16, or 64).

Why it works:

  • The weight updates during fine-tuning have low intrinsic rank — you do not need to update all d*d parameters
  • Only r * d * 2 parameters are trained per layer (e.g., for d=4096 and r=16: 131K vs 16.7M parameters per layer)
  • At inference, merge BA into W: no additional latency (W_merged = W + BA)
  • Multiple LoRA adapters can be swapped at runtime for different tasks on the same base model

Typical configuration: Apply LoRA to Q and V projection matrices in attention layers. Rank r=16, alpha=32. This trains ~0.1–1% of total parameters.
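
A minimal sketch of a LoRA-adapted linear layer, using the shapes above (frozen W of size d x d, B of size d x r, A of size r x d, scaled by alpha/r). Following the paper's convention, one factor is zero-initialized so the adapter starts as a no-op; dimensions and values are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16

W = rng.normal(0, 0.02, (d, d))   # frozen pretrained weight
B = rng.normal(0, 0.02, (d, r))   # trainable low-rank factor
A = np.zeros((r, d))              # trainable, zero-init -> BA = 0 at start

def lora_forward(x):              # x: (batch, d)
    return x @ (W + (B @ A) * (alpha / r))

x = rng.normal(size=(4, d))
assert np.allclose(lora_forward(x), x @ W)   # adapter starts inert

# At inference, merge the adapter into W: zero extra latency.
W_merged = W + (B @ A) * (alpha / r)
```

Only B and A (2 * d * r = 1,024 values here) are trained; W's d^2 = 4,096 values stay frozen — the same ratio argument as the 131K-vs-16.7M example above.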

QLoRA (Dettmers et al., 2023) extends LoRA by quantizing the base model to 4-bit (NF4 data type) while keeping LoRA adapters in 16-bit. This lets you fine-tune a 65B parameter model on a single 48GB GPU. Key innovations: 4-bit NormalFloat, double quantization, and paged optimizers.

Q12: What is the difference between pre-training, fine-tuning, and instruction tuning?

💡
Model Answer:
| Stage | Data | Objective | Result |
| --- | --- | --- | --- |
| Pre-training | Trillions of tokens (web crawl, books, code) | Next token prediction (CLM) or masked LM | Base model that understands language but is not helpful (it just completes text) |
| Supervised Fine-Tuning (SFT) | 10K–100K instruction-response pairs | Next token prediction on responses | Model follows instructions but may produce harmful or low-quality outputs |
| RLHF / DPO | Human preference data (chosen vs rejected pairs) | Maximize reward model score / preference alignment | Model produces helpful, harmless, and honest responses |

Instruction tuning specifically refers to SFT on a diverse set of tasks formatted as instructions. Flan (Google), InstructGPT (OpenAI), and Alpaca (Stanford) are examples. The key insight is that training on diverse instructions with consistent formatting teaches the model to follow novel instructions at inference time (zero-shot generalization).

The full pipeline for ChatGPT-like models: Pre-train (months, millions of dollars) → SFT (days) → RLHF (days) → Red-teaming & safety filters (ongoing).

Q13: What are scaling laws? Why do they matter?

💡
Model Answer:

Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) describe how model performance improves predictably as you increase model size (N), dataset size (D), and compute budget (C).

Key findings:

  • Loss follows a power law: L(N) ≈ (N_c / N)^alpha, where alpha ≈ 0.076 for language models
  • Chinchilla scaling (Hoffmann 2022): For a given compute budget, optimal performance comes from training a smaller model on more data. The rule of thumb is 20 tokens per parameter (a 7B model should see 140B tokens).
  • This overturned the GPT-3 approach of training very large models on relatively less data
  • LLaMA (7B trained on 1T tokens) matched GPT-3 (175B trained on 300B tokens) by following Chinchilla scaling
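
The 20-tokens-per-parameter rule turns into a back-of-envelope sizing formula: training compute is roughly C ≈ 6·N·D FLOPs (a standard approximation, assumed here), and with D = 20·N this gives N = sqrt(C / 120). A sketch:

```python
import math

def chinchilla_optimal(compute_flops):
    """Compute-optimal model and data size under C ~ 6*N*D and D = 20*N."""
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# e.g., a 1e24 FLOP training budget:
n, d = chinchilla_optimal(1e24)
print(f"{n/1e9:.0f}B params, {d/1e12:.1f}T tokens")  # 91B params, 1.8T tokens
```

The point for interviews: given a fixed budget, this formula tells you to shrink the model and grow the data rather than default to the largest model that fits.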

Why it matters for interviews: Shows you understand the economics and planning behind LLM development. When asked "how would you train a model for X?", you should consider the compute-optimal model size for your budget, not just pick the largest model possible.

Q14: What is the KV cache? Why is it important for inference?

💡
Model Answer:

During autoregressive generation, the model emits one token at a time. Without caching, each step reruns the full forward pass over the entire prefix, recomputing the K and V projections for the same tokens again and again — quadratically redundant work across n generated tokens.

KV cache solution: Cache the Key and Value projections for all previous tokens. When generating token n, compute Q, K, V only for the new token, append K and V to the cache, and attend over the cached matrices. Per-step projection cost drops from O(n * d^2) to O(d^2); the attention itself remains O(n) per new token.

Memory cost: Each layer stores K and V of shape [seq_len, n_heads * d_head]. For a 7B model with 32 layers, 32 heads, d_head=128: KV cache per token = 2 (K and V) * 32 layers * (32 * 128) dims * 2 bytes (fp16) = 512 KB. For a 4K context: ~2 GB of GPU memory just for the KV cache.
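
The arithmetic above, reproduced for the assumed 7B-class configuration:

```python
# KV-cache size for an assumed 7B-class config (32 layers, 32 heads, d_head=128)
layers, heads, d_head, bytes_fp16 = 32, 32, 128, 2

per_token = 2 * layers * heads * d_head * bytes_fp16   # K and V, all layers
print(per_token // 1024, "KB per token")               # 512 KB per token

context = 4096
print(per_token * context / 2**30, "GiB for 4K context")  # 2.0 GiB for 4K context

# GQA with 8 KV heads instead of 32 shrinks the cache by 4x:
gqa_per_token = per_token * 8 // heads
```

The same calculation explains why MQA/GQA matter: the cache scales with the number of KV heads, not query heads.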

Optimizations:

  • Multi-Query Attention (MQA): Share K and V across all heads. Reduces KV cache by n_heads times.
  • Grouped-Query Attention (GQA): Share K and V across groups of heads (e.g., 8 KV heads for 32 query heads). Used by LLaMA 2 70B, Mistral.
  • PagedAttention (vLLM): Manage KV cache in non-contiguous memory pages, reducing waste from padding and enabling efficient batch scheduling.

Q15: Compare DeBERTa, ELECTRA, and RoBERTa. When would you choose each?

💡
Model Answer:
| Model | Key Innovation | Best For |
| --- | --- | --- |
| RoBERTa | Better training recipe for BERT (more data, dynamic masking, no NSP, larger batches) | General-purpose encoder. Strong default choice when you need a BERT-like model. |
| DeBERTa | Disentangled attention (separates content and position) + enhanced mask decoder. Uses relative position encoding. | Best encoder-only model for NLU benchmarks. Especially strong on tasks requiring position-sensitive understanding. |
| ELECTRA | Replaced-token detection instead of MLM: a small generator corrupts tokens, and the discriminator predicts which tokens are real vs replaced. Trains on ALL tokens (not just 15%). | Most sample-efficient encoder. Achieves BERT-level performance with 1/4 the compute. Great for low-resource scenarios. |

Decision framework:

  • Default choice for production: DeBERTa-v3 (best accuracy)
  • Limited compute or data: ELECTRA (most efficient training)
  • Need extensive ecosystem/community support: RoBERTa (most widely used)
  • Need generation capability: None of these — use a decoder model

Key Takeaways

💡
  • Know the three architecture families (encoder-only, decoder-only, encoder-decoder) and when to use each
  • Be able to explain self-attention step by step, including the sqrt(d_k) scaling reason
  • Understand the pre-training → SFT → RLHF pipeline for modern LLMs
  • LoRA/QLoRA are must-know topics — explain the low-rank decomposition and why it works
  • Chinchilla scaling laws changed how the industry thinks about model training: smaller models + more data
  • KV cache, GQA, and PagedAttention are critical for production inference optimization