Practice Questions & Tips
20 rapid-fire questions to test your knowledge under pressure, whiteboard drawing tips, common mistakes to avoid, and a comprehensive FAQ. Use this as your final review before the interview.
20 Rapid-Fire Questions
Practice answering each in 30-60 seconds. These are the questions interviewers use to quickly assess your depth of knowledge.
Whiteboard Drawing Tips
Strong whiteboard explanations follow a predictable structure. Practice these patterns until they are automatic:
Start With the Data Flow
Always draw input at the top, output at the bottom. Draw the data flowing downward through processing blocks. Label every arrow with the tensor shape: (batch, seq_len, d_model) or (B, C, H, W). Interviewers love seeing that you understand shapes.
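The same habit carries over to code: annotate every tensor with its shape as it flows through a block. A minimal PyTorch sketch (the dimensions and the single linear layer are arbitrary, for illustration only):

```python
import torch
import torch.nn as nn

B, seq_len, d_model = 2, 16, 64           # (batch, seq_len, d_model)
x = torch.randn(B, seq_len, d_model)      # input at the "top" of the diagram

proj = nn.Linear(d_model, d_model)        # one processing block
h = proj(x)                               # (B, seq_len, d_model) -> (B, seq_len, d_model)

print(x.shape)   # torch.Size([2, 16, 64])
print(h.shape)   # torch.Size([2, 16, 64])
```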
Use Boxes for Layers
Draw each major operation as a labeled rectangle: [Conv 3x3, 64], [MultiHead Attn], [LayerNorm], [FFN]. Use color or shading to distinguish different types of operations if possible.
Show Skip Connections Clearly
Draw skip connections as curved arrows on the side that bypass the main path. Label them with "+" to show addition. These are critical for ResNet, Transformer, and U-Net explanations.
Write Key Equations Nearby
After the diagram, write the 1-2 most important equations next to the relevant component. For attention: softmax(QK^T/sqrt(d_k))V. For ResNet: y = F(x) + x. Keep it concise — do not derive from scratch unless asked.
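Both equations map almost line-for-line to code, which is worth internalizing. A minimal sketch (tensor sizes are illustrative; no masking for brevity):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

x = torch.randn(2, 8, 32)                       # (batch, seq_len, d_k)
out = scaled_dot_product_attention(x, x, x)     # self-attention: Q = K = V = x
y = out + x                                     # ResNet-style residual: y = F(x) + x
print(y.shape)  # torch.Size([2, 8, 32])
```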
Practice These Architectures
You should be able to draw from memory: a Transformer block (attention + FFN + residuals + norms), a ResNet bottleneck block, an LSTM cell with gates, a U-Net, and a GAN (generator + discriminator with gradient flow).
End With Trade-offs
After explaining an architecture, proactively mention 1-2 trade-offs or alternatives. "This uses O(n^2) attention; for longer sequences we might use sparse attention or Mamba." This demonstrates senior-level thinking.
Common Mistakes to Avoid
- Not asking clarifying questions. When asked "design a model for X," ask about data size, latency requirements, available compute, and success metrics before proposing an architecture.
- Saying "I would use a Transformer" without explaining why. Always compare at least one alternative and explain your reasoning for choosing the Transformer (or any other architecture).
- Forgetting about data. The question may focus on architecture, but if you never mention data preprocessing, augmentation, or potential data issues, you miss where most real ML effort goes. In practice, data work dominates model work.
- Writing pseudocode instead of real code. In coding questions, write actual PyTorch: import statements, correct tensor operations, correct shapes. Test with a print(output.shape) at the end.
- Not knowing what you do not know. Saying "I'm not sure about the exact formulation, but conceptually it works like..." is much better than guessing wrong with confidence.
- Ignoring deployment and efficiency. Mentioning model size, inference latency, and serving cost shows you think beyond just training accuracy.
- Memorizing without understanding. If you cannot explain WHY something works (not just WHAT it does), the interviewer will probe and find the gap.
- Not connecting topics. The best candidates link ideas across areas: "Residual connections in ResNet solve the same vanishing gradient problem that LSTM gates address, but for feedforward networks."
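As an example of the "real code, not pseudocode" point above: a complete, runnable snippet with imports and a final shape check. The model and sizes here are made up for illustration:

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """A small MLP classifier; real code with correct imports and shapes."""
    def __init__(self, d_in=32, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier()
output = model(torch.randn(4, 32))   # batch of 4, feature dim 32
print(output.shape)                  # torch.Size([4, 10])
```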
Frequently Asked Questions
How much math do I need to know?
You need comfortable familiarity with: linear algebra (matrix multiplication, eigenvalues, vector norms), calculus (chain rule, partial derivatives, gradients), probability (Bayes' theorem, distributions, KL divergence), and basic optimization (gradient descent, convexity). You do not need to derive everything from scratch, but you should be able to explain the intuition behind formulas like the attention equation, cross-entropy loss, and the reparameterization trick. For senior roles at AI research labs, expect deeper math; for applied ML roles, focus on intuition and implementation.
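Of the formulas named above, the reparameterization trick is the one most often asked in code form. A minimal sketch, assuming the standard VAE formulation z = mu + sigma * eps, where sampling the noise outside the graph keeps gradients flowing through mu and log_var:

```python
import torch

# Sample z ~ N(mu, sigma^2) differentiably (VAE reparameterization trick).
mu = torch.zeros(4, 8, requires_grad=True)
log_var = torch.zeros(4, 8, requires_grad=True)

eps = torch.randn_like(mu)               # noise sampled outside the graph
z = mu + torch.exp(0.5 * log_var) * eps  # z = mu + sigma * eps

z.sum().backward()                       # gradients reach mu and log_var
print(mu.grad.shape)  # torch.Size([4, 8])
```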
Should I use built-in PyTorch modules or implement everything from scratch?
Unless the interviewer specifically asks you to implement from scratch (e.g., "implement multi-head attention without using nn.MultiheadAttention"), use built-in PyTorch modules. Show that you know the API. However, be prepared to explain what happens inside any module you use. A good rule: use nn.Linear, nn.Conv2d, etc. but be ready to write the forward pass manually if asked. The key implementations to practice from scratch: multi-head attention, a training loop, a ResNet block, an LSTM cell, and a basic GAN.
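A from-scratch sketch of the first item on that list, multi-head self-attention. This is one reasonable formulation (no masking or dropout, for brevity), not the only correct one:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention without nn.MultiheadAttention."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                            # x: (B, T, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, d_model) -> (B, n_heads, T, d_head)
        def split(t):
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = torch.softmax(scores, dim=-1)         # (B, n_heads, T, T)
        ctx = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(ctx)                         # (B, T, d_model)

mha = MultiHeadAttention(d_model=64, n_heads=8)
y = mha(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```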
What if I am asked about a paper I have not read?
Be honest: "I have not read that specific paper, but based on the name/context, I believe it addresses [problem]. Here is how I would approach that problem..." Then describe a reasonable approach. Interviewers test your thinking process more than your paper knowledge. Key papers to actually read: Attention Is All You Need (Transformer), ResNet, BERT, GPT-2/3, LoRA, and one diffusion model paper (DDPM or Stable Diffusion). Knowing these six papers deeply covers most questions.
What if I get a question on a topic I have not studied?
Use the framework: 1) Identify what type of problem it is (classification, generation, sequence modeling, etc.). 2) Connect it to something you know ("This is similar to [known concept] because..."). 3) Reason from first principles. 4) Be honest about the limits of your knowledge. Topics that occasionally appear but are not covered here: graph neural networks, reinforcement learning for fine-tuning (RLHF/DPO), neural architecture search, meta-learning, and federated learning. A surface-level understanding of each is sufficient for most interviews.
How long should my answers be?
Rapid-fire questions: 30-60 seconds. Give a concise, accurate answer. Deep-dive questions: 3-5 minutes. Start with a high-level answer (30 seconds), then go into details. Check in: "Would you like me to go deeper on any part?" Coding questions: narrate as you code. Explain your approach for 1-2 minutes, then code for 15-25 minutes. The biggest mistake is rambling, so give a structured answer and let the interviewer steer the depth.
How much do I need to know about large language models?
You should know the general architecture (decoder-only Transformer), key training techniques (RLHF, instruction tuning, scaling laws), and high-level capabilities. You do not need to know proprietary details. What matters more: understanding WHY these models work (Transformer architecture, pre-training on web data, emergent capabilities at scale) rather than specific benchmark numbers. For AI startup interviews, deeper knowledge of recent techniques (DPO, constitutional AI, mixture of experts) is expected.
How do I approach ML system design questions?
Use a structured framework: 1) Clarify requirements (data, scale, latency, metrics). 2) Data strategy (where does data come from, how much, any issues like class imbalance). 3) Model choice with justification (compare 2-3 options). 4) Training plan (loss function, optimizer, augmentation). 5) Evaluation (offline metrics + online metrics like A/B test). 6) Deployment considerations (latency, model size, monitoring). Practice with: "Design a hate speech detector for a social media platform" or "Build a product recommendation system for an e-commerce site."
Which framework should I use?
PyTorch is the overwhelming default for DL interviews in 2025. It is used in most research, at most AI companies, and is the expected framework unless told otherwise. Google teams may use JAX/Flax. If the job posting mentions TensorFlow, prepare for that. But if no framework is specified, always default to PyTorch. Key APIs to know cold: nn.Module, forward(), nn.Linear, nn.Conv2d, nn.LSTM, nn.MultiheadAttention, optim.AdamW, DataLoader, and the train/eval pattern.
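The train/eval pattern with those APIs, as a minimal sketch. Random tensors stand in for a real dataset, and the model is deliberately tiny:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 64 samples, 16 features, 3 classes.
X, y = torch.randn(64, 16), torch.randint(0, 3, (64,))
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()                      # training mode (dropout/batchnorm active)
for xb, yb in loader:
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()

model.eval()                       # eval mode
with torch.no_grad():              # no graph needed for inference
    preds = model(X).argmax(dim=-1)
print(preds.shape)  # torch.Size([64])
```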
Your Final Checklist
Before your interview, make sure you can:
- Explain why non-linear activations are necessary (and compare ReLU, GELU, Swish)
- Calculate CNN output dimensions given input, kernel, stride, and padding
- Draw an LSTM cell with all four gates and explain the cell state highway
- Implement multi-head attention from scratch in PyTorch (the number one coding question)
- Explain scaling by sqrt(d_k), KV-cache, and why Transformers replaced RNNs
- Compare BERT and GPT (encoder vs decoder, bidirectional vs causal)
- Write a complete training loop with model.eval(), torch.no_grad(), gradient clipping, and LR scheduling
- Explain mixed precision training, gradient accumulation, and LoRA
- Compare GANs, VAEs, and diffusion models with trade-offs for each
- Describe what FID measures and how classifier-free guidance works
- Debug a training run: NaN loss, overfitting, underfitting, dying ReLU
- Draw any architecture on a whiteboard: big picture first, zoom in, add math, discuss trade-offs
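For the CNN output-dimension item in the checklist, the standard formula is out = floor((in + 2*pad - kernel) / stride) + 1. A quick sketch with a cross-check against nn.Conv2d:

```python
import torch
import torch.nn as nn

def conv_out(size, kernel, stride=1, pad=0):
    # out = floor((in + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

print(conv_out(32, kernel=3, stride=1, pad=1))  # 32 ("same" padding)
print(conv_out(32, kernel=3, stride=2, pad=1))  # 16 (downsampling by 2)

# Cross-check against an actual convolution layer
conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
print(conv(torch.randn(1, 3, 32, 32)).shape)    # torch.Size([1, 8, 16, 16])
```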
Lilly Tech Systems