Advanced

Practice Questions & Tips

20 rapid-fire questions to test your knowledge under pressure, whiteboard drawing tips, common mistakes to avoid, and a comprehensive FAQ. Use this as your final review before the interview.

20 Rapid-Fire Questions

Practice answering each in 30-60 seconds. These are the questions interviewers use to quickly assess your depth of knowledge.

1. What does the softmax function do and when do you use it?
Converts a vector of real numbers into a probability distribution that sums to 1. Used as the final activation for multi-class classification. softmax(z_i) = exp(z_i) / sum(exp(z_j)). For binary classification, use sigmoid instead.
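The formula above can be checked in a few lines. A minimal sketch (the `softmax` helper and its max-subtraction trick are illustrative, not from the original text):

```python
import torch

def softmax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Subtract the max for numerical stability (exp of large values overflows);
    # this does not change the result because softmax is shift-invariant.
    z = z - z.max(dim=dim, keepdim=True).values
    exp_z = torch.exp(z)
    return exp_z / exp_z.sum(dim=dim, keepdim=True)

logits = torch.tensor([2.0, 1.0, 0.1])
probs = softmax(logits)  # a valid probability distribution: non-negative, sums to 1
```

In practice you would call `torch.softmax` directly; the max-subtraction shown here is the same trick the built-in uses internally.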
2. Why is ReLU preferred over sigmoid for hidden layers?
ReLU (max(0, x)) avoids the vanishing gradient problem — its gradient is 1 for positive inputs vs. sigmoid's max gradient of 0.25. ReLU is also computationally cheaper (no exp). Sigmoid squashes all values to [0,1], saturating for large/small inputs and causing near-zero gradients.
3. What is the difference between a parameter and a hyperparameter?
Parameters are learned during training (weights, biases). Hyperparameters are set before training and control the learning process (learning rate, batch size, number of layers, dropout rate). Hyperparameters are tuned via grid search, random search, or Bayesian optimization.
4. What is an embedding layer and why is it used?
An embedding layer maps discrete tokens (words, IDs) to dense continuous vectors. It is a learnable lookup table — equivalent to a one-hot encoding followed by a linear layer, but implemented as an index lookup for efficiency. Used for words, categories, user/item IDs.
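The "one-hot followed by a linear layer" equivalence is easy to verify. A small sketch (the vocabulary size and dimensions are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 10, 4
emb = nn.Embedding(vocab_size, d_model)
token_ids = torch.tensor([3, 7, 3])

# Path 1: index lookup, which is what nn.Embedding actually does
lookup = emb(token_ids)

# Path 2: one-hot vectors multiplied by the same weight matrix
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
matmul = one_hot @ emb.weight

# Both paths select the same rows of the weight matrix
```

The lookup version avoids materializing the sparse one-hot matrix, which matters when the vocabulary has tens of thousands of entries.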
5. What is the purpose of the bias term in a neural network?
The bias allows the activation function to be shifted left or right, independent of the input. Without bias, the output of a layer is always zero when all inputs are zero. Bias enables the model to fit data that does not pass through the origin.
6. What is the difference between epoch, batch, and iteration?
Epoch: one complete pass through the entire training dataset. Batch (mini-batch): a subset of the dataset processed together in one forward/backward pass. Iteration: one parameter update step = one batch processed. If dataset has 10,000 samples and batch_size=100, one epoch = 100 iterations.
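The 10,000-samples example can be confirmed with a `DataLoader` (the toy tensor is a placeholder for a real dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10,000 samples with batch_size=100 -> one epoch = 100 iterations
dataset = TensorDataset(torch.randn(10_000, 8))
loader = DataLoader(dataset, batch_size=100)

iterations_per_epoch = len(loader)  # number of parameter updates per epoch
```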
7. What is transfer learning in one sentence?
Using a model pre-trained on a large dataset (e.g., ImageNet, large text corpus) as a starting point for a different but related task, leveraging learned general features to achieve better performance with less data and training time.
8. Why does batch size affect training?
Larger batches give more accurate gradient estimates (less noise) but may converge to sharp minima that generalize poorly. Smaller batches have noisier gradients that act as regularization, often finding flatter minima. Batch size also affects learning rate choice — linear scaling rule: if you double batch size, double learning rate.
9. What is the difference between generative and discriminative models?
Discriminative models learn p(y|x) — the decision boundary between classes (logistic regression, most classifiers). Generative models learn p(x) or p(x|y) — the full data distribution (GANs, VAEs, GPT). Generative models can create new data; discriminative models can only classify existing data.
10. What is attention in one sentence?
Attention is a mechanism that computes a weighted sum of values where the weights are determined by the similarity (dot product) between a query and corresponding keys, allowing the model to dynamically focus on the most relevant parts of the input.
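That one-sentence definition maps directly onto a few lines of code. A minimal sketch of scaled dot-product attention (shapes and the helper name are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Similarity between each query and all keys, scaled by sqrt(d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ v, weights               # weighted sum of the values

q = torch.randn(2, 5, 16)  # (batch, seq_len, d_k)
k = torch.randn(2, 5, 16)
v = torch.randn(2, 5, 16)
out, attn = scaled_dot_product_attention(q, k, v)
```

This is the core that multi-head attention repeats per head; masking and projections are omitted here for brevity.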
11. What is gradient descent and how does it work?
An optimization algorithm that iteratively adjusts parameters in the direction of steepest descent of the loss function. Update rule: w = w - lr * dL/dw. "Stochastic" gradient descent uses a random subset (mini-batch) of data per step instead of the full dataset.
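The update rule `w = w - lr * dL/dw` written out by hand, on a toy loss chosen for illustration (minimizing `(w - 3)^2`, whose minimum is at `w = 3`):

```python
import torch

# Minimize L(w) = (w - 3)^2 with plain gradient descent
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1

for _ in range(100):
    loss = (w - 3.0) ** 2
    loss.backward()            # computes dL/dw into w.grad
    with torch.no_grad():
        w -= lr * w.grad       # the update rule: w = w - lr * dL/dw
    w.grad.zero_()             # clear the gradient for the next step
```

An `optim.SGD` optimizer does exactly these last two lines for every parameter; writing it manually once makes the abstraction transparent.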
12. What is a loss function and name three common ones?
A loss function measures how wrong the model's predictions are. 1) Cross-entropy loss: multi-class classification. 2) Binary cross-entropy: binary classification. 3) MSE (Mean Squared Error): regression. The choice depends on the task — cross-entropy for classification, MSE/MAE for regression.
13. What is the Universal Approximation Theorem?
A feedforward neural network with a single hidden layer containing a finite number of neurons and a non-linear activation function can approximate any continuous function on a compact subset of R^n to arbitrary precision. Note: it does not say how many neurons you need or that gradient descent will find the solution.
14. What is data leakage and how do you prevent it?
Data leakage occurs when information from the test/validation set influences training. Examples: normalizing before train/test split (statistics include test data), using future data to predict the past, target encoding with test data. Prevention: always split data first, then preprocess training set and apply the same transformation to test set.
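The "split first, then preprocess" rule as code. A minimal sketch with random data standing in for a real dataset:

```python
import torch

data = torch.randn(1000, 5)

# 1) Split FIRST
train, test = data[:800], data[800:]

# 2) Fit normalization statistics on the training split only
mean, std = train.mean(dim=0), train.std(dim=0)

# 3) Apply the SAME transformation to both splits;
#    the test set never influences the statistics
train_norm = (train - mean) / std
test_norm = (test - mean) / std
```

Computing `mean` and `std` on `data` before splitting would be the leakage bug the answer describes.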
15. Explain the concept of receptive field in CNNs.
The region of the input image that affects a particular neuron's output. Grows with network depth. Two stacked 3x3 convs have a 5x5 receptive field (same as one 5x5 conv but fewer parameters: 18 vs 25). Pooling and stride increase receptive field multiplicatively.
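The 18-vs-25 parameter count and the matching receptive field can both be checked directly (single-channel convolutions, chosen to make the per-kernel counts visible):

```python
import torch
import torch.nn as nn

# Per input/output channel pair: two stacked 3x3 kernels vs one 5x5 kernel
stacked = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
)
single = nn.Conv2d(1, 1, kernel_size=5, bias=False)

stacked_params = sum(p.numel() for p in stacked.parameters())  # 9 + 9 = 18
single_params = sum(p.numel() for p in single.parameters())    # 25

# Same effective receptive field: both shrink a 7x7 input to 3x3
x = torch.randn(1, 1, 7, 7)
```

The stacked version also inserts a non-linearity between the two convolutions in real networks (VGG-style), which the single 5x5 cannot.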
16. What is the difference between overfitting and underfitting?
Overfitting: model memorizes training data, low train loss but high validation loss. Fix: more data, regularization, smaller model. Underfitting: model cannot capture patterns, high train and validation loss. Fix: larger model, train longer, reduce regularization, better features.
17. Why do we normalize input data before feeding it to a neural network?
Normalization ensures all features are on a similar scale, preventing features with larger magnitudes from dominating. It helps gradient descent converge faster (more spherical loss landscape) and prevents numerical issues. Common: zero mean, unit variance (StandardScaler) or min-max to [0, 1].
18. What is model.eval() in PyTorch and why is it important?
model.eval() switches the model to evaluation mode, which disables dropout (all neurons active) and switches batch normalization to use running statistics instead of batch statistics. Forgetting model.eval() during inference is a common bug that produces noisy, incorrect predictions.
19. What is the difference between torch.no_grad() and model.eval()?
model.eval() changes layer behavior (disables dropout, changes batchnorm). torch.no_grad() disables gradient computation, saving memory and computation. Both are needed for inference: model.eval() for correct behavior, torch.no_grad() for efficiency. They serve different purposes and are not interchangeable.
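The standard inference pattern combines both. A minimal sketch with a throwaway model (the dropout layer is there to make the eval-mode effect observable):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5), nn.Linear(10, 2))
x = torch.randn(4, 10)

model.eval()               # dropout off, batchnorm uses running stats
with torch.no_grad():      # no autograd graph: less memory, faster
    logits = model(x)

# In eval mode dropout is the identity, so repeated passes are deterministic
with torch.no_grad():
    logits2 = model(x)
```

Dropping either piece is a real bug: without `model.eval()` the dropout mask makes outputs random; without `torch.no_grad()` every forward pass builds an unnecessary autograd graph.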
20. What is the difference between nn.CrossEntropyLoss and nn.NLLLoss in PyTorch?
nn.CrossEntropyLoss = log_softmax + NLLLoss combined. It expects raw logits (pre-softmax values). nn.NLLLoss expects log-probabilities (after log_softmax). Using CrossEntropyLoss is preferred because it is numerically more stable than doing softmax and log separately.
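The equivalence is worth verifying once so it sticks. A small sketch with random logits and targets:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(8, 5)           # raw, pre-softmax scores
targets = torch.randint(0, 5, (8,))

# One fused, numerically stable op
ce = nn.CrossEntropyLoss()(logits, targets)

# The same result in two explicit steps
log_probs = F.log_softmax(logits, dim=-1)
nll = nn.NLLLoss()(log_probs, targets)
```

A common interview trap follows from this: applying `softmax` in the model's `forward` and then using `nn.CrossEntropyLoss` double-applies softmax and silently degrades training.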

Whiteboard Drawing Tips

Strong whiteboard explanations follow a predictable structure. Practice these patterns until they are automatic:

Start With the Data Flow

Always draw input at the top, output at the bottom. Draw the data flowing downward through processing blocks. Label every arrow with the tensor shape: (batch, seq_len, d_model) or (B, C, H, W). Interviewers love seeing that you understand shapes.

Use Boxes for Layers

Draw each major operation as a labeled rectangle: [Conv 3x3, 64], [MultiHead Attn], [LayerNorm], [FFN]. Use color or shading to distinguish different types of operations if possible.

Show Skip Connections Clearly

Draw skip connections as curved arrows on the side that bypass the main path. Label them with "+" to show addition. These are critical for ResNet, Transformer, and U-Net explanations.

Write Key Equations Nearby

After the diagram, write the 1-2 most important equations next to the relevant component. For attention: softmax(QK^T/sqrt(d_k))V. For ResNet: y = F(x) + x. Keep it concise — do not derive from scratch unless asked.

Practice These Architectures

You should be able to draw from memory: a Transformer block (attention + FFN + residuals + norms), a ResNet bottleneck block, an LSTM cell with gates, a U-Net, and a GAN (generator + discriminator with gradient flow).

End With Trade-offs

After explaining an architecture, proactively mention 1-2 trade-offs or alternatives. "This uses O(n^2) attention; for longer sequences we might use sparse attention or Mamba." This demonstrates senior-level thinking.

Common Mistakes to Avoid

  1. Not asking clarifying questions. When asked "design a model for X," ask about data size, latency requirements, available compute, and success metrics before proposing an architecture.
  2. Saying "I would use a Transformer" without explaining why. Always compare at least one alternative and explain your reasoning for choosing the Transformer (or any other architecture).
  3. Forgetting about data. The interviewer asks about architecture but you never mention data preprocessing, augmentation, or potential data issues. Real ML is 80% data.
  4. Writing pseudocode instead of real code. In coding questions, write actual PyTorch: import statements, correct tensor operations, correct shapes. Test with a print(output.shape) at the end.
  5. Not knowing what you do not know. Saying "I'm not sure about the exact formulation, but conceptually it works like..." is much better than guessing wrong with confidence.
  6. Ignoring deployment and efficiency. Mentioning model size, inference latency, and serving cost shows you think beyond just training accuracy.
  7. Memorizing without understanding. If you cannot explain WHY something works (not just WHAT it does), the interviewer will probe and find the gap.
  8. Not connecting topics. The best candidates link ideas across areas: "Residual connections in ResNet solve the same vanishing gradient problem that LSTM gates address, but for feedforward networks."

Frequently Asked Questions

How much math do I need for a DL interview?

You need comfortable familiarity with: linear algebra (matrix multiplication, eigenvalues, vector norms), calculus (chain rule, partial derivatives, gradients), probability (Bayes' theorem, distributions, KL divergence), and basic optimization (gradient descent, convexity). You do not need to derive everything from scratch, but you should be able to explain the intuition behind formulas like the attention equation, cross-entropy loss, and the reparameterization trick. For senior roles at AI research labs, expect deeper math; for applied ML roles, focus on intuition and implementation.

Should I implement everything from scratch or use PyTorch built-ins?

Unless the interviewer specifically asks you to implement from scratch (e.g., "implement multi-head attention without using nn.MultiheadAttention"), use built-in PyTorch modules. Show that you know the API. However, be prepared to explain what happens inside any module you use. A good rule: use nn.Linear, nn.Conv2d, etc. but be ready to write the forward pass manually if asked. The key implementations to practice from scratch: multi-head attention, a training loop, a ResNet block, an LSTM cell, and a basic GAN.

How do I handle questions about papers I have not read?

Be honest: "I have not read that specific paper, but based on the name/context, I believe it addresses [problem]. Here is how I would approach that problem..." Then describe a reasonable approach. Interviewers test your thinking process more than your paper knowledge. Key papers to actually read: Attention Is All You Need (Transformer), ResNet, BERT, GPT-2/3, LoRA, and one diffusion model paper (DDPM or Stable Diffusion). Knowing these six papers deeply covers most questions.

What if I get a question on a topic not covered in this course?

Use the framework: 1) Identify what type of problem it is (classification, generation, sequence modeling, etc.). 2) Connect it to something you know ("This is similar to [known concept] because..."). 3) Reason from first principles. 4) Be honest about the limits of your knowledge. Topics that occasionally appear but are not covered here: graph neural networks, reinforcement learning for fine-tuning (RLHF/DPO), neural architecture search, meta-learning, and federated learning. A surface-level understanding of each is sufficient for most interviews.

How long should my answers be?

Rapid-fire questions: 30-60 seconds. Give a concise, accurate answer. Deep-dive questions: 3-5 minutes. Start with a high-level answer (30 seconds), then go into details. Check in: "Would you like me to go deeper on any part?" Coding questions: narrate as you code. Explain your approach for 1-2 minutes, then code for 15-25 minutes. The biggest mistake is rambling — give a structured answer and let the interviewer steer the depth.

Do I need to know the latest models (GPT-4, Claude, Gemini)?

You should know the general architecture (decoder-only Transformer), key training techniques (RLHF, instruction tuning, scaling laws), and high-level capabilities. You do not need to know proprietary details. What matters more: understanding WHY these models work (Transformer architecture, pre-training on web data, emergent capabilities at scale) rather than specific benchmark numbers. For AI startup interviews, deeper knowledge of recent techniques (DPO, constitutional AI, mixture of experts) is expected.

How do I prepare for design/open-ended questions?

Use a structured framework: 1) Clarify requirements (data, scale, latency, metrics). 2) Data strategy (where does data come from, how much, any issues like class imbalance). 3) Model choice with justification (compare 2-3 options). 4) Training plan (loss function, optimizer, augmentation). 5) Evaluation (offline metrics + online metrics like A/B test). 6) Deployment considerations (latency, model size, monitoring). Practice with: "Design a hate speech detector for a social media platform" or "Build a product recommendation system for an e-commerce site."

Should I use TensorFlow or PyTorch for interviews?

PyTorch is the overwhelming default for DL interviews in 2025. It is used in most research, at most AI companies, and is the expected framework unless told otherwise. Google teams may use JAX/Flax. If the job posting mentions TensorFlow, prepare for that. But if no framework is specified, always default to PyTorch. Key APIs to know cold: nn.Module, forward(), nn.Linear, nn.Conv2d, nn.LSTM, nn.MultiheadAttention, optim.AdamW, DataLoader, and the train/eval pattern.

Your Final Checklist

Before your interview, make sure you can:

  • Explain why non-linear activations are necessary (and compare ReLU, GELU, Swish)
  • Calculate CNN output dimensions given input, kernel, stride, and padding
  • Draw an LSTM cell with all four gates and explain the cell state highway
  • Implement multi-head attention from scratch in PyTorch (the number one coding question)
  • Explain scaling by sqrt(d_k), KV-cache, and why Transformers replaced RNNs
  • Compare BERT and GPT (encoder vs decoder, bidirectional vs causal)
  • Write a complete training loop with model.eval(), torch.no_grad(), gradient clipping, and LR scheduling
  • Explain mixed precision training, gradient accumulation, and LoRA
  • Compare GANs, VAEs, and diffusion models with trade-offs for each
  • Describe what FID measures and how classifier-free guidance works
  • Debug a training run: NaN loss, overfitting, underfitting, dying ReLU
  • Draw any architecture on a whiteboard: big picture first, zoom in, add math, discuss trade-offs
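Several checklist items (the training loop with `model.eval()`, `torch.no_grad()`, gradient clipping, and LR scheduling) fit in one compact pattern. A minimal sketch: the toy model, random data, and hyperparameters are placeholders for a real task.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholders standing in for a real model and dataset
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,))), batch_size=16)
val_loader = DataLoader(
    TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,))), batch_size=16)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)

for epoch in range(5):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        # Gradient clipping guards against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()              # LR schedule advances once per epoch

    model.eval()                  # correct dropout/batchnorm behavior
    val_loss, n = 0.0, 0
    with torch.no_grad():         # no autograd bookkeeping during eval
        for x, y in val_loader:
            val_loss += criterion(model(x), y).item() * y.size(0)
            n += y.size(0)
    val_loss /= n
```

This is the skeleton interviewers expect you to write from memory; mixed precision and gradient accumulation slot into the inner loop without changing the overall shape.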