PyTorch in Coding Interviews
PyTorch is the default framework for deep learning interviews at NVIDIA, Meta, Google DeepMind, and OpenAI. This lesson covers what interviewers actually test, reviews the tensor and autograd fundamentals you must know cold, and maps out the challenges ahead.
What DL Interviews Actually Test
Deep learning coding interviews are fundamentally different from traditional software engineering interviews. You are not solving LeetCode problems — you are building neural network components, writing training infrastructure, and debugging model issues. The interviewer is evaluating whether you can ship real DL systems.
Implementation Fluency
Can you implement multi-head attention, layer norm, or a residual block from memory? Interviewers expect you to write nn.Module subclasses without looking at documentation.
Training Infrastructure
Can you write a complete training loop with gradient clipping, mixed precision, LR scheduling, and checkpointing? This is day-one knowledge for any DL role.
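The pieces named here can be sketched in one minimal loop. Everything below (the tiny model, synthetic data, and hyperparameter values) is illustrative, not a prescribed recipe; mixed precision and full checkpointing are covered in Lesson 4:

```python
import torch
import torch.nn as nn

# Illustrative setup -- any model and data loader would do
model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
loss_fn = nn.CrossEntropyLoss()

# Synthetic batches standing in for a DataLoader
data = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(5)]

model.train()
for x, y in data:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip gradients BEFORE the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()

# Checkpointing: save everything needed to resume training
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}
```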
Debugging Instinct
Given a training script with a subtle bug (wrong dimension, detached gradient, memory leak), can you find and fix it? This separates senior from junior engineers.
Loss Function Design
Can you implement focal loss for imbalanced classification, triplet loss for embeddings, or dice loss for segmentation? You need to know when and why to use each one.
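As one concrete example, focal loss down-weights easy examples by scaling cross-entropy with a (1 - p_t)^gamma factor. A minimal binary-classification sketch (the gamma and alpha defaults are the common choices from the original paper, but are illustrative here):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Per-example binary cross-entropy, no reduction yet
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    # alpha weights the positive class, (1 - alpha) the negative class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
loss = focal_loss(logits, targets)
```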
Tensor Basics Review
Before tackling challenges, make sure these tensor fundamentals are second nature. Every challenge in this course builds on them.
import torch
import torch.nn as nn
import torch.nn.functional as F
# ---- Tensor Creation ----
x = torch.tensor([1.0, 2.0, 3.0]) # from list
z = torch.zeros(3, 4) # 3x4 zeros
o = torch.ones(2, 3) # 2x3 ones
r = torch.randn(3, 4) # standard normal
e = torch.eye(3) # identity
a = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
l = torch.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1.0]
# ---- Key Properties ----
print(x.shape) # torch.Size([3])
print(x.dtype) # torch.float32
print(x.device) # cpu
print(x.requires_grad) # False
# ---- Shape Manipulation ----
x = torch.arange(12).float()
x_2d = x.view(3, 4) # reshape (contiguous only)
x_2d = x.reshape(3, 4) # reshape (always works)
x_T = x_2d.T # transpose
x_p = x_2d.permute(1, 0) # permute dimensions
x_u = x.unsqueeze(0) # add dimension: (12,) -> (1, 12)
x_s = x_u.squeeze(0) # remove dimension: (1, 12) -> (12,)
# ---- view vs reshape vs contiguous ----
# view requires contiguous memory layout
# reshape works always (copies if needed)
# After transpose/permute, tensor is NOT contiguous
y = x_2d.T # not contiguous
# y.view(-1) # ERROR!
y_flat = y.contiguous().view(-1) # OK
y_flat = y.reshape(-1) # OK (handles it internally)
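The contiguity rules above are easy to verify directly with is_contiguous() (values here are illustrative):

```python
import torch

x = torch.arange(12).reshape(3, 4)
assert x.is_contiguous()

y = x.T  # transpose shares storage, only strides change
assert not y.is_contiguous()

# view fails on non-contiguous tensors; reshape copies if needed
try:
    y.view(-1)
except RuntimeError:
    pass  # expected: view requires contiguous memory

flat = y.reshape(-1)  # works: copies internally when necessary
assert flat.shape == (12,)
```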
Autograd Review
Autograd is the engine behind PyTorch. Every custom layer, loss function, and training loop in this course depends on understanding how gradients flow.
import torch
# ---- Basic Gradient Computation ----
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x.pow(2).sum() # y = x0^2 + x1^2
y.backward() # compute gradients
print(x.grad) # tensor([4., 6.]) -- dy/dx = 2x
# ---- The Computation Graph ----
# PyTorch builds a DAG of operations dynamically
# .backward() traverses this graph in reverse
# Leaf tensors (created by user) accumulate gradients
# Non-leaf tensors (results of operations) do NOT store gradients by default
w = torch.randn(3, 4, requires_grad=True) # leaf tensor
b = torch.zeros(4, requires_grad=True) # leaf tensor
x = torch.randn(5, 3) # no grad needed for input
y = x @ w + b # non-leaf (result of ops)
loss = y.sum()
loss.backward()
print(w.grad.shape) # (3, 4) -- same shape as w
# ---- Critical: Gradient Accumulation ----
# Gradients ACCUMULATE across .backward() calls
# You MUST zero them before each step
w.grad.zero_() # or optimizer.zero_grad()
# ---- Detaching from the Graph ----
# .detach() creates a tensor that shares data but has no grad history
z = y.detach() # z has same values as y, but no grad_fn
# This is essential for:
# - target values in loss computation
# - stopping gradient flow in certain architectures
# - moving tensors to numpy: y.detach().cpu().numpy()
# ---- torch.no_grad() Context ----
# Disables gradient tracking for inference/evaluation
with torch.no_grad():
    predictions = model(test_input)
# No computation graph is built -- faster, less memory
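The accumulation behavior described above is worth verifying once yourself; this short check (with illustrative values) shows gradients adding up across backward() calls:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (w * 3).sum()
loss.backward()
first = w.grad.clone()  # tensor([3., 3.]) -- d(3*w)/dw = 3

# Calling backward again WITHOUT zeroing adds to the existing grad
loss = (w * 3).sum()
loss.backward()
assert torch.equal(w.grad, 2 * first)  # tensor([6., 6.])

w.grad.zero_()  # reset before the next step
assert torch.equal(w.grad, torch.zeros(2))
```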
Always subclass nn.Module, register parameters with nn.Parameter, and implement forward(). Never use raw tensors with requires_grad=True for model parameters; that is the number one red flag interviewers look for.
The nn.Module Pattern
Every custom layer in PyTorch follows the same pattern. Internalize this template — you will use it in every interview challenge.
import torch
import torch.nn as nn
class MyLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # Register learnable parameters
        self.weight = nn.Parameter(torch.randn(in_features, out_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Register non-learnable state with register_buffer
        self.register_buffer('running_mean', torch.zeros(out_features))

    def forward(self, x):
        # x: (batch_size, in_features)
        out = x @ self.weight + self.bias  # (batch_size, out_features)
        return out
# ---- What nn.Module gives you for free ----
layer = MyLayer(10, 5)
# Automatic parameter tracking
list(layer.parameters()) # [weight, bias]
list(layer.named_parameters()) # [('weight', ...), ('bias', ...)]
# Automatic device movement
layer.to('cuda') # moves ALL parameters and buffers
# Automatic train/eval mode
layer.train() # sets self.training = True
layer.eval() # sets self.training = False
# Automatic serialization
torch.save(layer.state_dict(), 'model.pt')
layer.load_state_dict(torch.load('model.pt'))
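Buffers deserve special attention: they move with .to(), are saved in state_dict(), but receive no gradients. A minimal sketch of a layer that updates its buffer only in training mode (the layer name and momentum value are illustrative, loosely modeled on batch norm's running statistics):

```python
import torch
import torch.nn as nn

class RunningMeanLayer(nn.Module):
    def __init__(self, features, momentum=0.1):
        super().__init__()
        self.momentum = momentum
        # Buffer: part of state_dict, moved by .to(), but NOT a parameter
        self.register_buffer('running_mean', torch.zeros(features))

    def forward(self, x):
        # x: (batch, features)
        if self.training:
            batch_mean = x.mean(dim=0).detach()  # no grad through the stats
            self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
        return x - self.running_mean

layer = RunningMeanLayer(4)
layer.train()
_ = layer(torch.randn(8, 4))  # updates running_mean
layer.eval()
out = layer(torch.zeros(2, 4))  # uses the frozen running_mean
```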
Interviewer Evaluation Rubric
# What DL interviewers evaluate (based on public interview reports):
evaluation_criteria = {
    "nn_module_usage": {
        "weight": "HIGH",
        "description": "Does the candidate use nn.Module correctly?",
        "red_flag": "Using raw tensors with requires_grad for model params",
        "green_flag": "nn.Parameter, register_buffer, proper __init__/forward"
    },
    "shape_awareness": {
        "weight": "HIGH",
        "description": "Can the candidate track tensor shapes through operations?",
        "red_flag": "Trial-and-error shape debugging",
        "green_flag": "Comments with shape annotations at each step"
    },
    "gradient_understanding": {
        "weight": "HIGH",
        "description": "Does the candidate understand autograd?",
        "red_flag": "Forgetting to zero gradients, detach targets, or no_grad for eval",
        "green_flag": "Correct gradient flow, knows when to detach"
    },
    "training_loop_completeness": {
        "weight": "MEDIUM",
        "description": "Can the candidate write a production training loop?",
        "red_flag": "Missing eval mode, no gradient clipping, no checkpointing",
        "green_flag": "Complete loop with all best practices"
    },
    "debugging_ability": {
        "weight": "MEDIUM",
        "description": "Can the candidate find and fix bugs?",
        "red_flag": "Cannot identify common issues like shape mismatches",
        "green_flag": "Systematically checks shapes, dtypes, devices, grad flow"
    }
}
Course Roadmap
Course Structure:
Lesson 2: Tensor Operations (6 challenges)
- Reshaping & views, broadcasting, advanced indexing
- Einsum, gradient computation, device management
- Foundation for all subsequent lessons
Lesson 3: Custom Layers & Modules (5 challenges)
- Linear layer from scratch, multi-head attention
- Layer normalization, residual block, positional encoding
- The most common DL interview question type
Lesson 4: Training Loops (5 challenges)
- Complete training loop, LR scheduling
- Gradient clipping, mixed precision, checkpointing
- Production-grade training infrastructure
Lesson 5: Custom Loss Functions (5 challenges)
- Focal loss, triplet loss, contrastive loss
- Dice loss, custom regularization
- Essential for specialized ML applications
Lesson 6: Datasets & DataLoaders (5 challenges)
- Custom Dataset, augmentation pipeline
- Collate functions, distributed sampling, streaming
- Data pipeline engineering
Lesson 7: Debugging & Optimization (5 challenges)
- Finding bugs, memory optimization, profiling
- Gradient checking, NaN detection
- The skills that save days of debugging
Lesson 8: Patterns & Tips
- PyTorch idioms, common pitfalls, FAQ
Key Takeaways
- DL interviews test implementation fluency, not theory — you must code layers from memory
- Always use nn.Module with nn.Parameter — never raw tensors for model weights
- Understand autograd deeply: gradient accumulation, detach, no_grad, computation graph
- Shape tracking is critical — annotate shapes in comments during interviews
- The patterns you learn here transfer directly to JAX, TensorFlow, and any new framework
Lilly Tech Systems