PyTorch in Coding Interviews
PyTorch is the default framework for deep learning interviews at NVIDIA, Meta, Google DeepMind, and OpenAI. This lesson covers what interviewers actually test, reviews the tensor and autograd fundamentals you must know cold, and maps out the challenges ahead.
What DL Interviews Actually Test
Deep learning coding interviews are fundamentally different from traditional software engineering interviews. You are not solving LeetCode problems — you are building neural network components, writing training infrastructure, and debugging model issues. The interviewer is evaluating whether you can ship real DL systems.
Implementation Fluency
Can you implement multi-head attention, layer norm, or a residual block from memory? Interviewers expect you to write nn.Module subclasses without looking at documentation.
Training Infrastructure
Can you write a complete training loop with gradient clipping, mixed precision, LR scheduling, and checkpointing? This is day-one knowledge for any DL role.
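The pieces named here can be sketched in one minimal loop. Everything below (the tiny model, synthetic data, and hyperparameter values) is illustrative, not a prescribed recipe; mixed precision and full checkpointing are covered in Lesson 4:

```python
import torch
import torch.nn as nn

# Illustrative setup -- any model and data loader would do
model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
loss_fn = nn.CrossEntropyLoss()

# Synthetic batches standing in for a DataLoader
data = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(5)]

model.train()
for x, y in data:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip gradients BEFORE the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()

# Checkpointing: save everything needed to resume training
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}
```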
Debugging Instinct
Given a training script with a subtle bug (wrong dimension, detached gradient, memory leak), can you find and fix it? This separates senior from junior engineers.
Loss Function Design
Can you implement focal loss for imbalanced classification, triplet loss for embeddings, or dice loss for segmentation? You need to know when and why to use each one.
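As one concrete example, focal loss down-weights easy examples by scaling cross-entropy with a (1 - p_t)^gamma factor. A minimal binary-classification sketch (the gamma and alpha defaults are the common choices from the original paper, but are illustrative here):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Per-example binary cross-entropy, no reduction yet
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    # alpha weights the positive class, (1 - alpha) the negative class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
loss = focal_loss(logits, targets)
```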
Tensor Basics Review
Before tackling challenges, make sure these tensor fundamentals are second nature. Every challenge in this course builds on them.
import torch
import torch.nn as nn
import torch.nn.functional as F
# ---- Tensor Creation ----
x = torch.tensor([1.0, 2.0, 3.0]) # from list
z = torch.zeros(3, 4) # 3x4 zeros
o = torch.ones(2, 3) # 2x3 ones
r = torch.randn(3, 4) # standard normal
e = torch.eye(3) # identity
a = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
l = torch.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1.0]
# ---- Key Properties ----
print(x.shape) # torch.Size([3])
print(x.dtype) # torch.float32
print(x.device) # cpu
print(x.requires_grad) # False
# ---- Shape Manipulation ----
x = torch.arange(12).float()
x_2d = x.view(3, 4) # reshape (contiguous only)
x_2d = x.reshape(3, 4) # reshape (always works)
x_T = x_2d.T # transpose
x_p = x_2d.permute(1, 0) # permute dimensions
x_u = x.unsqueeze(0) # add dimension: (12,) -> (1, 12)
x_s = x_u.squeeze(0) # remove dimension: (1, 12) -> (12,)
# ---- view vs reshape vs contiguous ----
# view requires contiguous memory layout
# reshape works always (copies if needed)
# After transpose/permute, tensor is NOT contiguous
y = x_2d.T # not contiguous
# y.view(-1) # ERROR!
y_flat = y.contiguous().view(-1) # OK
y_flat = y.reshape(-1) # OK (handles it internally)
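The contiguity rules above are easy to verify directly with is_contiguous() (values here are illustrative):

```python
import torch

x = torch.arange(12).reshape(3, 4)
assert x.is_contiguous()

y = x.T  # transpose shares storage, only strides change
assert not y.is_contiguous()

# view fails on non-contiguous tensors; reshape copies if needed
try:
    y.view(-1)
except RuntimeError:
    pass  # expected: view requires contiguous memory

flat = y.reshape(-1)  # works: copies internally when necessary
assert flat.shape == (12,)
```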
Autograd Review
Autograd is the engine behind PyTorch. Every custom layer, loss function, and training loop in this course depends on understanding how gradients flow.
import torch
# ---- Basic Gradient Computation ----
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x.pow(2).sum() # y = x0^2 + x1^2
y.backward() # compute gradients
print(x.grad) # tensor([4., 6.]) -- dy/dx = 2x
# ---- The Computation Graph ----
# PyTorch builds a DAG of operations dynamically
# .backward() traverses this graph in reverse
# Leaf tensors (created by user) accumulate gradients
# Non-leaf tensors (results of operations) do NOT store gradients by default
w = torch.randn(3, 4, requires_grad=True) # leaf tensor
b = torch.zeros(4, requires_grad=True) # leaf tensor
x = torch.randn(5, 3) # no grad needed for input
y = x @ w + b # non-leaf (result of ops)
loss = y.sum()
loss.backward()
print(w.grad.shape) # (3, 4) -- same shape as w
# ---- Critical: Gradient Accumulation ----
# Gradients ACCUMULATE across .backward() calls
# You MUST zero them before each step
w.grad.zero_() # or optimizer.zero_grad()
# ---- Detaching from the Graph ----
# .detach() creates a tensor that shares data but has no grad history
z = y.detach() # z has same values as y, but no grad_fn
# This is essential for:
# - target values in loss computation
# - stopping gradient flow in certain architectures
# - moving tensors to numpy: y.detach().cpu().numpy()
# ---- torch.no_grad() Context ----
# Disables gradient tracking for inference/evaluation
with torch.no_grad():
    predictions = model(test_input)
# No computation graph is built -- faster, less memory
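The accumulation behavior described above is worth verifying once yourself; this short check (with illustrative values) shows gradients adding up across backward() calls:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (w * 3).sum()
loss.backward()
first = w.grad.clone()  # tensor([3., 3.]) -- d(3*w)/dw = 3

# Calling backward again WITHOUT zeroing adds to the existing grad
loss = (w * 3).sum()
loss.backward()
assert torch.equal(w.grad, 2 * first)  # tensor([6., 6.])

w.grad.zero_()  # reset before the next step
assert torch.equal(w.grad, torch.zeros(2))
```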
Always subclass nn.Module, register parameters with nn.Parameter, and implement forward(). Never use raw tensors with requires_grad=True for model parameters; that is the number one red flag interviewers look for.
The nn.Module Pattern
Every custom layer in PyTorch follows the same pattern. Internalize this template — you will use it in every interview challenge.
import torch
import torch.nn as nn
class MyLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # Register learnable parameters
        self.weight = nn.Parameter(torch.randn(in_features, out_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Register non-learnable state with register_buffer
        self.register_buffer('running_mean', torch.zeros(out_features))

    def forward(self, x):
        # x: (batch_size, in_features)
        out = x @ self.weight + self.bias  # (batch_size, out_features)
        return out
# ---- What nn.Module gives you for free ----
layer = MyLayer(10, 5)
# Automatic parameter tracking
list(layer.parameters()) # [weight, bias]
list(layer.named_parameters()) # [('weight', ...), ('bias', ...)]
# Automatic device movement
layer.to('cuda') # moves ALL parameters and buffers
# Automatic train/eval mode
layer.train() # sets self.training = True
layer.eval() # sets self.training = False
# Automatic serialization
torch.save(layer.state_dict(), 'model.pt')
layer.load_state_dict(torch.load('model.pt'))
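Buffers deserve special attention: they move with .to(), are saved in state_dict(), but receive no gradients. A minimal sketch of a layer that updates its buffer only in training mode (the layer name and momentum value are illustrative, loosely modeled on batch norm's running statistics):

```python
import torch
import torch.nn as nn

class RunningMeanLayer(nn.Module):
    def __init__(self, features, momentum=0.1):
        super().__init__()
        self.momentum = momentum
        # Buffer: part of state_dict, moved by .to(), but NOT a parameter
        self.register_buffer('running_mean', torch.zeros(features))

    def forward(self, x):
        # x: (batch, features)
        if self.training:
            batch_mean = x.mean(dim=0).detach()  # no grad through the stats
            self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
        return x - self.running_mean

layer = RunningMeanLayer(4)
layer.train()
_ = layer(torch.randn(8, 4))  # updates running_mean
layer.eval()
out = layer(torch.zeros(2, 4))  # uses the frozen running_mean
```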
Interviewer Evaluation Rubric
# What DL interviewers evaluate (based on public interview reports):
evaluation_criteria = {
    "nn_module_usage": {
        "weight": "HIGH",
        "description": "Does the candidate use nn.Module correctly?",
        "red_flag": "Using raw tensors with requires_grad for model params",
        "green_flag": "nn.Parameter, register_buffer, proper __init__/forward"
    },
    "shape_awareness": {
        "weight": "HIGH",
        "description": "Can the candidate track tensor shapes through operations?",
        "red_flag": "Trial-and-error shape debugging",
        "green_flag": "Comments with shape annotations at each step"
    },
    "gradient_understanding": {
        "weight": "HIGH",
        "description": "Does the candidate understand autograd?",
        "red_flag": "Forgetting to zero gradients, detach targets, or no_grad for eval",
        "green_flag": "Correct gradient flow, knows when to detach"
    },
    "training_loop_completeness": {
        "weight": "MEDIUM",
        "description": "Can the candidate write a production training loop?",
        "red_flag": "Missing eval mode, no gradient clipping, no checkpointing",
        "green_flag": "Complete loop with all best practices"
    },
    "debugging_ability": {
        "weight": "MEDIUM",
        "description": "Can the candidate find and fix bugs?",
        "red_flag": "Cannot identify common issues like shape mismatches",
        "green_flag": "Systematically checks shapes, dtypes, devices, grad flow"
    }
}
Course Roadmap
Course Structure:
Lesson 2: Tensor Operations (6 challenges)
- Reshaping & views, broadcasting, advanced indexing
- Einsum, gradient computation, device management
- Foundation for all subsequent lessons
Lesson 3: Custom Layers & Modules (5 challenges)
- Linear layer from scratch, multi-head attention
- Layer normalization, residual block, positional encoding
- The most common DL interview question type
Lesson 4: Training Loops (5 challenges)
- Complete training loop, LR scheduling
- Gradient clipping, mixed precision, checkpointing
- Production-grade training infrastructure
Lesson 5: Custom Loss Functions (5 challenges)
- Focal loss, triplet loss, contrastive loss
- Dice loss, custom regularization
- Essential for specialized ML applications
Lesson 6: Datasets & DataLoaders (5 challenges)
- Custom Dataset, augmentation pipeline
- Collate functions, distributed sampling, streaming
- Data pipeline engineering
Lesson 7: Debugging & Optimization (5 challenges)
- Finding bugs, memory optimization, profiling
- Gradient checking, NaN detection
- The skills that save days of debugging
Lesson 8: Patterns & Tips
- PyTorch idioms, common pitfalls, FAQ
Key Takeaways
- DL interviews test implementation fluency, not theory — you must code layers from memory
- Always use nn.Module with nn.Parameter — never raw tensors for model weights
- Understand autograd deeply: gradient accumulation, detach, no_grad, computation graph
- Shape tracking is critical — annotate shapes in comments during interviews
- The patterns you learn here transfer directly to JAX, TensorFlow, and any new framework
Lilly Tech Systems