Advanced Patterns & Tips

Your one-page reference for PyTorch in DL coding interviews. Covers the most important idioms, common pitfalls, production patterns, and frequently asked questions.

PyTorch Idioms Cheat Sheet

Model Definition

import torch
import torch.nn as nn
import torch.nn.functional as F

# ---- Always subclass nn.Module ----
class MyModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed = nn.Embedding(config.vocab_size, config.d_model)
        self.layers = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.num_layers)
        ])
        self.head = nn.Linear(config.d_model, config.vocab_size, bias=False)
        # Weight tying (common in LLMs)
        self.head.weight = self.embed.weight

    def forward(self, input_ids):
        x = self.embed(input_ids)       # (B, S) -> (B, S, D)
        for layer in self.layers:
            x = layer(x)                 # (B, S, D) -> (B, S, D)
        logits = self.head(x)            # (B, S, D) -> (B, S, V)
        return logits

# ---- Use nn.ModuleList, NOT Python list ----
# WRONG: self.layers = [nn.Linear(10, 10) for _ in range(3)]
#   -> Parameters not registered! model.parameters() returns empty
# RIGHT: self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

# ---- Use nn.ModuleDict for conditional modules ----
self.heads = nn.ModuleDict({
    'classification': nn.Linear(d_model, num_classes),
    'regression': nn.Linear(d_model, 1),
})
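To make the registration difference concrete, here is a minimal, runnable sketch (the two toy classes are illustrative, not from the lesson): a plain Python list hides the layers from PyTorch, while nn.ModuleList registers them.

```python
import torch.nn as nn

class ListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the Linear modules exist, but PyTorch never sees them
        self.layers = [nn.Linear(10, 10) for _ in range(3)]

class ModuleListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each child, so parameters() finds them
        self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

print(len(list(ListModel().parameters())))        # 0 -- nothing registered
print(len(list(ModuleListModel().parameters())))  # 6 -- 3 weights + 3 biases
```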

Training Patterns

# ---- The Complete Training Step ----
model.train()
optimizer.zero_grad()                    # 1. Zero gradients
with torch.amp.autocast('cuda'):         # 2. Mixed precision forward
    logits = model(inputs)
    loss = criterion(logits, targets)
scaler.scale(loss).backward()            # 3. Scaled backward
scaler.unscale_(optimizer)               # 4. Unscale grads before clipping
torch.nn.utils.clip_grad_norm_(          # 5. Gradient clipping
    model.parameters(), max_norm=1.0
)
scaler.step(optimizer)                   # 6. Optimizer step
scaler.update()                          # 7. Update scaler
scheduler.step()                         # 8. LR scheduler step

# ---- The Complete Evaluation Step ----
model.eval()
with torch.no_grad():
    logits = model(inputs)
    loss = criterion(logits, targets)
    preds = logits.argmax(dim=-1)
model.train()  # don't forget to restore

# ---- Gradient Accumulation ----
for i, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
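The division by accum_steps matters: with a mean-reduction loss and equal-sized micro-batches, accumulating scaled gradients over the micro-batches reproduces the full-batch gradient exactly. A minimal sketch verifying this (toy data and model, no optimizer step so the weights stay fixed between runs):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

def grads(model, batches, accum_steps):
    """Accumulate gradients over micro-batches, return a copy of each grad."""
    model.zero_grad()
    for xb, yb in batches:
        loss = nn.functional.mse_loss(model(xb), yb) / accum_steps
        loss.backward()
    return [p.grad.clone() for p in model.parameters()]

model = nn.Linear(4, 1)
full = grads(model, [(x, y)], accum_steps=1)
accum = grads(model, [(x[:4], y[:4]), (x[4:], y[4:])], accum_steps=2)

for g_full, g_accum in zip(full, accum):
    print(torch.allclose(g_full, g_accum, atol=1e-6))  # True
```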

Essential Operations

# ---- Tensor Operations Quick Reference ----

# Shape manipulation
x.view(B, -1)                    # reshape (contiguous only)
x.reshape(B, -1)                 # reshape (always works)
x.permute(0, 2, 1)              # swap dimensions
x.unsqueeze(1)                   # add dimension
x.squeeze(1)                     # remove dimension
x.expand(B, S, D)               # broadcast without copying
x.contiguous()                   # ensure contiguous memory

# Indexing
x[torch.arange(B), labels]      # fancy indexing (gather equivalent)
x.gather(1, idx.unsqueeze(1))   # gather along dimension
x.scatter_(1, idx.unsqueeze(1), values)  # scatter (inverse of gather)
torch.where(cond, x, y)         # conditional select

# Reductions
x.mean(dim=1, keepdim=True)     # always use keepdim for broadcasting
x.sum(dim=-1)
x.max(dim=1)                    # returns (values, indices)
x.argmax(dim=-1)

# Linear algebra
x @ w                           # matrix multiply
torch.einsum('bhid,bhjd->bhij', Q, K)  # einsum
torch.linalg.norm(x, dim=-1)   # L2 norm

# Gradient control
x.detach()                       # remove from computation graph
with torch.no_grad(): ...       # disable gradient tracking
x.requires_grad_(True)          # enable gradient tracking

Common Pitfalls

1. Forgetting model.eval()

Bug: Dropout and BatchNorm behave differently in training vs eval mode. Forgetting model.eval() during validation gives wrong metrics. Fix: Always wrap validation in model.eval() + torch.no_grad().
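A quick sketch of the behavior difference, using a standalone Dropout layer (illustrative, not the lesson's model): in train mode it zeroes values; in eval mode it is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
print(drop(x).eq(0).any().item())  # True: about half the values are zeroed

drop.eval()
print(torch.equal(drop(x), x))     # True: dropout is the identity in eval mode
```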

2. Memory Leak from Tensors

Bug: total_loss += loss keeps the entire computation graph in memory. Fix: Use total_loss += loss.item() to extract the scalar value and free the graph.
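The difference is easy to see on a toy loss (illustrative sketch): accumulating the tensor keeps the result attached to autograd, while .item() yields a plain float and lets each iteration's graph be freed.

```python
import torch

w = torch.randn(3, requires_grad=True)
total_graph = 0.0
total_scalar = 0.0
for _ in range(3):
    loss = (w * 2).sum()
    total_graph += loss          # tensor: keeps every iteration's graph alive
    total_scalar += loss.item()  # float: graph freed after each iteration

print(type(total_scalar))          # <class 'float'>
print(total_graph.requires_grad)   # True -- still attached to autograd
```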

3. In-place on Leaf Tensor

Bug: x.add_(1) on a tensor with requires_grad=True corrupts the gradient computation. Fix: Use x = x + 1 (out-of-place) for tensors that need gradients.
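In fact PyTorch refuses the in-place op outright on a leaf that requires grad, which makes this an easy bug to demonstrate (minimal sketch):

```python
import torch

x = torch.ones(3, requires_grad=True)
try:
    x.add_(1)  # in-place op on a leaf that requires grad
except RuntimeError as e:
    print("RuntimeError:", e)

y = x + 1  # out-of-place: builds a graph node instead of mutating the leaf
print(y.requires_grad)  # True
```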

4. Wrong Dimension in Softmax

Bug: F.softmax(logits) without specifying dim gives a deprecation warning and may use the wrong axis. Fix: Always specify dim: F.softmax(logits, dim=-1).

5. Not Zeroing Gradients

Bug: Gradients accumulate by default. Without optimizer.zero_grad(), each backward adds to existing gradients. Fix: Call optimizer.zero_grad() before every loss.backward() (unless doing gradient accumulation).
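The accumulation is easy to observe on a scalar example (illustrative sketch): two backward passes without zeroing double the gradient.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

(w ** 2).sum().backward()   # d/dw (w^2) = 2w = 4
first = w.grad.clone()
print(first)                # tensor([4.])

(w ** 2).sum().backward()   # no zeroing: the new gradient is ADDED
second = w.grad.clone()
print(second)               # tensor([8.])

w.grad = None               # or optimizer.zero_grad() in a training loop
```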

6. Using Python List for Modules

Bug: self.layers = [nn.Linear(10, 10)] does not register parameters. model.parameters() returns nothing. Fix: Use nn.ModuleList or nn.Sequential.

Production Patterns

Device-agnostic: device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Reproducibility: torch.manual_seed(42); torch.cuda.manual_seed_all(42)
Model summary: sum(p.numel() for p in model.parameters() if p.requires_grad)
Freeze layers: for p in model.backbone.parameters(): p.requires_grad_(False)
Weight init: nn.init.kaiming_normal_(m.weight, mode='fan_out')
Save/load: torch.save(model.state_dict(), path); model.load_state_dict(torch.load(path))
Compile (2.0+): model = torch.compile(model)  # up to 2x speedup
Export to ONNX: torch.onnx.export(model, dummy_input, 'model.onnx')
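Several of these one-liners combine naturally at the top of a training script. A minimal sketch (the toy nn.Sequential model is illustrative):

```python
import torch
import torch.nn as nn

# Device-agnostic setup and reproducible seeding
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(42)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)

# Trainable-parameter count (the "model summary" one-liner)
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # 16*32 + 32 + 32*4 + 4 = 676

# Freeze the first layer, as one would for a pretrained backbone
for p in model[0].parameters():
    p.requires_grad_(False)
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_trainable)  # 32*4 + 4 = 132
```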

Frequently Asked Questions

What is the difference between view, reshape, and contiguous?

view returns a new tensor with the same data but different shape. It requires the tensor to be contiguous in memory (data stored in a single, uninterrupted block). After operations like transpose or permute, the tensor is no longer contiguous, so view will fail with a RuntimeError. reshape works the same as view when possible, but if the tensor is not contiguous, it copies the data to make it contiguous first. contiguous() explicitly copies data to contiguous memory. Best practice: Use reshape when you do not care about copies. Use view when you explicitly want to ensure no copy happens (so bugs surface early). Use .contiguous().view() when you need a view after transpose/permute.
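A small runnable sketch of all three behaviors on a transposed tensor:

```python
import torch

x = torch.arange(6).reshape(2, 3)
t = x.t()                      # transpose: same storage, non-contiguous strides
print(t.is_contiguous())       # False

try:
    t.view(6)                  # view requires contiguous memory
except RuntimeError as e:
    print("view failed:", e)

print(t.reshape(6))            # copies when needed: tensor([0, 3, 1, 4, 2, 5])
print(t.contiguous().view(6))  # explicit copy first, then view succeeds
```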

When should I use nn.Parameter vs register_buffer?

nn.Parameter is for learnable weights that should receive gradients and be updated by the optimizer. Examples: weight matrices, bias vectors, embedding tables. register_buffer is for non-learnable state that should be part of the model's state_dict and move with .to(device) but should NOT receive gradients. Examples: running mean/variance in BatchNorm, positional encoding tables, boolean masks. If you use a plain tensor attribute (self.mask = torch.ones(10)), it will NOT be saved in state_dict and will NOT move when you call model.cuda() — this is a common bug.
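A minimal sketch contrasting the three cases (the toy Norm class is illustrative):

```python
import torch
import torch.nn as nn

class Norm(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(4))              # learnable, gets grads
        self.register_buffer('running_mean', torch.zeros(4))  # saved + moved, no grads
        self.plain = torch.zeros(4)                           # NOT saved, NOT moved: a bug

m = Norm()
print(sorted(m.state_dict().keys()))               # ['running_mean', 'scale'] -- no 'plain'
print([name for name, _ in m.named_parameters()])  # ['scale'] -- only the Parameter
```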

What does torch.compile do, and when should I use it?

torch.compile (PyTorch 2.0+) traces your model's computation graph and compiles it with TorchDynamo + a backend (default: TorchInductor). It can provide 1.5-2x speedup by fusing operations, eliminating Python overhead, and generating optimized GPU kernels. Use it for inference in production and for training when your model does not have highly dynamic control flow. Caveats: (1) The first call is slow (compilation), (2) Dynamic shapes may cause recompilation, (3) Some operations (custom CUDA kernels, certain autograd functions) may not be compatible. Best practice: Add model = torch.compile(model) after model creation and benchmark to see if it helps your specific workload.

Which PyTorch coding questions come up most often in interviews?

Based on publicly shared interview experiences at NVIDIA, Meta, Google DeepMind, and AI startups: (1) Implement multi-head attention from scratch (most common), (2) Write a complete training loop with all best practices, (3) Implement a custom loss function (focal, triplet, or contrastive), (4) Debug a broken training script (find the bug), (5) Implement layer normalization from scratch, (6) Write a custom Dataset and DataLoader for variable-length sequences, (7) Explain the difference between model.train() and model.eval() and why it matters, (8) Implement gradient checkpointing and explain the memory-compute tradeoff. The common thread: every question tests whether you can build real DL systems, not just use high-level APIs.

How do I handle variable-length sequences in a batch?

Three approaches: (1) Padding + attention mask: Pad all sequences to max length, create a boolean mask (1 for real tokens, 0 for padding), and pass the mask to attention layers. This is the standard for transformers. Use torch.nn.utils.rnn.pad_sequence for padding and a custom collate function. (2) PackedSequence: For RNNs, use pack_padded_sequence and pad_packed_sequence to avoid computing over padding tokens. This is more memory efficient but only works with RNNs. (3) Bucketing: Sort samples by length and group similar lengths into the same batch. This minimizes padding waste. In production, combine bucketing with padding + attention masks for best efficiency.
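A sketch of approach (1) as a hypothetical collate_fn (the function name and toy batch are illustrative, not a fixed API):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    """Pad variable-length 1-D token tensors and build a boolean attention mask."""
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)  # (B, max_len)
    # mask: True for real tokens, False for padding
    mask = torch.arange(padded.size(1))[None, :] < lengths[:, None]
    return padded, mask

batch = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
padded, mask = collate_fn(batch)
print(padded.shape)  # torch.Size([3, 3])
print(mask)
# tensor([[ True,  True,  True],
#         [ True,  True, False],
#         [ True, False, False]])
```

Pass collate_fn to DataLoader via its collate_fn argument so every batch arrives padded and masked.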

What is the difference between DataParallel and DistributedDataParallel?

DataParallel (DP) is simple but inefficient: it replicates the model to all GPUs every forward pass and gathers outputs on GPU 0, creating a bottleneck. DistributedDataParallel (DDP) is the production standard: each GPU runs its own process with its own model replica, and gradients are synchronized via all-reduce after backward. DDP is faster because (1) no model replication every step, (2) gradient sync overlaps with backward, (3) no single-GPU bottleneck. Always use DDP for multi-GPU training. DP is deprecated in practice. For multi-node training, use torchrun (the older torch.distributed.launch is deprecated) to start DDP processes.

How should I initialize weights?

Weight initialization significantly impacts training dynamics. Common strategies: (1) Kaiming (He) initialization: Use for ReLU networks. nn.init.kaiming_normal_(weight, mode='fan_out', nonlinearity='relu'). (PyTorch's own default for Linear layers is a Kaiming-uniform variant.) (2) Xavier (Glorot) initialization: Use for sigmoid/tanh networks. nn.init.xavier_uniform_(weight). (3) Normal with small std: Common for transformer models. nn.init.normal_(weight, std=0.02), as in GPT-2. (4) Scaled initialization: For deep transformers, scale the residual branch weights by 1/sqrt(2*num_layers) to keep activations stable. Always initialize biases to zero. Apply init in a _init_weights method called from __init__ or via model.apply(init_fn).
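A sketch of the model.apply pattern with a hypothetical init_fn, using the GPT-2-style std=0.02 normal init from strategy (3):

```python
import torch
import torch.nn as nn

def init_fn(m):
    """Normal(0, 0.02) weights and zero biases for every Linear layer."""
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
model.apply(init_fn)  # recurses over every submodule

print(model[0].bias.abs().max().item())        # 0.0 -- biases zeroed
print(round(model[0].weight.std().item(), 3))  # ~0.02
```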

Course Summary

  • Lesson 1: DL interviews test implementation fluency — always use nn.Module, nn.Parameter, and proper autograd
  • Lesson 2: Reshape, broadcast, index, einsum, gradient computation, device management — the foundation
  • Lesson 3: Linear layer, multi-head attention, layer norm, residual block, positional encoding — the most asked questions
  • Lesson 4: Complete training loop, LR scheduling, gradient clipping, mixed precision, checkpointing — production infrastructure
  • Lesson 5: Focal, triplet, contrastive, dice losses and custom regularization — specialized applications
  • Lesson 6: Custom datasets, augmentation, collate functions, distributed sampling, streaming — data pipeline engineering
  • Lesson 7: Finding bugs, memory optimization, profiling, gradient checking, NaN detection — senior engineer skills
  • Lesson 8 (this): Cheat sheet, pitfalls, production patterns, and FAQ for quick review
Final advice: In DL coding interviews, (1) always annotate tensor shapes in comments, (2) use nn.Module and nn.Parameter from the start, (3) mention train/eval mode switching proactively, (4) use F.log_softmax instead of torch.log(F.softmax()) for numerical stability, and (5) structure your solution as init + forward before writing any code. This systematic approach signals production experience.