Practice Problems & Tips
Put everything together with 10 timed practice problems that mirror real interview questions, learn to spot and fix common ML code bugs, and review answers to the most frequently asked conceptual questions.
10 Timed Practice Problems
Set a timer for each problem. The time limit reflects what you would get in a real interview. Try solving each problem before revealing the solution.
Problem 1: Normalize a Dataset (10 minutes)
Write a function that scales each feature of a dataset to zero mean and unit variance, handling constant features gracefully.
Solution:
```python
import numpy as np

def normalize(X):
    """Normalize features to zero mean and unit variance."""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    std[std == 0] = 1.0  # Avoid division by zero
    return (X - mean) / std

# Test
X = np.array([[1, 200], [2, 400], [3, 100], [4, 300], [5, 500]])
X_norm = normalize(X.astype(float))
print(f"Means: {np.mean(X_norm, axis=0)}")  # [0, 0]
print(f"Stds: {np.std(X_norm, axis=0)}")    # [1, 1]
```
Problem 2: Train-Test Split (10 minutes)
Implement train_test_split(X, y, test_size=0.2, random_state=42) from scratch. Ensure X and y are shuffled together and the split respects the test_size ratio.
Solution:
```python
def train_test_split(X, y, test_size=0.2, random_state=None):
    """Split data into training and test sets."""
    n = len(X)
    indices = np.arange(n)
    if random_state is not None:
        rng = np.random.RandomState(random_state)
        rng.shuffle(indices)
    else:
        np.random.shuffle(indices)
    split_idx = int(n * (1 - test_size))
    X_train = X[indices[:split_idx]]
    X_test = X[indices[split_idx:]]
    y_train = y[indices[:split_idx]]
    y_test = y[indices[split_idx:]]
    return X_train, X_test, y_train, y_test

# Test
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Train: {len(X_tr)}, Test: {len(X_te)}")  # 40, 10
```
Problem 3: Softmax Function (10 minutes)
Implement a numerically stable softmax that handles both 1D and 2D input and large logits without overflow.
Solution:
```python
def softmax(z):
    """Numerically stable softmax."""
    z = np.atleast_2d(z)
    # Subtract max per row for numerical stability
    # This prevents exp() overflow without changing the result
    z_shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# Test
logits = np.array([[1000, 1001, 1002]])  # Would overflow without stability trick
print(softmax(logits))  # Should work, sum to 1
logits_2d = np.array([[1, 2, 3], [4, 5, 6]])
result = softmax(logits_2d)
print(result)
print(f"Row sums: {np.sum(result, axis=1)}")  # [1.0, 1.0]
```
Why stable: Subtracting the max prevents exp() from producing infinity. Mathematically: softmax(z - c) = softmax(z) for any constant c.
Problem 4: Gradient Checking (15 minutes)
Implement a numerical gradient using central differences, and a gradient_check function that compares it against an analytical gradient.
Solution:
```python
def numerical_gradient(f, x, eps=1e-5):
    """
    Compute numerical gradient using central differences.
    f: function that takes x and returns a scalar loss
    x: numpy array of parameters
    """
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += eps
        x_minus[i] -= eps
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

def gradient_check(analytical_grad, f, x, eps=1e-5, threshold=1e-5):
    """
    Compare analytical and numerical gradients.
    Returns True if they match within threshold.
    """
    num_grad = numerical_gradient(f, x, eps)
    # Relative difference
    diff = np.linalg.norm(analytical_grad - num_grad) / \
           (np.linalg.norm(analytical_grad) + np.linalg.norm(num_grad) + 1e-15)
    passed = diff < threshold
    print(f"Analytical: {analytical_grad}")
    print(f"Numerical:  {num_grad}")
    print(f"Relative difference: {diff:.2e}")
    print(f"Gradient check: {'PASSED' if passed else 'FAILED'}")
    return passed

# Test: verify gradient of f(w) = w^2 (gradient should be 2w)
def loss_fn(w):
    return np.sum(w ** 2)

w = np.array([3.0, -2.0, 5.0])
analytical = 2 * w
gradient_check(analytical, loss_fn, w)  # Should pass
```
Problem 5: Implement Batch Gradient Descent vs Mini-Batch (15 minutes)
Implement linear regression trained with mini-batch gradient descent, shuffling the data each epoch. Explain how the batch size interpolates between batch and stochastic gradient descent.
Solution:
```python
class LinearRegressionMiniBatch:
    def __init__(self, lr=0.01, n_iter=100, batch_size=32):
        self.lr = lr
        self.n_iter = n_iter
        self.batch_size = batch_size
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        for epoch in range(self.n_iter):
            # Shuffle data at each epoch
            indices = np.random.permutation(n_samples)
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            # Process mini-batches
            for start in range(0, n_samples, self.batch_size):
                end = min(start + self.batch_size, n_samples)
                X_batch = X_shuffled[start:end]
                y_batch = y_shuffled[start:end]
                batch_n = len(X_batch)
                y_pred = X_batch @ self.weights + self.bias
                dw = (1 / batch_n) * (X_batch.T @ (y_pred - y_batch))
                db = (1 / batch_n) * np.sum(y_pred - y_batch)
                self.weights -= self.lr * dw
                self.bias -= self.lr * db
        return self

    def predict(self, X):
        return X @ self.weights + self.bias

# batch_size = n_samples -> Batch GD (smooth but slow)
# batch_size = 1         -> Stochastic GD (noisy but fast)
# batch_size = 32 or 64  -> Mini-batch (balanced)
X = 2 * np.random.rand(200, 1)
y = 4 + 3 * X.squeeze() + np.random.randn(200) * 0.5
model = LinearRegressionMiniBatch(lr=0.1, n_iter=50, batch_size=32)
model.fit(X, y)
print(f"Weight: {model.weights[0]:.4f}, Bias: {model.bias:.4f}")
```
Problem 6: Implement Polynomial Features (10 minutes)
Generate polynomial features up to a given degree, including a bias column and interaction terms.
Solution:
```python
from itertools import combinations_with_replacement

def polynomial_features(X, degree=2):
    """
    Generate polynomial features up to the given degree.
    Includes bias term (column of 1s) and interaction terms.
    """
    n_samples, n_features = X.shape
    features = [np.ones(n_samples)]  # Bias term
    # Generate all combinations with replacement
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(n_features), d):
            feature = np.ones(n_samples)
            for idx in combo:
                feature *= X[:, idx]
            features.append(feature)
    return np.column_stack(features)

# Test
X = np.array([[2, 3], [4, 5]])
X_poly = polynomial_features(X, degree=2)
print(f"Shape: {X_poly.shape}")  # (2, 6)
print(f"Features for [2,3]: {X_poly[0]}")
# Should be [1, 2, 3, 4, 6, 9] = [1, x1, x2, x1^2, x1*x2, x2^2]
```
Problem 7: Implement Cosine Similarity Search (15 minutes)
Given a query vector and a matrix of document vectors, return the indices and scores of the top-K most similar documents by cosine similarity.
Solution:
```python
def cosine_similarity_search(query, documents, top_k=5):
    """
    Find the top-K most similar documents to the query.
    Args:
        query: 1D array of shape (d,)
        documents: 2D array of shape (n_docs, d)
        top_k: number of results to return
    Returns:
        indices: indices of top-K documents
        similarities: cosine similarity scores
    """
    # Normalize query
    query_norm = query / (np.linalg.norm(query) + 1e-15)
    # Normalize documents (each row)
    doc_norms = np.linalg.norm(documents, axis=1, keepdims=True)
    doc_norms = np.maximum(doc_norms, 1e-15)  # Avoid division by zero
    docs_normalized = documents / doc_norms
    # Cosine similarity = dot product of normalized vectors
    similarities = docs_normalized @ query_norm
    # Get top-K indices (descending)
    top_indices = np.argsort(-similarities)[:top_k]
    return top_indices, similarities[top_indices]

# Test
np.random.seed(42)
query = np.array([1.0, 0.0, 1.0, 0.0])
documents = np.random.randn(100, 4)
documents[42] = query * 5  # Make doc 42 very similar
indices, scores = cosine_similarity_search(query, documents, top_k=3)
print(f"Top indices: {indices}")  # 42 should be first
print(f"Scores: {scores}")
```
Problem 8: Implement TF-IDF (20 minutes)
Compute a TF-IDF matrix from a list of raw text documents, building the vocabulary and document frequencies from scratch.
Solution:
```python
from collections import Counter
import math

def compute_tfidf(documents):
    """
    Compute TF-IDF matrix from scratch.
    Args:
        documents: list of strings
    Returns:
        tfidf_matrix: 2D numpy array (n_docs, n_vocab)
        vocab: list of terms
    """
    n_docs = len(documents)
    # Tokenize
    tokenized = [doc.lower().split() for doc in documents]
    # Build vocabulary
    vocab = sorted(set(word for doc in tokenized for word in doc))
    word_to_idx = {word: i for i, word in enumerate(vocab)}
    # Compute document frequency (DF) for each term
    df = Counter()
    for doc in tokenized:
        unique_words = set(doc)
        for word in unique_words:
            df[word] += 1
    # Compute TF-IDF
    tfidf_matrix = np.zeros((n_docs, len(vocab)))
    for doc_idx, doc in enumerate(tokenized):
        # Term frequency
        tf = Counter(doc)
        n_terms = len(doc)
        for word, count in tf.items():
            col_idx = word_to_idx[word]
            # TF: count / total terms in doc
            tf_val = count / n_terms
            # IDF: log(N / df) + 1 (smoothed)
            idf_val = math.log(n_docs / df[word]) + 1
            tfidf_matrix[doc_idx, col_idx] = tf_val * idf_val
    return tfidf_matrix, vocab

# Test
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends"
]
matrix, vocab = compute_tfidf(docs)
print(f"Vocab: {vocab}")
print(f"TF-IDF shape: {matrix.shape}")
print(f"TF-IDF for doc 0:\n{dict(zip(vocab, matrix[0].round(3)))}")
```
Problem 9: Detect Data Leakage (15 minutes)
```python
# BUGGY CODE - Find the leakage!
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, 1000)

# Bug 1: Scaling BEFORE splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # LEAKAGE: test info in train
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2
)

# Bug 2: Feature selection using ALL data
correlations = np.abs([np.corrcoef(X_scaled[:, i], y)[0, 1]
                       for i in range(10)])
top_features = np.argsort(-correlations)[:5]  # LEAKAGE

model = LogisticRegression()
model.fit(X_train[:, top_features], y_train)
score = model.score(X_test[:, top_features], y_test)
print(f"Test accuracy: {score:.2%}")
```
Solution:
```python
# FIXED CODE
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, 1000)

# Fix 1: Split FIRST, then scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_test_scaled = scaler.transform(X_test)        # transform only!

# Fix 2: Feature selection using ONLY training data
correlations = np.abs([np.corrcoef(X_train_scaled[:, i], y_train)[0, 1]
                       for i in range(10)])
top_features = np.argsort(-correlations)[:5]

model = LogisticRegression()
model.fit(X_train_scaled[:, top_features], y_train)
score = model.score(X_test_scaled[:, top_features], y_test)
print(f"Test accuracy: {score:.2%}")

# Why this matters:
# Bug 1: StandardScaler learned mean/std from ALL data including test.
#        The training data's scaling is influenced by test samples.
# Bug 2: Feature selection used correlations from ALL data.
#        Features were chosen knowing test set patterns.
# Both leaks make test performance artificially high.
```
Problem 10: Full Pipeline - End to End (30 minutes)
Build an end-to-end pipeline: synthetic data with missing values and a categorical feature, imputation, one-hot encoding, train-test split, scaling fit on the training set only, logistic regression from scratch, and evaluation with accuracy, precision, recall, and F1.
Solution:
```python
import numpy as np

# Step 1: Create synthetic dataset
np.random.seed(42)
n = 300
X_numeric = np.random.randn(n, 3)
X_numeric[np.random.rand(n, 3) < 0.1] = np.nan  # Add missing values
categories = np.random.choice(['A', 'B', 'C'], n)
# Labels depend on feature 0 so the model has real signal to learn
y = (np.nan_to_num(X_numeric[:, 0]) + 0.5 * np.random.randn(n) > 0).astype(int)

# Step 2: Handle missing values (mean imputation)
def impute_mean(X):
    X_clean = X.copy()
    for col in range(X_clean.shape[1]):
        mask = np.isnan(X_clean[:, col])
        if mask.any():
            col_mean = np.nanmean(X_clean[:, col])
            X_clean[mask, col] = col_mean
    return X_clean

X_imputed = impute_mean(X_numeric)

# Step 3: One-hot encode categoricals
unique_cats = sorted(set(categories))
one_hot = np.zeros((n, len(unique_cats)))
for i, cat in enumerate(categories):
    one_hot[i, unique_cats.index(cat)] = 1

# Combine
X_full = np.hstack([X_imputed, one_hot])

# Step 4: Train-test split
split = int(0.8 * n)
indices = np.random.permutation(n)
X_train, X_test = X_full[indices[:split]], X_full[indices[split:]]
y_train, y_test = y[indices[:split]], y[indices[split:]]

# Step 5: Normalize (fit on train only!)
train_mean = np.mean(X_train, axis=0)
train_std = np.std(X_train, axis=0)
train_std[train_std == 0] = 1.0
X_train_norm = (X_train - train_mean) / train_std
X_test_norm = (X_test - train_mean) / train_std

# Step 6: Train logistic regression from scratch
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

weights = np.zeros(X_train_norm.shape[1])
bias = 0.0
lr = 0.1
for _ in range(1000):
    z = X_train_norm @ weights + bias
    pred = sigmoid(z)
    dw = (1 / len(X_train_norm)) * (X_train_norm.T @ (pred - y_train))
    db = (1 / len(X_train_norm)) * np.sum(pred - y_train)
    weights -= lr * dw
    bias -= lr * db

# Step 7: Evaluate
y_pred = (sigmoid(X_test_norm @ weights + bias) >= 0.5).astype(int)
tp = np.sum((y_pred == 1) & (y_test == 1))
fp = np.sum((y_pred == 1) & (y_test == 0))
fn = np.sum((y_pred == 0) & (y_test == 1))
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
accuracy = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy:.2%}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```
Debugging ML Code: Common Bugs
Interviewers sometimes give you broken code and ask you to find the bug. Here are the most common ML code bugs:
| Bug | Symptom | Fix |
|---|---|---|
| Shape mismatch in matrix multiply | ValueError: shapes not aligned | Print shapes before every @ operation |
| Forgetting to reshape y | Loss stays constant or NaN | Ensure y shape matches output: y.reshape(-1, 1) |
| Wrong sign in gradient | Loss increases each iteration | Gradient should be X.T @ (pred - y), not X.T @ (y - pred) |
| Missing bias update | Model has systematic offset | Always update both weights and bias |
| Learning rate too high | Loss oscillates or explodes to NaN | Start with 0.001 and increase gradually |
| Not normalizing features | Gradient descent converges very slowly | Standardize features before training |
| Data leakage | Train accuracy high, test accuracy much lower | Fit preprocessing on train only, transform test |
| Integer division in Python 2 style | Gradients are all zero | Use 1.0 / n or float() conversions |
| Not clipping sigmoid input | RuntimeWarning: overflow in exp | Use np.clip(z, -500, 500) |
| Forgetting epsilon in log | RuntimeWarning: divide by zero in log | Use np.log(x + 1e-15) |
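To make the wrong-gradient-sign symptom from the table concrete, here is a minimal sketch on synthetic one-feature data (the data, the single-weight model, and the learning rate are illustrative choices): with the correct sign the loss falls each iteration, while the flipped sign makes it climb.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

def train(sign):
    """Gradient descent on MSE; sign=-1 simulates the flipped-gradient bug."""
    w = 0.0
    losses = []
    for _ in range(20):
        pred = X[:, 0] * w
        losses.append(np.mean((pred - y) ** 2))
        grad = sign * np.mean(X[:, 0] * (pred - y))  # correct gradient has sign=+1
        w -= 0.1 * grad
    return losses

good, bad = train(+1), train(-1)
print(f"correct sign: {good[0]:.3f} -> {good[-1]:.3f}")  # decreasing
print(f"flipped sign: {bad[0]:.3f} -> {bad[-1]:.3f}")    # increasing
```

Watching whether the first few loss values go down is the quickest way to catch this bug in an interview.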
Frequently Asked Questions
These are the most common non-coding questions asked during ML coding interviews.
What is the difference between bagging and boosting?
Bagging (Bootstrap Aggregating) trains multiple models independently on random subsets of data (with replacement) and averages their predictions. It reduces variance. Example: Random Forest.
Boosting trains models sequentially, where each new model focuses on the mistakes of the previous ones. It reduces bias. Example: XGBoost, AdaBoost, Gradient Boosting.
Key difference: Bagging models are independent (can be parallelized). Boosting models are sequential (must be trained in order).
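The contrast can be shown in a few lines of numpy. This is a sketch, not a production ensemble: the regression stump in `fit_stump`, the 50 rounds, and the 0.1 shrinkage are all illustrative choices. Bagging averages independent stumps fit to bootstrap resamples; boosting fits each stump to the residuals of the running prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Weak learner: best single-threshold split on feature 0 (squared error)."""
    best, best_err = (0.0, y.mean(), y.mean()), np.inf
    for t in np.unique(X[:, 0]):
        left, right = y[X[:, 0] <= t], y[X[:, 0] > t]
        if len(left) == 0 or len(right) == 0:
            continue
        err = left.var() * len(left) + right.var() * len(right)
        if err < best_err:
            best_err, best = err, (t, left.mean(), right.mean())
    return best

def predict_stump(stump, X):
    t, lo, hi = stump
    return np.where(X[:, 0] <= t, lo, hi)

X = rng.normal(size=(200, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=200)

# Bagging: independent stumps on bootstrap resamples, predictions averaged
stumps = [fit_stump(X[idx], y[idx])
          for idx in (rng.integers(0, len(X), len(X)) for _ in range(50))]
bagged = np.mean([predict_stump(s, X) for s in stumps], axis=0)

# Boosting: sequential stumps, each fit to the current residuals
pred = np.zeros(len(X))
for _ in range(50):
    s = fit_stump(X, y - pred)
    pred += 0.1 * predict_stump(s, X)  # shrinkage, as in gradient boosting

print(f"bagging MSE:  {np.mean((bagged - y) ** 2):.3f}")
print(f"boosting MSE: {np.mean((pred - y) ** 2):.3f}")
```

Note that the bagged stumps can be fit in any order (or in parallel), while each boosting round depends on the previous one.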
How do you handle class imbalance?
Five main strategies:
- Oversampling the minority class (SMOTE, random oversampling)
- Undersampling the majority class
- Class weights in the loss function (weight rare class higher)
- Threshold tuning (lower the classification threshold for the rare class)
- Use appropriate metrics (F1, AUC-ROC, precision-recall AUC instead of accuracy)
Most important: never use accuracy as your metric for imbalanced data. A model that always predicts the majority class gets 99% accuracy on a 99/1 split but is useless.
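The class-weights strategy can be sketched with the logistic regression loop used throughout this guide. Everything here is a toy setup (the 5% positive rate, the 2-sigma class shift, the "balanced" weighting heuristic): per-sample weights simply scale each sample's contribution to the gradient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Imbalanced toy data: ~5% positives, shifted so the classes are separable-ish
n = 1000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_logistic(X, y, class_weight=None, lr=0.1, n_iter=500):
    """class_weight maps label -> loss weight; None means uniform."""
    w, b = np.zeros(X.shape[1]), 0.0
    sw = np.ones(len(y)) if class_weight is None else np.array([class_weight[c] for c in y])
    sw = sw / sw.sum()  # normalize so the step size is comparable in both runs
    for _ in range(n_iter):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ (sw * err))
        b -= lr * np.sum(sw * err)
    return w, b

# "Balanced" heuristic: weight each class inversely to its frequency
balanced = {0: n / (2 * np.sum(y == 0)), 1: n / (2 * np.sum(y == 1))}
w0, b0 = fit_logistic(X, y)
w1, b1 = fit_logistic(X, y, class_weight=balanced)

def recall(w, b):
    return np.mean(sigmoid(X[y == 1] @ w + b) >= 0.5)

print(f"Recall unweighted: {recall(w0, b0):.2f}, class-weighted: {recall(w1, b1):.2f}")
```

Up-weighting the rare class pushes the decision boundary toward the majority class, trading some precision for recall on the minority class.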
What is the bias-variance tradeoff?
Bias: error from overly simplistic assumptions. High bias = underfitting (model too simple).
Variance: error from sensitivity to training data fluctuations. High variance = overfitting (model too complex).
The tradeoff: as model complexity increases, bias decreases but variance increases. The sweet spot is the complexity that minimizes total error (bias^2 + variance + irreducible noise).
Practical examples: Linear regression has high bias, low variance. A deep decision tree has low bias, high variance. Random forests reduce variance through bagging while maintaining low bias.
When would you use L1 vs L2 regularization?
L1 (Lasso): Pushes some weights exactly to zero, effectively performing feature selection. Use when you suspect many features are irrelevant and want a sparse model.
L2 (Ridge): Shrinks all weights toward zero but never exactly zero. Use when all features are potentially relevant and you want to prevent any single feature from dominating.
Elastic Net: Combines L1 and L2. Use when you want some feature selection but also want to handle correlated features (L1 alone picks one arbitrarily).
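The sparsity difference is easy to demonstrate. The sketch below uses proximal gradient descent (ISTA): the L2 penalty enters the gradient directly, while the L1 penalty is applied as a soft-threshold step; the data, penalties, and learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -3.0, 1.5]  # only the first 3 features matter
y = X @ true_w + rng.normal(scale=0.1, size=n)

def fit(X, y, l1=0.0, l2=0.0, lr=0.01, n_iter=2000):
    """Gradient descent; L2 in the gradient, L1 via a soft-threshold (ISTA) step."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / len(y) + l2 * w
        w -= lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)  # proximal step
    return w

w_l1 = fit(X, y, l1=0.5)
w_l2 = fit(X, y, l2=0.5)
print("L1 exact zeros:", np.sum(np.abs(w_l1) < 1e-8))  # sparse
print("L2 exact zeros:", np.sum(np.abs(w_l2) < 1e-8))  # shrunk, but nonzero
```

L1 drives the irrelevant weights exactly to zero while keeping the three true features; L2 only shrinks everything toward zero.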
How do you detect and prevent overfitting?
Detection:
- Training accuracy much higher than validation accuracy
- Validation loss increases while training loss decreases (divergence)
- Model performs well on training data but poorly on new data
Prevention:
- More training data (the single best fix)
- Regularization (L1, L2, dropout)
- Early stopping (stop training when validation loss stops improving)
- Cross-validation for reliable performance estimates
- Simpler model (fewer features, shallower tree, smaller network)
- Data augmentation (for images, text)
- Ensemble methods (bagging reduces variance)
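Early stopping in particular is simple to sketch. In the toy run below (a deliberately overfit-prone degree-9 polynomial fit by gradient descent; the noiseless true function stands in for a held-out validation set, and the patience of 50 steps is an arbitrary choice), we keep the weights from the step with the best validation loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overfit-prone setup: 15 noisy samples, degree-9 polynomial features
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=15)
x_val = np.linspace(0.02, 0.98, 50)
y_val = np.sin(2 * np.pi * x_val)  # noiseless stand-in for validation data

Phi = np.vander(x, 10)       # training design matrix (degree 9)
Phi_val = np.vander(x_val, 10)

w = np.zeros(10)
best_w, best_val = w.copy(), np.inf
patience, bad = 50, 0
for step in range(20000):
    w -= 0.5 * Phi.T @ (Phi @ w - y) / len(y)  # plain gradient descent
    val_loss = np.mean((Phi_val @ w - y_val) ** 2)
    if val_loss < best_val:
        best_val, best_w, bad = val_loss, w.copy(), 0
    else:
        bad += 1
        if bad >= patience:  # no improvement for `patience` steps: stop
            break
print(f"stopped at step {step}, best validation MSE {best_val:.4f}")
```

The returned model is `best_w`, not the final `w` — training error keeps falling after the stopping point, but validation error does not.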
Explain the curse of dimensionality.
As the number of features (dimensions) increases, the volume of the feature space grows exponentially. This has several consequences:
- Data becomes sparse: points are far from each other in high dimensions
- Distance metrics become meaningless: in high dimensions, the difference between the nearest and farthest neighbor shrinks
- More data needed: to maintain the same density of points, you need exponentially more samples
- KNN degrades: with many irrelevant features, nearest neighbors are dominated by noise dimensions
Solutions: dimensionality reduction (PCA, t-SNE), feature selection, regularization, domain knowledge to select relevant features.
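The distance-concentration effect is easy to see numerically. This sketch draws random points in the unit hypercube and measures the "relative contrast" between the farthest and nearest neighbor of a random query (the dimensions and point count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# As dimension grows, the gap between nearest and farthest neighbor shrinks
contrasts = {}
for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))  # 1000 random points in the unit hypercube
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    contrasts[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast={contrasts[d]:.3f}")
```

In 2D the nearest neighbor is dramatically closer than the farthest; in 1000D all points sit at nearly the same distance, which is why KNN degrades.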
What is gradient descent, and what are its variants?
Gradient descent is an optimization algorithm that iteratively adjusts parameters in the direction of steepest descent of the loss function.
Batch GD: Uses all training samples per update. Smooth convergence but slow for large datasets. Memory intensive.
Stochastic GD (SGD): Uses one sample per update. Very fast but noisy. Can escape local minima due to noise.
Mini-batch GD: Uses a subset (32–256 samples) per update. Best of both worlds. Standard in practice.
Advanced optimizers: Adam (adaptive learning rates per parameter), RMSprop (running average of squared gradients), Adagrad (accumulates past gradients). Adam is the most common default in deep learning.
How would you handle a feature with 1000 categories?
One-hot encoding would create 1000 columns, which is impractical. Better approaches:
- Target encoding: Replace each category with the mean of the target variable (with smoothing to prevent overfitting)
- Frequency encoding: Replace each category with its count or frequency in the dataset
- Embedding: Learn a dense vector representation (common in deep learning)
- Group rare categories: Combine categories that appear fewer than N times into an “other” category
- Hashing trick: Hash categories into a fixed number of buckets (some collisions but bounded dimensionality)
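Frequency and smoothed target encoding fit in a few lines of numpy. The synthetic data below (1000 categories, a per-category effect on a binary target, smoothing strength m=10) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy high-cardinality feature: 1000 categories, target depends on category
n_cats, n = 1000, 20000
cats = rng.integers(0, n_cats, n)
cat_effect = rng.normal(size=n_cats)
y = (cat_effect[cats] + rng.normal(size=n) > 0).astype(int)

# Frequency encoding: category -> how often it appears
counts = np.bincount(cats, minlength=n_cats)
freq_enc = counts[cats] / n

# Target encoding, smoothed toward the global mean: rare categories are
# pulled toward the prior, which limits overfitting; m sets the strength
global_mean = y.mean()
sums = np.bincount(cats, weights=y, minlength=n_cats)
m = 10.0
target_enc_per_cat = (sums + m * global_mean) / (counts + m)
target_enc = target_enc_per_cat[cats]
print(freq_enc[:5], target_enc[:5])
```

In practice the target encoding must be computed on the training fold only (ideally with out-of-fold estimates), for exactly the leakage reasons covered in Problem 9.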
What is the difference between generative and discriminative models?
Discriminative models learn the decision boundary P(y|X) directly. Examples: logistic regression, SVM, neural networks, decision trees. They answer: “Given these features, which class?”
Generative models learn the joint distribution P(X, y) = P(X|y) * P(y). Examples: Naive Bayes, Gaussian Mixture Models, Hidden Markov Models. They can also generate new data samples.
Discriminative models generally achieve better classification accuracy with enough data. Generative models are useful when you have limited data, need to handle missing features, or want to generate synthetic data.
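A minimal generative example is Gaussian Naive Bayes: it estimates P(X|y) (per-feature Gaussians, the "naive" independence assumption) and P(y), then classifies via Bayes' rule. The two-blob dataset below is an illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes centered at -1 and +1 in 2D
X0 = rng.normal(loc=-1.0, size=(500, 2))
X1 = rng.normal(loc=+1.0, size=(500, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

means = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
stds = np.array([X[y == c].std(axis=0) for c in (0, 1)])
priors = np.array([np.mean(y == c) for c in (0, 1)])

def log_likelihood(X, c):
    # Sum of per-feature Gaussian log densities (naive independence assumption)
    return np.sum(-0.5 * ((X - means[c]) / stds[c]) ** 2 - np.log(stds[c]), axis=1)

# Bayes' rule: P(y|X) proportional to P(X|y) * P(y)
scores = np.stack([log_likelihood(X, c) + np.log(priors[c]) for c in (0, 1)], axis=1)
pred = np.argmax(scores, axis=1)
print(f"Train accuracy: {np.mean(pred == y):.2f}")

# Generative bonus: sample new points from the learned class-1 model
new_points = rng.normal(means[1], stds[1], size=(3, 2))
```

The last line is the generative payoff: because the model represents P(X|y), it can synthesize new samples, which a discriminative model cannot do.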
How do you choose between precision and recall as your optimization target?
Ask: “What is the cost of each type of error?”
Optimize Precision when false positives are expensive:
- Email spam filter (marking good emails as spam loses important messages)
- Content recommendation (showing irrelevant content annoys users)
- Drug approval (approving an ineffective drug wastes resources)
Optimize Recall when false negatives are dangerous:
- Cancer screening (missing a positive case can be fatal)
- Fraud detection (missing fraud means financial loss)
- Security threats (missing a threat has severe consequences)
When both matter equally, optimize F1 score (harmonic mean of precision and recall).
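The tradeoff is just threshold motion, which a quick sketch on synthetic classifier scores makes visible (the score distribution and thresholds are arbitrary): raising the threshold buys precision at the cost of recall.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scores from a hypothetical classifier: positives score higher on average
y = rng.integers(0, 2, 2000)
scores = y + rng.normal(scale=0.8, size=2000)

prec, rec = {}, {}
for t in [0.2, 0.5, 0.8]:
    pred = (scores >= t).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    prec[t] = tp / (tp + fp)
    rec[t] = tp / (tp + fn)
    print(f"threshold={t}: precision={prec[t]:.2f}  recall={rec[t]:.2f}")
```

Sweeping the threshold like this and plotting the pairs gives the precision-recall curve.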
Final Checklist Before Your Interview
Algorithms to Code
- Linear Regression (gradient descent)
- Logistic Regression (with sigmoid)
- Decision Tree (Gini + entropy)
- K-Means (with K-Means++)
- KNN (with distance weighting)
- 2-layer Neural Network
Metrics to Implement
- MSE and RMSE
- Confusion matrix
- Precision, Recall, F1
- AUC-ROC
- K-fold cross-validation
- R-squared score
Data Skills
- Handle missing values
- One-hot and target encoding
- StandardScaler and MinMaxScaler
- Feature engineering (RFM)
- Outlier detection (IQR, z-score)
- Train-test split
Concepts to Explain
- Bias-variance tradeoff
- Overfitting and regularization
- Bagging vs boosting
- Data leakage
- Class imbalance strategies
- Curse of dimensionality
Lilly Tech Systems