Advanced

Practice Problems & Tips

Put everything together with 10 timed practice problems that mirror real interview questions, learn to find and fix common ML code bugs, and get answers to the most frequently asked interview questions.

10 Timed Practice Problems

Set a timer for each problem. The time limit reflects what you would get in a real interview. Try solving each problem before revealing the solution.

💡
How to practice: Open a blank Python file or Jupyter notebook. Set a timer. Write the solution without looking at the answer. Only check the solution after your timer expires. This builds the muscle memory you need for real interviews.

Problem 1: Normalize a Dataset (10 minutes)

📝
Question: Write a function that takes a 2D NumPy array and normalizes each feature (column) to have zero mean and unit variance. Do not use sklearn. Handle the case where a feature has zero variance.
Reveal Solution
import numpy as np

def normalize(X):
    """Normalize features to zero mean and unit variance."""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    std[std == 0] = 1.0  # Avoid division by zero
    return (X - mean) / std

# Test
X = np.array([[1, 200], [2, 400], [3, 100], [4, 300], [5, 500]])
X_norm = normalize(X.astype(float))
print(f"Means: {np.mean(X_norm, axis=0)}")   # [0, 0]
print(f"Stds:  {np.std(X_norm, axis=0)}")     # [1, 1]

Problem 2: Train-Test Split (10 minutes)

📝
Question: Implement train_test_split(X, y, test_size=0.2, random_state=42) from scratch. Ensure X and y are shuffled together and the split respects the test_size ratio.
Reveal Solution
def train_test_split(X, y, test_size=0.2, random_state=None):
    """Split data into training and test sets."""
    n = len(X)
    indices = np.arange(n)

    if random_state is not None:
        rng = np.random.RandomState(random_state)
        rng.shuffle(indices)
    else:
        np.random.shuffle(indices)

    split_idx = int(n * (1 - test_size))

    X_train = X[indices[:split_idx]]
    X_test = X[indices[split_idx:]]
    y_train = y[indices[:split_idx]]
    y_test = y[indices[split_idx:]]

    return X_train, X_test, y_train, y_test

# Test
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Train: {len(X_tr)}, Test: {len(X_te)}")  # 40, 10

Problem 3: Softmax Function (10 minutes)

📝
Question: Implement a numerically stable softmax function that works on both 1D and 2D arrays. Each row should sum to 1 for 2D inputs. Explain what makes it numerically stable.
Reveal Solution
def softmax(z):
    """Numerically stable softmax for 1D or 2D input."""
    z = np.asarray(z, dtype=float)
    one_dim = (z.ndim == 1)
    z = np.atleast_2d(z)
    # Subtract the row max for numerical stability
    # This prevents exp() overflow without changing the result
    z_shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(z_shifted)
    out = exp_z / np.sum(exp_z, axis=1, keepdims=True)
    return out[0] if one_dim else out  # Preserve the input's dimensionality

# Test
logits = np.array([[1000, 1001, 1002]])  # Would overflow without stability trick
print(softmax(logits))  # Should work, sum to 1

logits_2d = np.array([[1, 2, 3], [4, 5, 6]])
result = softmax(logits_2d)
print(result)
print(f"Row sums: {np.sum(result, axis=1)}")  # [1.0, 1.0]

Why stable: Subtracting the max prevents exp() from producing infinity. Mathematically: softmax(z - c) = softmax(z) for any constant c.

Problem 4: Gradient Checking (15 minutes)

📝
Question: Implement numerical gradient checking to verify your analytical gradients. Write a function that computes the gradient numerically using finite differences and compares it to the analytical gradient.
Reveal Solution
def numerical_gradient(f, x, eps=1e-5):
    """
    Compute numerical gradient using central differences.
    f: function that takes x and returns a scalar loss
    x: numpy array of parameters
    """
    grad = np.zeros_like(x)

    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += eps
        x_minus[i] -= eps

        grad[i] = (f(x_plus) - f(x_minus)) / (2 * eps)

    return grad


def gradient_check(analytical_grad, f, x, eps=1e-5, threshold=1e-5):
    """
    Compare analytical and numerical gradients.
    Returns True if they match within threshold.
    """
    num_grad = numerical_gradient(f, x, eps)

    # Relative difference
    diff = np.linalg.norm(analytical_grad - num_grad) / \
           (np.linalg.norm(analytical_grad) + np.linalg.norm(num_grad) + 1e-15)

    passed = diff < threshold
    print(f"Analytical: {analytical_grad}")
    print(f"Numerical:  {num_grad}")
    print(f"Relative difference: {diff:.2e}")
    print(f"Gradient check: {'PASSED' if passed else 'FAILED'}")

    return passed


# Test: verify gradient of f(w) = w^2 (gradient should be 2w)
def loss_fn(w):
    return np.sum(w ** 2)

w = np.array([3.0, -2.0, 5.0])
analytical = 2 * w
gradient_check(analytical, loss_fn, w)  # Should pass

Problem 5: Implement Batch Gradient Descent vs Mini-Batch (15 minutes)

📝
Question: Modify the linear regression implementation to support mini-batch gradient descent. Accept a batch_size parameter. Explain the tradeoffs between batch, mini-batch, and stochastic gradient descent.
Reveal Solution
class LinearRegressionMiniBatch:
    def __init__(self, lr=0.01, n_iter=100, batch_size=32):
        self.lr = lr
        self.n_iter = n_iter
        self.batch_size = batch_size
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for epoch in range(self.n_iter):
            # Shuffle data at each epoch
            indices = np.random.permutation(n_samples)
            X_shuffled = X[indices]
            y_shuffled = y[indices]

            # Process mini-batches
            for start in range(0, n_samples, self.batch_size):
                end = min(start + self.batch_size, n_samples)
                X_batch = X_shuffled[start:end]
                y_batch = y_shuffled[start:end]
                batch_n = len(X_batch)

                y_pred = X_batch @ self.weights + self.bias
                dw = (1 / batch_n) * (X_batch.T @ (y_pred - y_batch))
                db = (1 / batch_n) * np.sum(y_pred - y_batch)

                self.weights -= self.lr * dw
                self.bias -= self.lr * db

        return self

    def predict(self, X):
        return X @ self.weights + self.bias

# batch_size = n_samples -> Batch GD (smooth but slow)
# batch_size = 1 -> Stochastic GD (noisy but fast)
# batch_size = 32 or 64 -> Mini-batch (balanced)

X = 2 * np.random.rand(200, 1)
y = 4 + 3 * X.squeeze() + np.random.randn(200) * 0.5

model = LinearRegressionMiniBatch(lr=0.1, n_iter=50, batch_size=32)
model.fit(X, y)
print(f"Weight: {model.weights[0]:.4f}, Bias: {model.bias:.4f}")

Problem 6: Implement Polynomial Features (10 minutes)

📝
Question: Write a function that generates polynomial features up to a given degree. For input [x1, x2] and degree 2, generate [1, x1, x2, x1^2, x1*x2, x2^2].
Reveal Solution
from itertools import combinations_with_replacement

def polynomial_features(X, degree=2):
    """
    Generate polynomial features up to the given degree.
    Includes bias term (column of 1s) and interaction terms.
    """
    n_samples, n_features = X.shape
    features = [np.ones(n_samples)]  # Bias term

    # Generate all combinations with replacement
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(n_features), d):
            feature = np.ones(n_samples)
            for idx in combo:
                feature *= X[:, idx]
            features.append(feature)

    return np.column_stack(features)

# Test
X = np.array([[2, 3], [4, 5]])
X_poly = polynomial_features(X, degree=2)
print(f"Shape: {X_poly.shape}")  # (2, 6)
print(f"Features for [2,3]: {X_poly[0]}")
# Should be [1, 2, 3, 4, 6, 9] = [1, x1, x2, x1^2, x1*x2, x2^2]

Problem 7: Implement Cosine Similarity Search (15 minutes)

📝
Question: Given a query vector and a matrix of document embeddings, return the top-K most similar documents by cosine similarity. This is a common question at companies working with search or recommendations.
Reveal Solution
def cosine_similarity_search(query, documents, top_k=5):
    """
    Find the top-K most similar documents to the query.

    Args:
        query: 1D array of shape (d,)
        documents: 2D array of shape (n_docs, d)
        top_k: number of results to return

    Returns:
        indices: indices of top-K documents
        similarities: cosine similarity scores
    """
    # Normalize query
    query_norm = query / (np.linalg.norm(query) + 1e-15)

    # Normalize documents (each row)
    doc_norms = np.linalg.norm(documents, axis=1, keepdims=True)
    doc_norms = np.maximum(doc_norms, 1e-15)  # Avoid division by zero
    docs_normalized = documents / doc_norms

    # Cosine similarity = dot product of normalized vectors
    similarities = docs_normalized @ query_norm

    # Get top-K indices (descending)
    top_indices = np.argsort(-similarities)[:top_k]

    return top_indices, similarities[top_indices]

# Test
np.random.seed(42)
query = np.array([1.0, 0.0, 1.0, 0.0])
documents = np.random.randn(100, 4)
documents[42] = query * 5  # Make doc 42 very similar

indices, scores = cosine_similarity_search(query, documents, top_k=3)
print(f"Top indices: {indices}")  # 42 should be first
print(f"Scores: {scores}")

Problem 8: Implement TF-IDF (20 minutes)

📝
Question: Implement TF-IDF (Term Frequency-Inverse Document Frequency) from scratch. Given a list of documents (strings), compute the TF-IDF matrix.
Reveal Solution
from collections import Counter
import math

def compute_tfidf(documents):
    """
    Compute TF-IDF matrix from scratch.

    Args:
        documents: list of strings

    Returns:
        tfidf_matrix: 2D numpy array (n_docs, n_vocab)
        vocab: list of terms
    """
    n_docs = len(documents)

    # Tokenize
    tokenized = [doc.lower().split() for doc in documents]

    # Build vocabulary
    vocab = sorted(set(word for doc in tokenized for word in doc))
    word_to_idx = {word: i for i, word in enumerate(vocab)}

    # Compute document frequency (DF) for each term
    df = Counter()
    for doc in tokenized:
        unique_words = set(doc)
        for word in unique_words:
            df[word] += 1

    # Compute TF-IDF
    tfidf_matrix = np.zeros((n_docs, len(vocab)))

    for doc_idx, doc in enumerate(tokenized):
        # Term frequency
        tf = Counter(doc)
        n_terms = len(doc)

        for word, count in tf.items():
            col_idx = word_to_idx[word]
            # TF: count / total terms in doc
            tf_val = count / n_terms
            # IDF: log(N / df) + 1 (smoothed)
            idf_val = math.log(n_docs / df[word]) + 1
            tfidf_matrix[doc_idx, col_idx] = tf_val * idf_val

    return tfidf_matrix, vocab

# Test
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends"
]
matrix, vocab = compute_tfidf(docs)
print(f"Vocab: {vocab}")
print(f"TF-IDF shape: {matrix.shape}")
print(f"TF-IDF for doc 0:\n{dict(zip(vocab, matrix[0].round(3)))}")

Problem 9: Detect Data Leakage (15 minutes)

📝
Question: The following ML pipeline has a data leakage bug. Find and fix it. Explain why this leakage would cause overly optimistic performance estimates.
# BUGGY CODE - Find the leakage!
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, 1000)

# Bug 1: Scaling BEFORE splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # LEAKAGE: test info in train

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2
)

# Bug 2: Feature selection using ALL data
correlations = np.abs([np.corrcoef(X_scaled[:, i], y)[0,1]
                       for i in range(10)])
top_features = np.argsort(-correlations)[:5]  # LEAKAGE

model = LogisticRegression()
model.fit(X_train[:, top_features], y_train)
score = model.score(X_test[:, top_features], y_test)
print(f"Test accuracy: {score:.2%}")
Reveal Solution
# FIXED CODE
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, 1000)

# Fix 1: Split FIRST, then scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform
X_test_scaled = scaler.transform(X_test)          # transform only!

# Fix 2: Feature selection using ONLY training data
correlations = np.abs([np.corrcoef(X_train_scaled[:, i], y_train)[0,1]
                       for i in range(10)])
top_features = np.argsort(-correlations)[:5]

model = LogisticRegression()
model.fit(X_train_scaled[:, top_features], y_train)
score = model.score(X_test_scaled[:, top_features], y_test)
print(f"Test accuracy: {score:.2%}")

# Why this matters:
# Bug 1: StandardScaler learned mean/std from ALL data including test.
#         The training data's scaling is influenced by test samples.
# Bug 2: Feature selection used correlations from ALL data.
#         Features were chosen knowing test set patterns.
# Both leaks make test performance artificially high.

Problem 10: Full Pipeline - End to End (30 minutes)

📝
Question: Build a complete ML pipeline from scratch: load data, handle missing values, encode categoricals, normalize features, train a logistic regression model, and evaluate with precision/recall/F1. Do not use sklearn for the model or evaluation — only for loading data.
Reveal Solution
import numpy as np

# Step 1: Create synthetic dataset
np.random.seed(42)
n = 300
X_numeric = np.random.randn(n, 3)
X_numeric[np.random.rand(n, 3) < 0.1] = np.nan  # Add missing values

categories = np.random.choice(['A', 'B', 'C'], n)
# Labels depend on the first feature (NaNs treated as 0 so every label is defined)
y = (np.nan_to_num(X_numeric[:, 0]) > 0).astype(int)

# Step 2: Handle missing values (mean imputation)
def impute_mean(X):
    X_clean = X.copy()
    for col in range(X_clean.shape[1]):
        mask = np.isnan(X_clean[:, col])
        if mask.any():
            col_mean = np.nanmean(X_clean[:, col])
            X_clean[mask, col] = col_mean
    return X_clean

X_imputed = impute_mean(X_numeric)

# Step 3: One-hot encode categoricals
unique_cats = sorted(set(categories))
one_hot = np.zeros((n, len(unique_cats)))
for i, cat in enumerate(categories):
    one_hot[i, unique_cats.index(cat)] = 1

# Combine
X_full = np.hstack([X_imputed, one_hot])

# Step 4: Train-test split
split = int(0.8 * n)
indices = np.random.permutation(n)
X_train, X_test = X_full[indices[:split]], X_full[indices[split:]]
y_train, y_test = y[indices[:split]], y[indices[split:]]

# Step 5: Normalize (fit on train only!)
train_mean = np.mean(X_train, axis=0)
train_std = np.std(X_train, axis=0)
train_std[train_std == 0] = 1.0
X_train_norm = (X_train - train_mean) / train_std
X_test_norm = (X_test - train_mean) / train_std

# Step 6: Train logistic regression from scratch
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

weights = np.zeros(X_train_norm.shape[1])
bias = 0.0
lr = 0.1

for _ in range(1000):
    z = X_train_norm @ weights + bias
    pred = sigmoid(z)
    dw = (1 / len(X_train_norm)) * (X_train_norm.T @ (pred - y_train))
    db = (1 / len(X_train_norm)) * np.sum(pred - y_train)
    weights -= lr * dw
    bias -= lr * db

# Step 7: Evaluate
y_pred = (sigmoid(X_test_norm @ weights + bias) >= 0.5).astype(int)
tp = np.sum((y_pred == 1) & (y_test == 1))
fp = np.sum((y_pred == 1) & (y_test == 0))
fn = np.sum((y_pred == 0) & (y_test == 1))

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
accuracy = np.mean(y_pred == y_test)

print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")

Debugging ML Code: Common Bugs

Interviewers sometimes give you broken code and ask you to find the bug. Here are the most common ML code bugs:

  • Shape mismatch in matrix multiply. Symptom: ValueError: shapes not aligned. Fix: print shapes before every @ operation.
  • Forgetting to reshape y. Symptom: loss stays constant or NaN. Fix: ensure y's shape matches the output, e.g. y.reshape(-1, 1).
  • Wrong sign in gradient. Symptom: loss increases each iteration. Fix: the gradient should be X.T @ (pred - y), not X.T @ (y - pred).
  • Missing bias update. Symptom: model has a systematic offset. Fix: always update both weights and bias.
  • Learning rate too high. Symptom: loss oscillates or explodes to NaN. Fix: start with 0.001 and increase gradually.
  • Not normalizing features. Symptom: gradient descent converges very slowly. Fix: standardize features before training.
  • Data leakage. Symptom: train accuracy high, test accuracy much lower. Fix: fit preprocessing on train only, then transform test.
  • Integer division (Python 2 style). Symptom: gradients are all zero. Fix: use 1.0 / n or float() conversions.
  • Not clipping sigmoid input. Symptom: RuntimeWarning: overflow in exp. Fix: use np.clip(z, -500, 500).
  • Forgetting epsilon in log. Symptom: RuntimeWarning: divide by zero in log. Fix: use np.log(x + 1e-15).
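The wrong-sign gradient bug is easy to reproduce on a toy linear regression: flipping the sign turns gradient descent into gradient ascent, so the loss grows every iteration. A minimal sketch (data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 1)
y = 3 * X[:, 0]

def final_loss(sign, lr=0.1, steps=20):
    """Run gradient descent; sign=+1 is the correct gradient, sign=-1 the buggy one."""
    w = np.zeros(1)
    for _ in range(steps):
        pred = X @ w
        w -= lr * sign * (X.T @ (pred - y)) / len(y)
    return float(np.mean((X @ w - y) ** 2))

loss_correct = final_loss(+1)   # shrinks toward zero
loss_buggy = final_loss(-1)     # grows each step
```

Watching the loss for even a handful of iterations is usually enough to catch this bug.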

Frequently Asked Questions

Click each question to reveal the answer. These are the most common non-coding questions asked during ML coding interviews.

What is the difference between bagging and boosting?

Bagging (Bootstrap Aggregating) trains multiple models independently on random subsets of data (with replacement) and averages their predictions. It reduces variance. Example: Random Forest.

Boosting trains models sequentially, where each new model focuses on the mistakes of the previous ones. It reduces bias. Example: XGBoost, AdaBoost, Gradient Boosting.

Key difference: Bagging models are independent (can be parallelized). Boosting models are sequential (must be trained in order).
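To make the contrast concrete, here is a minimal bagging sketch in pure NumPy: depth-1 decision stumps (the `Stump` class and function names are illustrative helpers, not any library's API) trained on independent bootstrap samples and combined by majority vote.

```python
import numpy as np

class Stump:
    """Depth-1 decision tree: threshold one feature, predict a class on each side."""
    def fit(self, X, y):
        best_acc = -1.0
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                for lo, hi in ((0, 1), (1, 0)):
                    pred = np.where(X[:, f] > t, hi, lo)
                    acc = np.mean(pred == y)
                    if acc > best_acc:
                        best_acc = acc
                        self.f, self.t, self.lo, self.hi = f, t, lo, hi
        return self

    def predict(self, X):
        return np.where(X[:, self.f] > self.t, self.hi, self.lo)

def bagging_fit(X, y, n_models=25, seed=0):
    """Train each stump on an independent bootstrap sample (with replacement)."""
    rng = np.random.RandomState(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.randint(0, n, n)   # bootstrap sample
        models.append(Stump().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the independent models by majority vote."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)

# Usage
rng = np.random.RandomState(42)
X = rng.randn(200, 2)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0).astype(int)
models = bagging_fit(X, y)
acc = np.mean(bagging_predict(models, X) == y)
```

Because each stump trains independently, the loop over `n_models` could be parallelized; a boosting loop could not, since each model depends on the previous one's errors.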

How do you handle class imbalance?

Five main strategies:

  • Oversampling the minority class (SMOTE, random oversampling)
  • Undersampling the majority class
  • Class weights in the loss function (weight rare class higher)
  • Threshold tuning (lower the classification threshold for the rare class)
  • Use appropriate metrics (F1, AUC-ROC, precision-recall AUC instead of accuracy)

Most important: never use accuracy as your metric for imbalanced data. A model that always predicts the majority class gets 99% accuracy on a 99/1 split but is useless.
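Class weights are the easiest of these strategies to sketch. The snippet below uses the common inverse-frequency heuristic n / (n_classes * count) and plugs per-sample weights into the logistic-loss gradient; both function names are illustrative, and other weighting schemes exist.

```python
import numpy as np

def class_weights(y):
    """Inverse-frequency weights: n / (n_classes * count), so rare classes weigh more."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

def weighted_logistic_gradient(X, y, w, b, sample_w):
    """Gradient of the weighted log loss for logistic regression."""
    pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    err = sample_w * (pred - y)   # scale each sample's error by its weight
    return X.T @ err / sample_w.sum(), err.sum() / sample_w.sum()

# Usage on a roughly 90/10 imbalanced problem
rng = np.random.RandomState(0)
y = (rng.rand(1000) < 0.1).astype(int)
X = rng.randn(1000, 3)
cw = class_weights(y)    # minority class gets the larger weight
sample_w = cw[y]         # per-sample lookup
dw, db = weighted_logistic_gradient(X, y, np.zeros(3), 0.0, sample_w)
```

With this heuristic the total weight per class is equal, so a mistake on a minority sample costs as much as several mistakes on majority samples.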

What is the bias-variance tradeoff?

Bias: error from overly simplistic assumptions. High bias = underfitting (model too simple).

Variance: error from sensitivity to training data fluctuations. High variance = overfitting (model too complex).

The tradeoff: as model complexity increases, bias decreases but variance increases. The sweet spot is the complexity that minimizes total error (bias^2 + variance + irreducible noise).

Practical examples: Linear regression has high bias, low variance. A deep decision tree has low bias, high variance. Random forests reduce variance through bagging while maintaining low bias.
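A quick way to see the tradeoff is to fit polynomials of increasing degree to noisy data with np.polyfit; the degrees (1, 4, 15) and noise level here are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.3 * rng.randn(40)   # noisy samples of sin
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)               # noiseless ground truth

errors = {}
for degree in (1, 4, 15):
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
# degree 1: high bias (underfits both train and test)
# degree 15: chases the noise (high variance)
```

Comparing `errors[1]` and `errors[4]` shows the classic pattern: the too-simple model is bad everywhere, while a moderate degree does best on the held-out curve.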

When would you use L1 vs L2 regularization?

L1 (Lasso): Pushes some weights exactly to zero, effectively performing feature selection. Use when you suspect many features are irrelevant and want a sparse model.

L2 (Ridge): Shrinks all weights toward zero but never exactly zero. Use when all features are potentially relevant and you want to prevent any single feature from dominating.

Elastic Net: Combines L1 and L2. Use when you want some feature selection but also want to handle correlated features (L1 alone picks one arbitrarily).
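The sparsity difference is easy to demonstrate. Below is one update step under each penalty: plain shrinkage for L2, and a soft-thresholding (proximal gradient) step for L1, which can set a weight to exactly zero. The regularization strength and learning rate are arbitrary illustrative values.

```python
import numpy as np

def l2_step(w, grad, lr, lam):
    """Ridge: L2 penalty adds 2*lam*w to the gradient; weights shrink but stay nonzero."""
    return w - lr * (grad + 2 * lam * w)

def l1_step(w, grad, lr, lam):
    """Lasso via proximal gradient: plain step, then soft-threshold toward zero."""
    w = w - lr * grad
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

# Toy data: feature 2 is irrelevant to y
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = 3 * X[:, 0] - 2 * X[:, 1]

w_l1 = np.zeros(3)
w_l2 = np.zeros(3)
for _ in range(500):
    grad_l1 = X.T @ (X @ w_l1 - y) / len(y)
    grad_l2 = X.T @ (X @ w_l2 - y) / len(y)
    w_l1 = l1_step(w_l1, grad_l1, lr=0.1, lam=1.0)
    w_l2 = l2_step(w_l2, grad_l2, lr=0.1, lam=1.0)
# w_l1[2] lands on exactly 0; w_l2[2] stays small but nonzero
```

The soft-threshold is what makes L1 a feature selector: any weight whose gradient pull is weaker than the penalty gets clamped to zero, while L2 only scales weights down.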

How do you detect and prevent overfitting?

Detection:

  • Training accuracy much higher than validation accuracy
  • Validation loss increases while training loss decreases (divergence)
  • Model performs well on training data but poorly on new data

Prevention:

  • More training data (the single best fix)
  • Regularization (L1, L2, dropout)
  • Early stopping (stop training when validation loss stops improving)
  • Cross-validation for reliable performance estimates
  • Simpler model (fewer features, shallower tree, smaller network)
  • Data augmentation (for images, text)
  • Ensemble methods (bagging reduces variance)
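Early stopping is simple enough to sketch generically. The loop below takes two callbacks (`step_fn` and `val_loss_fn`, illustrative names) and stops once the validation loss has not improved for `patience` epochs; the simulated loss curve stands in for a real training run.

```python
import numpy as np

def train_with_early_stopping(step_fn, val_loss_fn, max_epochs=500, patience=10):
    """Run step_fn() each epoch; stop when val_loss_fn() stops improving."""
    best, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        step_fn()
        loss = val_loss_fn()
        if loss < best - 1e-6:            # meaningful improvement
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                          # patience exhausted
    return best, epoch

# Simulated validation curve: improves, then climbs again as the model overfits
losses = [1.0 / (i + 1) + 0.005 * i for i in range(100)]
it = iter(losses)
state = {}
def step(): state['loss'] = next(it)
def val(): return state['loss']

best, stopped_at = train_with_early_stopping(step, val, max_epochs=100, patience=5)
```

In a real setup you would also checkpoint the weights at `best_epoch` and restore them after stopping.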

Explain the curse of dimensionality.

As the number of features (dimensions) increases, the volume of the feature space grows exponentially. This has several consequences:

  • Data becomes sparse: points are far from each other in high dimensions
  • Distance metrics become meaningless: in high dimensions, the difference between the nearest and farthest neighbor shrinks
  • More data needed: to maintain the same density of points, you need exponentially more samples
  • KNN degrades: with many irrelevant features, nearest neighbors are dominated by noise dimensions

Solutions: dimensionality reduction (PCA, t-SNE), feature selection, regularization, domain knowledge to select relevant features.
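The "distance metrics become meaningless" point can be checked empirically: as the dimension grows, the nearest and farthest distances from a random query to a random point cloud converge. A small sketch (the sample sizes and dimensions are arbitrary):

```python
import numpy as np

def nn_contrast(d, n=500, seed=0):
    """Ratio of nearest to farthest distance from a random query to n random points."""
    rng = np.random.RandomState(seed)
    X = rng.rand(n, d)
    query = rng.rand(d)
    dists = np.linalg.norm(X - query, axis=1)
    return dists.min() / dists.max()

# As d grows, the ratio approaches 1: the "nearest" point barely differs from the farthest
ratios = {d: nn_contrast(d) for d in (2, 100, 1000)}
```

In 2 dimensions the nearest point is far closer than the farthest; in 1000 dimensions almost all points sit at nearly the same distance, which is why KNN degrades.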

What is gradient descent, and what are its variants?

Gradient descent is an optimization algorithm that iteratively adjusts parameters in the direction of steepest descent of the loss function.

Batch GD: Uses all training samples per update. Smooth convergence but slow for large datasets. Memory intensive.

Stochastic GD (SGD): Uses one sample per update. Very fast but noisy. Can escape local minima due to noise.

Mini-batch GD: Uses a subset (32–256 samples) per update. Best of both worlds. Standard in practice.

Advanced optimizers: Adam (adaptive learning rates per parameter), RMSprop (running average of squared gradients), Adagrad (accumulates past gradients). Adam is the most common default in deep learning.
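As a sketch of how Adam combines these ideas, here is a minimal implementation of the standard update (first and second moment estimates with bias correction) applied to a toy quadratic; the hyperparameters are the usual defaults, and `adam_minimize` is an illustrative name.

```python
import numpy as np

def adam_minimize(grad_fn, w0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimal Adam: momentum (m), per-parameter scaling (v), bias correction."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g           # first moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2      # second moment estimate
        m_hat = m / (1 - beta1 ** t)              # bias correction (moments start at 0)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return w

# Toy quadratic: minimize sum((w - target)^2), whose gradient is 2*(w - target)
target = np.array([3.0, -1.0, 0.5])
w_opt = adam_minimize(lambda w: 2 * (w - target), np.zeros(3))
```

The division by `sqrt(v_hat)` is what gives each parameter its own effective learning rate: parameters with consistently large gradients take smaller steps.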

How would you handle a feature with 1000 categories?

One-hot encoding would create 1000 columns, which is impractical. Better approaches:

  • Target encoding: Replace each category with the mean of the target variable (with smoothing to prevent overfitting)
  • Frequency encoding: Replace each category with its count or frequency in the dataset
  • Embedding: Learn a dense vector representation (common in deep learning)
  • Group rare categories: Combine categories that appear fewer than N times into an “other” category
  • Hashing trick: Hash categories into a fixed number of buckets (some collisions but bounded dimensionality)
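Target encoding with smoothing is short enough to sketch. The blend below weights the category mean by its count against the global mean; the smoothing constant and function name are illustrative choices, and in practice you would fit the table on training data only.

```python
import numpy as np

def target_encode(cats, y, smoothing=10.0):
    """Smoothed target mean per category: blend category mean with the global mean."""
    global_mean = np.mean(y)
    table = {}
    for c in np.unique(cats):
        mask = cats == c
        n_c = mask.sum()
        # Few observations -> stay near the global mean; many -> trust the category mean
        table[c] = (n_c * y[mask].mean() + smoothing * global_mean) / (n_c + smoothing)
    return np.array([table[c] for c in cats]), table

# Usage: 'b' is 100% positive but rare, so its encoding is pulled toward the global mean
cats = np.array(['a'] * 98 + ['b'] * 2)
y = np.array([1] * 50 + [0] * 48 + [1, 1])
encoded, table = target_encode(cats, y)
```

The smoothing is what prevents overfitting: a category seen twice cannot claim a 100% target rate.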

What is the difference between generative and discriminative models?

Discriminative models learn the decision boundary P(y|X) directly. Examples: logistic regression, SVM, neural networks, decision trees. They answer: “Given these features, which class?”

Generative models learn the joint distribution P(X, y) = P(X|y) * P(y). Examples: Naive Bayes, Gaussian Mixture Models, Hidden Markov Models. They can also generate new data samples.

Discriminative models generally achieve better classification accuracy with enough data. Generative models are useful when you have limited data, need to handle missing features, or want to generate synthetic data.

How do you choose between precision and recall as your optimization target?

Ask: “What is the cost of each type of error?”

Optimize Precision when false positives are expensive:

  • Email spam filter (marking good emails as spam loses important messages)
  • Content recommendation (showing irrelevant content annoys users)
  • Drug approval (approving an ineffective drug wastes resources)

Optimize Recall when false negatives are dangerous:

  • Cancer screening (missing a positive case can be fatal)
  • Fraud detection (missing fraud means financial loss)
  • Security threats (missing a threat has severe consequences)

When both matter equally, optimize F1 score (harmonic mean of precision and recall).
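Threshold tuning makes this tradeoff visible: raising the decision threshold trades recall for precision. A toy sketch with synthetic scores (the score model below is made up purely for illustration):

```python
import numpy as np

def precision_recall_at(scores, y, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Synthetic scores: positives land in [0.6, 1.0], negatives in [0, 0.4]
rng = np.random.RandomState(0)
y = rng.randint(0, 2, 1000)
scores = 0.6 * y + 0.4 * rng.rand(1000)

p_low, r_low = precision_recall_at(scores, y, 0.3)    # permissive threshold
p_high, r_high = precision_recall_at(scores, y, 0.7)  # strict threshold
```

Sweeping the threshold and plotting the two metrics gives the precision-recall curve; the right operating point depends entirely on the error costs discussed above.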


Final Checklist Before Your Interview

Algorithms to Code

  • Linear Regression (gradient descent)
  • Logistic Regression (with sigmoid)
  • Decision Tree (Gini + entropy)
  • K-Means (with K-Means++)
  • KNN (with distance weighting)
  • 2-layer Neural Network
📊

Metrics to Implement

  • MSE and RMSE
  • Confusion matrix
  • Precision, Recall, F1
  • AUC-ROC
  • K-fold cross-validation
  • R-squared score
🛠

Data Skills

  • Handle missing values
  • One-hot and target encoding
  • StandardScaler and MinMaxScaler
  • Feature engineering (RFM)
  • Outlier detection (IQR, z-score)
  • Train-test split
💬

Concepts to Explain

  • Bias-variance tradeoff
  • Overfitting and regularization
  • Bagging vs boosting
  • Data leakage
  • Class imbalance strategies
  • Curse of dimensionality
💡
Good luck! If you can implement every algorithm in this course from scratch and explain the concepts in the FAQ section, you are well-prepared for ML coding interviews at any company. Remember: the goal is not perfection — it is demonstrating clear thinking, solid fundamentals, and the ability to write clean code under pressure.