ML Coding Round
The ML coding round tests whether you can translate ML concepts into working code under time pressure. Unlike standard software engineering coding rounds, the focus is on ML-specific implementations — not LeetCode-style problems.
What Interviewers Actually Evaluate
Interviewers use a scoring rubric that goes beyond "does it work." Here is what they look for:
| Criteria | Weight | What Strong Looks Like |
|---|---|---|
| ML Knowledge | 30% | Correct algorithm implementation, understands the math behind it, knows when to use it |
| Code Quality | 25% | Clean, readable, well-structured code. Meaningful variable names. Modular functions. |
| Problem Solving | 25% | Clarifies requirements before coding. Breaks problem into steps. Handles edge cases. |
| Communication | 20% | Explains approach before coding. Thinks out loud. Discusses trade-offs when asked. |
The 5 Most Common ML Coding Patterns
These patterns cover approximately 80% of ML coding interview questions. Master all five.
Pattern 1: Implement Gradient Descent from Scratch
Why they ask it: Tests whether you understand optimization — the foundation of all ML training.
```python
import numpy as np

def linear_regression_gradient_descent(X, y, lr=0.01, epochs=1000):
    """
    Train a linear regression model using batch gradient descent.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target vector (n_samples,)
        lr: Learning rate
        epochs: Number of training iterations

    Returns:
        weights: Learned weight vector
        bias: Learned bias term
        losses: List of MSE loss at each epoch
    """
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    losses = []

    for epoch in range(epochs):
        # Forward pass: compute predictions
        y_pred = X @ weights + bias

        # Compute loss (MSE)
        loss = np.mean((y_pred - y) ** 2)
        losses.append(loss)

        # Compute gradients
        error = y_pred - y
        dw = (2 / n_samples) * (X.T @ error)
        db = (2 / n_samples) * np.sum(error)

        # Update parameters
        weights -= lr * dw
        bias -= lr * db

    return weights, bias, losses

# INTERVIEWER FOLLOW-UPS:
# 1. "How would you add L2 regularization?"
#    Add lambda * weights to dw, and lambda * ||w||^2 to the loss
#
# 2. "How would you convert this to stochastic GD?"
#    Sample a random mini-batch each iteration instead of using all data
#
# 3. "When would gradient descent fail?"
#    Non-convex loss surfaces, learning rate too high/low, feature scaling issues
```
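Follow-up 1 can be answered in code, not just words. A minimal sketch of the ridge-regularized variant, assuming a `lam` strength parameter; the function name and defaults here are illustrative, not a canonical solution:

```python
import numpy as np

def ridge_gradient_descent(X, y, lr=0.01, epochs=1000, lam=0.1):
    """Batch gradient descent for linear regression with an L2 penalty.

    Illustrative variant of the interview solution above; `lam` is the
    regularization strength (an assumed name, not standard API).
    """
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(epochs):
        y_pred = X @ weights + bias
        error = y_pred - y
        # Same gradient as before, plus the derivative of lam * ||w||^2.
        # By convention the bias is not regularized.
        dw = (2 / n_samples) * (X.T @ error) + 2 * lam * weights
        db = (2 / n_samples) * np.sum(error)
        weights -= lr * dw
        bias -= lr * db
    return weights, bias
```

With `lam=0` this reduces exactly to the unregularized version; larger `lam` shrinks the learned weights toward zero.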
Pattern 2: Build a Decision Tree Classifier
Why they ask it: Tests understanding of recursive algorithms, information theory, and tree-based models.
```python
import numpy as np
from collections import Counter

class DecisionTreeNode:
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, value=None):
        self.feature = feature      # Index of feature to split on
        self.threshold = threshold  # Threshold value for split
        self.left = left            # Left subtree
        self.right = right          # Right subtree
        self.value = value          # Leaf prediction (class label)

class SimpleDecisionTree:
    def __init__(self, max_depth=10, min_samples=2):
        self.max_depth = max_depth
        self.min_samples = min_samples
        self.root = None

    def _gini(self, y):
        """Compute Gini impurity."""
        counts = Counter(y)
        n = len(y)
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    def _best_split(self, X, y):
        """Find the best feature and threshold to split on."""
        best_gain = -1
        best_feature, best_threshold = None, None
        parent_gini = self._gini(y)
        n = len(y)

        for feature_idx in range(X.shape[1]):
            thresholds = np.unique(X[:, feature_idx])
            for threshold in thresholds:
                left_mask = X[:, feature_idx] <= threshold
                right_mask = ~left_mask
                if left_mask.sum() == 0 or right_mask.sum() == 0:
                    continue

                # Weighted Gini of children
                left_gini = self._gini(y[left_mask])
                right_gini = self._gini(y[right_mask])
                weighted = (left_mask.sum() / n * left_gini +
                            right_mask.sum() / n * right_gini)
                gain = parent_gini - weighted

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold

        return best_feature, best_threshold, best_gain

    def _build(self, X, y, depth):
        """Recursively build the tree."""
        # Stopping conditions
        if (depth >= self.max_depth or
                len(y) < self.min_samples or
                len(set(y)) == 1):
            return DecisionTreeNode(
                value=Counter(y).most_common(1)[0][0]
            )

        feature, threshold, gain = self._best_split(X, y)
        if gain <= 0:
            return DecisionTreeNode(
                value=Counter(y).most_common(1)[0][0]
            )

        left_mask = X[:, feature] <= threshold
        left = self._build(X[left_mask], y[left_mask], depth + 1)
        right = self._build(X[~left_mask], y[~left_mask], depth + 1)
        return DecisionTreeNode(feature=feature, threshold=threshold,
                                left=left, right=right)

    def fit(self, X, y):
        self.root = self._build(X, y, depth=0)

    def _predict_one(self, x, node):
        if node.value is not None:
            return node.value
        if x[node.feature] <= node.threshold:
            return self._predict_one(x, node.left)
        return self._predict_one(x, node.right)

    def predict(self, X):
        return np.array([self._predict_one(x, self.root) for x in X])
```
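A common follow-up here is "why Gini instead of entropy?" Both measure impurity and usually pick similar splits; Gini is slightly cheaper because it avoids a logarithm. A small standalone sketch of the entropy alternative (the helper name is mine, not part of the class above):

```python
import numpy as np
from collections import Counter

def entropy(y):
    """Shannon entropy in bits: the information-theoretic alternative
    to Gini impurity for scoring candidate splits."""
    n = len(y)
    probs = [c / n for c in Counter(y).values()]
    return -sum(p * np.log2(p) for p in probs)

# A 50/50 class split is maximally impure under both criteria:
# entropy = 1.0 bit, while Gini = 1 - (0.5^2 + 0.5^2) = 0.5
print(entropy([0, 0, 1, 1]))  # 1.0
```

To switch the tree over, you would replace `_gini` with this function and use information gain instead of Gini gain in `_best_split`.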
Pattern 3: Build a Data Processing Pipeline
Why they ask it: Real ML work is 80% data. This tests your ability to handle messy, real-world data.
```python
import numpy as np
import pandas as pd

def build_feature_pipeline(df, target_col, categorical_cols,
                           numerical_cols):
    """
    Build a complete feature engineering pipeline.

    Args:
        df: Raw DataFrame
        target_col: Name of target column
        categorical_cols: List of categorical column names
        numerical_cols: List of numerical column names

    Returns:
        X: Processed feature matrix
        y: Target vector
        pipeline_config: Dict of parameters for inference
    """
    df = df.copy()  # avoid mutating the caller's DataFrame
    pipeline_config = {}

    # Step 1: Handle missing values
    for col in numerical_cols:
        median_val = df[col].median()
        df[col] = df[col].fillna(median_val)
        pipeline_config[f'{col}_median'] = median_val

    for col in categorical_cols:
        mode_val = df[col].mode()[0]
        df[col] = df[col].fillna(mode_val)
        pipeline_config[f'{col}_mode'] = mode_val

    # Step 2: Encode categorical variables
    encoded_frames = []
    for col in categorical_cols:
        dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
        encoded_frames.append(dummies)
        pipeline_config[f'{col}_categories'] = list(dummies.columns)

    # Step 3: Scale numerical features
    for col in numerical_cols:
        mean_val = df[col].mean()
        std_val = df[col].std()
        df[col] = (df[col] - mean_val) / (std_val + 1e-8)
        pipeline_config[f'{col}_mean'] = mean_val
        pipeline_config[f'{col}_std'] = std_val

    # Step 4: Combine features
    numerical_features = df[numerical_cols]
    X = pd.concat([numerical_features] + encoded_frames, axis=1)
    y = df[target_col].values
    return X.values, y, pipeline_config

# KEY INTERVIEWER QUESTIONS:
# 1. "Why do you save the pipeline_config?"
#    For inference — you must apply the SAME transformations
#    (same medians, means, categories) to new data.
#
# 2. "What's wrong with using the test set statistics?"
#    Data leakage — the model learns information from the test set.
#
# 3. "How would you handle a category at inference time
#    that wasn't in the training data?"
#    Ignore it (treat as all zeros) or use a catch-all "other" bucket.
```
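Questions 1 and 3 come together at inference time. Here is one way the serving-side counterpart could look, assuming the same `pipeline_config` keys as above; `apply_pipeline` is a hypothetical helper, and the "treat unseen categories as all zeros" choice is implemented via a column reindex:

```python
import pandas as pd

def apply_pipeline(df, pipeline_config, categorical_cols, numerical_cols):
    """Apply training-time statistics to new data (illustrative sketch)."""
    df = df.copy()
    for col in numerical_cols:
        # Fill with the TRAINING median, scale with the TRAINING mean/std
        df[col] = df[col].fillna(pipeline_config[f'{col}_median'])
        mean = pipeline_config[f'{col}_mean']
        std = pipeline_config[f'{col}_std']
        df[col] = (df[col] - mean) / (std + 1e-8)

    encoded = []
    for col in categorical_cols:
        df[col] = df[col].fillna(pipeline_config[f'{col}_mode'])
        dummies = pd.get_dummies(df[col], prefix=col,
                                 drop_first=True, dtype=float)
        # Reindex to the training-time columns: unseen categories become
        # all-zero rows, and missing training columns are added as zeros.
        dummies = dummies.reindex(
            columns=pipeline_config[f'{col}_categories'], fill_value=0)
        encoded.append(dummies)

    X = pd.concat([df[numerical_cols]] + encoded, axis=1)
    return X.values
```

Note that no statistics are computed here: everything comes from `pipeline_config`, which is exactly what prevents leakage.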
Pattern 4: Implement K-Means Clustering
Why they ask it: Tests understanding of unsupervised learning, iterative algorithms, and convergence.
```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4):
    """
    Implement K-Means clustering from scratch.

    Args:
        X: Data matrix (n_samples, n_features)
        k: Number of clusters
        max_iters: Maximum iterations
        tol: Convergence tolerance

    Returns:
        centroids: Final cluster centers (k, n_features)
        labels: Cluster assignment for each point (n_samples,)
    """
    n_samples, n_features = X.shape

    # Initialize centroids using random data points
    random_indices = np.random.choice(n_samples, k, replace=False)
    centroids = X[random_indices].copy()

    for iteration in range(max_iters):
        # Step 1: Assign each point to nearest centroid
        distances = np.zeros((n_samples, k))
        for i in range(k):
            distances[:, i] = np.linalg.norm(X - centroids[i], axis=1)
        labels = np.argmin(distances, axis=1)

        # Step 2: Update centroids
        new_centroids = np.zeros_like(centroids)
        for i in range(k):
            cluster_points = X[labels == i]
            if len(cluster_points) > 0:
                new_centroids[i] = cluster_points.mean(axis=0)
            else:
                # Handle empty cluster: reinitialize randomly
                new_centroids[i] = X[np.random.randint(n_samples)]

        # Step 3: Check for convergence
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, labels

# FOLLOW-UP: "What are the limitations of K-Means?"
# 1. Must specify k in advance (use elbow method or silhouette)
# 2. Sensitive to initialization (use k-means++ instead)
# 3. Assumes spherical clusters (fails on non-convex shapes)
# 4. Sensitive to outliers (consider k-medoids instead)
```
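Limitation 2 has a standard fix worth knowing cold: k-means++ seeding picks each new centroid with probability proportional to its squared distance from the nearest centroid chosen so far, which spreads the initial centers out. A sketch of that seeding step (the function name and `rng` parameter are my own):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding sketch (Arthur & Vassilvitskii): each new
    centroid is drawn with probability proportional to its squared
    distance from the nearest already-chosen centroid."""
    rng = np.random.default_rng(rng)
    n_samples = X.shape[0]
    centroids = [X[rng.integers(n_samples)]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest centroid so far
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids],
                    axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n_samples, p=probs)])
    return np.array(centroids)
```

You would pass the result in place of the `random_indices` initialization above; everything after initialization stays the same.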
Pattern 5: Implement a Simple Neural Network
Why they ask it: Tests understanding of backpropagation, activation functions, and the training loop.
```python
import numpy as np

class SimpleNeuralNetwork:
    """Two-layer neural network for binary classification."""

    def __init__(self, input_size, hidden_size, lr=0.01):
        # He initialization (the sqrt(2/fan_in) scaling suited to ReLU)
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros(hidden_size)
        self.W2 = np.random.randn(hidden_size, 1) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros(1)
        self.lr = lr

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def _relu(self, z):
        return np.maximum(0, z)

    def _relu_derivative(self, z):
        return (z > 0).astype(float)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self._relu(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self._sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        n = X.shape[0]
        y = y.reshape(-1, 1)

        # Output layer gradients (sigmoid + cross-entropy simplifies to a2 - y)
        dz2 = self.a2 - y                           # (n, 1)
        dW2 = (1 / n) * (self.a1.T @ dz2)           # (hidden, 1)
        db2 = (1 / n) * np.sum(dz2, axis=0)         # (1,)

        # Hidden layer gradients
        da1 = dz2 @ self.W2.T                       # (n, hidden)
        dz1 = da1 * self._relu_derivative(self.z1)  # (n, hidden)
        dW1 = (1 / n) * (X.T @ dz1)                 # (input, hidden)
        db1 = (1 / n) * np.sum(dz1, axis=0)         # (hidden,)

        # Update weights
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def train(self, X, y, epochs=100):
        y = y.reshape(-1, 1)  # (n, 1) so the loss broadcasts against y_pred
        losses = []
        for epoch in range(epochs):
            y_pred = self.forward(X)
            # Binary cross-entropy, with epsilon to avoid log(0)
            loss = -np.mean(y * np.log(y_pred + 1e-8) +
                            (1 - y) * np.log(1 - y_pred + 1e-8))
            losses.append(loss)
            self.backward(X, y)
        return losses

    def predict(self, X):
        return (self.forward(X) >= 0.5).astype(int).flatten()
```
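A frequent follow-up is "how would you verify your backward pass is correct?" The standard answer is a finite-difference gradient check: perturb each parameter by a small epsilon and compare the numerical slope against the analytic gradient. A self-contained sketch (the helper name is my own):

```python
import numpy as np

def numerical_grad(f, w, eps=1e-5):
    """Central-difference gradient estimate of scalar function f at w.

    Used to sanity-check a hand-written backward pass: the analytic
    gradients (dW1, dW2, ...) should match this to within ~eps.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        grad.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

# Sanity check on f(w) = ||w||^2, whose true gradient is 2w:
w = np.array([1.0, -2.0, 3.0])
print(numerical_grad(lambda v: np.sum(v ** 2), w))  # ~ [2., -4., 6.]
```

To apply this to the network above, wrap the loss as a function of one weight matrix (holding the rest fixed) and compare against the corresponding `dW` from `backward`.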
Coding Style That Impresses Interviewers
Your code style signals your experience level. Follow these conventions:
| Do | Do Not |
|---|---|
| Write docstrings with Args and Returns | Skip documentation entirely |
| Use descriptive variable names (n_samples, learning_rate) | Use single letters (a, b, x1) |
| Handle edge cases (empty arrays, division by zero) | Assume inputs are always clean |
| Use NumPy vectorization | Write nested for-loops over arrays |
| Explain your approach before coding | Code in silence for 30 minutes |
| Test with a simple example at the end | Say "I think it works" without testing |
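The vectorization row is the one most candidates stumble on in practice. As one concrete illustration (toy data, not from any specific interview), here are loop and broadcast versions of the pairwise squared-distance computation that appears in K-Means:

```python
import numpy as np

# Pairwise squared Euclidean distances between n points and k centers
X = np.random.randn(100, 3)   # (n, d)
C = np.random.randn(4, 3)     # (k, d)

# Loop version: O(n * k) Python-level iterations
loop = np.array([[np.sum((x - c) ** 2) for c in C] for x in X])

# Vectorized: broadcast (n, 1, d) against (1, k, d), reduce over d
vec = np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2)

print(np.allclose(loop, vec))  # True
```

Writing the loop version first and then vectorizing it, while narrating the shapes, is a perfectly good interview strategy.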
Time Management Strategy
```
45-MINUTE ML CODING ROUND BREAKDOWN
=====================================

[0:00 - 0:05]  Read problem, ask clarifying questions
    - "Can I assume the data fits in memory?"
    - "Should I handle multi-class or just binary?"
    - "Is there a specific metric I should optimize?"

[0:05 - 0:10]  Outline approach (pseudocode or verbal)
    - "I'll implement this in three steps..."
    - Get interviewer buy-in before coding

[0:10 - 0:35]  Write code
    - Start with the main function signature
    - Build incrementally (get a basic version first)
    - Comment non-obvious steps

[0:35 - 0:40]  Test with a simple example
    - Walk through your code with a 3-4 row dataset
    - Verify the output makes sense

[0:40 - 0:45]  Discuss extensions and trade-offs
    - "If I had more time, I would add..."
    - "The time complexity is O(n * k * features)..."
    - "In production, I would use sklearn, but..."
```
Lilly Tech Systems