Intermediate

Implement Linear & Logistic Regression

The most frequently asked ML coding question. Build both models from scratch with NumPy, understand every line of the gradient descent update, and learn how to add regularization when the interviewer asks the follow-up.

📝
Interview Question #1: “Implement linear regression from scratch using only NumPy. Your class should have fit(X, y) and predict(X) methods. Use gradient descent for optimization.”

Linear Regression: Complete Solution

The Math You Must Know

Linear regression models the relationship y = Xw + b where:

  • X is the feature matrix of shape (n_samples, n_features)
  • w is the weight vector of shape (n_features,)
  • b is the bias (scalar)
  • y is the target vector of shape (n_samples,)

The loss function is Mean Squared Error (MSE):

L = (1 / (2n)) * sum((y_pred - y_true)^2)

The gradients are:

dL/dw = (1/n) * X.T @ (y_pred - y_true)
dL/db = (1/n) * sum(y_pred - y_true)
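A quick way to convince yourself (or an interviewer) that these formulas are correct is a finite-difference check on a tiny random problem. This is a standalone sketch, not part of the class below:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
w = rng.standard_normal(3)
b = 0.5
n = len(y)

def loss(w, b):
    """MSE with the 1/(2n) factor, matching the formula above."""
    err = X @ w + b - y
    return np.mean(err ** 2) / 2

# Analytic gradients (the formulas above)
err = X @ w + b - y
dw = X.T @ err / n
db = err.sum() / n

# Central finite-difference gradients
eps = 1e-6
dw_num = np.array([
    (loss(w + eps * np.eye(3)[j], b) - loss(w - eps * np.eye(3)[j], b)) / (2 * eps)
    for j in range(3)
])
db_num = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-6))  # True
print(np.isclose(db, db_num, atol=1e-6))   # True
```

If the analytic and numeric gradients disagree, the bug is almost always a missing transpose or a dropped 1/n factor.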

Full Implementation

import numpy as np

class LinearRegression:
    """
    Linear regression using gradient descent.
    Interview-ready implementation with full documentation.
    """

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.loss_history = []

    def fit(self, X, y):
        """
        Train the model using gradient descent.

        Parameters:
            X: np.ndarray of shape (n_samples, n_features)
            y: np.ndarray of shape (n_samples,)
        """
        n_samples, n_features = X.shape

        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for i in range(self.n_iter):
            # Forward pass
            y_pred = X @ self.weights + self.bias  # (n_samples,)

            # Compute loss (MSE)
            loss = np.mean((y_pred - y) ** 2) / 2
            self.loss_history.append(loss)

            # Compute gradients
            dw = (1 / n_samples) * (X.T @ (y_pred - y))  # (n_features,)
            db = (1 / n_samples) * np.sum(y_pred - y)      # scalar

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

        return self

    def predict(self, X):
        """Predict target values for input X."""
        return X @ self.weights + self.bias

    def score(self, X, y):
        """Return R-squared score."""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)


# ---- Test with known data ----
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.squeeze() + np.random.randn(100) * 0.5  # y = 4 + 3x + noise

model = LinearRegression(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)

print(f"Weight: {model.weights[0]:.4f}")  # Should be close to 3.0
print(f"Bias:   {model.bias:.4f}")        # Should be close to 4.0
print(f"R^2:    {model.score(X, y):.4f}") # Should be close to 1.0

What interviewers look for:

  • Correct gradient computation — the transpose of X multiplied by the error, divided by n
  • Proper shape handling — no accidental broadcasting errors
  • Loss tracking — shows you understand convergence
  • Clean API matching scikit-learn conventions (fit/predict/score)

Common follow-up: “Your gradient descent is not converging. What could be wrong?” Answer: (1) Learning rate too high; try reducing it by 10x. (2) Features not normalized; different scales cause zigzagging. (3) Too few iterations. Always mention feature normalization.
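The normalization point is worth being able to sketch on the spot. A minimal standardization helper (the name `standardize` is hypothetical, not part of the class above) scales each feature to zero mean and unit variance:

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance scaling per feature (guarding against zero std)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # leave constant features unchanged
    return (X - mu) / sigma, mu, sigma

# Features on wildly different scales
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_scaled, mu, sigma = standardize(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```

Remember to reuse the training-set `mu` and `sigma` when transforming test data; recomputing them at prediction time leaks information and shifts the inputs.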

📝
Interview Question #2: “Now implement logistic regression for binary classification. Include the sigmoid function and binary cross-entropy loss. Use the same gradient descent approach.”

Logistic Regression: Complete Solution

Key Differences from Linear Regression

| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Output | Continuous value | Probability (0 to 1) |
| Activation | None (identity) | Sigmoid function |
| Loss function | Mean Squared Error | Binary Cross-Entropy |
| Gradient formula | Same structure | Same structure (surprisingly) |
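The "same structure" claim follows from the chain rule: dL/dp = -(y/p - (1-y)/(1-p)) and dp/dz = p(1-p), whose product collapses to p - y. A quick numerical check of that result:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    """Binary cross-entropy as a function of the pre-sigmoid logit z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y = 0.7, 1.0
eps = 1e-6
grad_numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)
grad_analytic = sigmoid(z) - y  # the "same structure" result: y_pred - y

print(np.isclose(grad_numeric, grad_analytic, atol=1e-6))  # True
```

So the residual (y_pred - y) drives the update in both models; only the definition of y_pred changes.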

Full Implementation

import numpy as np

class LogisticRegression:
    """
    Logistic regression using gradient descent.
    Handles binary classification with sigmoid activation.
    """

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.loss_history = []

    def _sigmoid(self, z):
        """
        Numerically stable sigmoid function.
        Avoids overflow by clipping input values.
        """
        z = np.clip(z, -500, 500)  # Prevent overflow
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        """
        Train logistic regression using gradient descent.

        Parameters:
            X: np.ndarray of shape (n_samples, n_features)
            y: np.ndarray of shape (n_samples,) with values 0 or 1
        """
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for i in range(self.n_iter):
            # Forward pass
            z = X @ self.weights + self.bias    # (n_samples,)
            y_pred = self._sigmoid(z)           # (n_samples,)

            # Binary cross-entropy loss
            # Add epsilon to prevent log(0)
            eps = 1e-15
            loss = -np.mean(
                y * np.log(y_pred + eps) +
                (1 - y) * np.log(1 - y_pred + eps)
            )
            self.loss_history.append(loss)

            # Gradients (same form as linear regression!)
            dw = (1 / n_samples) * (X.T @ (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)

            # Update
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

        return self

    def predict_proba(self, X):
        """Return probability of class 1."""
        z = X @ self.weights + self.bias
        return self._sigmoid(z)

    def predict(self, X, threshold=0.5):
        """Return class labels (0 or 1)."""
        return (self.predict_proba(X) >= threshold).astype(int)

    def accuracy(self, X, y):
        """Return classification accuracy."""
        return np.mean(self.predict(X) == y)


# ---- Test with separable data ----
np.random.seed(42)
# Class 0: centered at (-1, -1), Class 1: centered at (1, 1)
X_class0 = np.random.randn(50, 2) + np.array([-1, -1])
X_class1 = np.random.randn(50, 2) + np.array([1, 1])
X = np.vstack([X_class0, X_class1])
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)

print(f"Weights: {model.weights}")
print(f"Bias:    {model.bias:.4f}")
print(f"Accuracy: {model.accuracy(X, y):.2%}")  # Should be >90%
print(f"Predict [2,2]: {model.predict(np.array([[2, 2]]))}")  # Should be 1
print(f"Predict [-2,-2]: {model.predict(np.array([[-2, -2]]))}")  # Should be 0
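It helps to be able to explain geometrically why those two test points land on opposite classes. The decision boundary is the set of points where w @ x + b = 0, i.e. where the sigmoid outputs exactly 0.5. A sketch with hypothetical parameters (illustrative, not the exact values the run above prints):

```python
import numpy as np

# Hypothetical trained parameters for the two-blob problem above
w = np.array([1.0, 1.0])
b = 0.0

# In 2D the boundary is the line x2 = -(w[0]*x1 + b) / w[1]
x1 = 1.5
x2 = -(w[0] * x1 + b) / w[1]
z_boundary = w @ np.array([x1, x2]) + b
print(z_boundary)  # 0.0 -> sigmoid gives exactly 0.5

# A point on the class-1 side has z > 0, hence probability > 0.5
z = w @ np.array([2.0, 2.0]) + b
print(1.0 / (1.0 + np.exp(-z)) > 0.5)  # True
```

With symmetric blobs at (-1, -1) and (1, 1), the learned boundary ends up close to the line x1 + x2 = 0, which is why [2, 2] and [-2, -2] are classified so confidently.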

What interviewers look for:

  • Numerical stability in the sigmoid — clipping z to prevent exp overflow
  • Epsilon in log — preventing log(0) which gives -infinity
  • Understanding that the gradient formula is identical to linear regression (a non-obvious insight)
  • Separate predict_proba and predict methods showing understanding of thresholds
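The first two bullets are easy to demonstrate concretely. Without clipping, np.exp(-z) overflows for large negative z, and saturated probabilities of exactly 0.0 or 1.0 make the cross-entropy logs infinite unless the epsilon is there. A small sketch:

```python
import numpy as np

z = np.array([-1000.0, 0.0, 1000.0])

# Naive sigmoid: np.exp(-z) overflows to inf at z = -1000 (RuntimeWarning),
# and the outputs saturate to exactly 0.0 and 1.0 -- so log(y_pred) and
# log(1 - y_pred) in the loss would hit -inf without the epsilon guard.
with np.errstate(over="ignore"):
    p_naive = 1.0 / (1.0 + np.exp(-z))
print(p_naive)  # [0.  0.5 1. ]

# Clipping keeps the exp finite; the epsilon keeps every log finite even
# where the probability still rounds to exactly 0.0 or 1.0.
zc = np.clip(z, -500, 500)
p = 1.0 / (1.0 + np.exp(-zc))
eps = 1e-15
loss_terms = -(np.log(p + eps) + np.log(1 - p + eps))
print(np.isfinite(loss_terms).all())  # True
```

Mentioning this failure mode unprompted is an easy way to signal production experience.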

📝
Interview Question #3 (Follow-up): “Add L2 regularization (Ridge) to your logistic regression. Explain why regularization helps and what happens as lambda increases.”

Adding L2 Regularization

Regularization prevents overfitting by penalizing large weights. The regularized loss is:

L_reg = L + (lambda / 2n) * sum(w^2)

The gradient changes to include the regularization term:

dw_reg = dw + (lambda / n) * w

Full Implementation

import numpy as np

class LogisticRegressionL2:
    """Logistic regression with L2 (Ridge) regularization."""

    def __init__(self, learning_rate=0.01, n_iterations=1000, reg_lambda=0.1):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.reg_lambda = reg_lambda
        self.weights = None
        self.bias = None

    def _sigmoid(self, z):
        z = np.clip(z, -500, 500)
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for _ in range(self.n_iter):
            z = X @ self.weights + self.bias
            y_pred = self._sigmoid(z)

            # Gradients WITH regularization
            dw = (1 / n_samples) * (X.T @ (y_pred - y)) + \
                 (self.reg_lambda / n_samples) * self.weights
            db = (1 / n_samples) * np.sum(y_pred - y)
            # Note: bias is NOT regularized (standard practice)

            self.weights -= self.lr * dw
            self.bias -= self.lr * db

        return self

    def predict(self, X, threshold=0.5):
        z = X @ self.weights + self.bias
        return (self._sigmoid(z) >= threshold).astype(int)


# Compare regularized vs unregularized
model_noreg = LogisticRegressionL2(learning_rate=0.1, n_iterations=1000, reg_lambda=0.0)
model_reg = LogisticRegressionL2(learning_rate=0.1, n_iterations=1000, reg_lambda=1.0)

model_noreg.fit(X, y)
model_reg.fit(X, y)

print(f"No regularization - weights magnitude: {np.linalg.norm(model_noreg.weights):.4f}")
print(f"With L2 (lambda=1) - weights magnitude: {np.linalg.norm(model_reg.weights):.4f}")
# Regularized weights should have smaller magnitude

💡
Interview tip: When explaining regularization, always mention: (1) It prevents overfitting by constraining weight magnitudes. (2) The bias term is not regularized. (3) As lambda increases, weights shrink toward zero, making the model simpler but potentially underfitting. (4) L1 (Lasso) promotes sparsity while L2 (Ridge) distributes weight across features.
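Point (3) of the tip, weights shrinking toward zero as lambda grows, can be shown in a few lines using the closed-form ridge solution for linear regression, w = (X^T X + lambda*I)^(-1) X^T y. This is a linear-regression sketch on synthetic data, used here only to make the shrinkage visible:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
true_w = np.array([3.0, -2.0, 1.0, 0.5, -1.5])
y = X @ true_w + rng.standard_normal(100) * 0.1

def ridge_weights(X, y, lam):
    """Closed-form ridge solution (no bias term for simplicity)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

lams = [0.0, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge_weights(X, y, lam)) for lam in lams]
for lam, nrm in zip(lams, norms):
    print(f"lambda={lam:6.1f}  ||w|| = {nrm:.4f}")
# The weight norm shrinks monotonically as lambda grows
```

The same shrinkage happens in the gradient-descent version above; the extra (lambda/n)*w term in dw pulls each weight a little toward zero on every update.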

📝
Interview Question #4: “Implement the closed-form solution (Normal Equation) for linear regression. When would you prefer this over gradient descent?”

Normal Equation: Closed-Form Solution

class LinearRegressionClosed:
    """Linear regression using the Normal Equation (closed-form)."""

    def __init__(self):
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        """
        Solve w = (X^T X)^(-1) X^T y directly.
        We prepend a column of ones to X to handle the bias.
        """
        n_samples = X.shape[0]
        # Add bias column
        X_b = np.column_stack([np.ones(n_samples), X])

        # Normal equation: w = (X^T X)^(-1) X^T y
        # Using np.linalg.solve for numerical stability instead of inverse
        XtX = X_b.T @ X_b
        Xty = X_b.T @ y
        theta = np.linalg.solve(XtX, Xty)

        self.bias = theta[0]
        self.weights = theta[1:]
        return self

    def predict(self, X):
        return X @ self.weights + self.bias


# Test - should match gradient descent solution
model_closed = LinearRegressionClosed()
model_closed.fit(X, y)  # Using same X, y from earlier

print(f"Closed-form weight: {model_closed.weights[0]:.4f}")
print(f"Closed-form bias:   {model_closed.bias:.4f}")

When to use Normal Equation vs Gradient Descent:

| Factor | Normal Equation | Gradient Descent |
|---|---|---|
| Time complexity | O(n*d^2 + d^3) | O(n*d*k), where k = iterations |
| Best when | d < 10,000 features | d > 10,000 features |
| Feature scaling needed? | No | Yes (for convergence) |
| Hyperparameters | None | Learning rate, iterations |
| Singular matrix? | Can fail | Always works |
| Works for logistic? | No (non-linear loss) | Yes |

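The "can fail" row deserves a concrete example: with exactly duplicated (or otherwise linearly dependent) features, X^T X is singular and np.linalg.solve raises LinAlgError. The SVD-based np.linalg.lstsq still returns the minimum-norm least-squares solution. A sketch:

```python
import numpy as np

# The third column duplicates the second, so X^T X is singular
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0],
              [1.0, 5.0, 5.0]])
y = np.array([5.0, 7.0, 9.0, 11.0])  # y = 1 + 2 * x, exactly

solve_failed = False
try:
    np.linalg.solve(X.T @ X, X.T @ y)
except np.linalg.LinAlgError:
    solve_failed = True
print(solve_failed)  # True: normal equations break on singular X^T X

# lstsq (SVD-based) returns the minimum-norm solution regardless
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ theta, y))  # True: the fit is still exact
```

Near-singularity (highly correlated features) is the sneakier real-world version of this: solve may succeed but return wildly unstable weights, which is exactly the situation ridge regularization fixes.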
💡
Bonus point: Tell the interviewer you use np.linalg.solve instead of np.linalg.inv because computing the explicit inverse is numerically less stable and slower. This shows real-world engineering awareness.