Intermediate
Supervised Learning Questions
15 interview questions and model answers covering regression, classification, loss functions, gradient descent, regularization, SVM, and ensemble methods.
Q1: What is the difference between regression and classification?
Model Answer: Regression predicts a continuous numerical output (e.g., house price, temperature, stock price). Classification predicts a discrete categorical label (e.g., spam/not spam, cat/dog, disease type). The key distinction is the nature of the target variable. Some algorithms handle both (decision trees, neural networks), while others are specialized (linear regression vs logistic regression). The choice of loss function differs: regression typically uses MSE or MAE, while classification uses cross-entropy or hinge loss. Some problems can be framed either way — for example, predicting credit scores (regression) vs predicting default/no-default (classification).
Q2: Explain the logistic regression model. Is it really regression?
Model Answer: Despite its name, logistic regression is a classification algorithm. It models the probability of a binary outcome by applying the sigmoid function to a linear combination of inputs: P(y=1|x) = 1/(1 + e^(-(w·x + b))). The sigmoid squashes the output to the (0, 1) range, which we interpret as a probability. The name comes from the fact that it uses a regression-like formulation (linear combination of features) but outputs probabilities via the logistic (sigmoid) function. Training minimizes binary cross-entropy loss. The decision boundary is linear in feature space. For multiclass problems, we extend it using softmax (multinomial logistic regression) or one-vs-rest/one-vs-one strategies. Logistic regression is still widely used because it is fast, interpretable, and works well when the decision boundary is approximately linear.
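As a quick illustration, here is a minimal NumPy sketch of the sigmoid and the resulting linear decision rule (the weights, bias, and inputs are made up for the example):

```python
import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # P(y=1|x): sigmoid applied to a linear combination of features
    return sigmoid(X @ w + b)

# Hypothetical weights for a 2-feature problem
w = np.array([1.5, -0.5])
b = 0.2
X = np.array([[1.0, 2.0], [0.0, 0.0]])

probs = predict_proba(X, w, b)
labels = (probs >= 0.5).astype(int)  # thresholding at p = 0.5 gives a linear boundary
```

Thresholding the probability at 0.5 is equivalent to checking the sign of w·x + b, which is why the decision boundary is linear.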
Q3: What are common loss functions and when do you use each?
Model Answer: For regression: (1) MSE (Mean Squared Error) penalizes large errors quadratically, making it sensitive to outliers; use when outliers are meaningful. (2) MAE (Mean Absolute Error) penalizes all errors linearly, more robust to outliers. (3) Huber Loss combines MSE for small errors and MAE for large ones — good default for noisy data. For classification: (1) Binary Cross-Entropy (Log Loss) for binary classification; penalizes confident wrong predictions heavily. (2) Categorical Cross-Entropy for multiclass problems with softmax output. (3) Hinge Loss for SVM-style maximum margin classification. The choice depends on the problem: MSE is the default for regression, cross-entropy for classification, but outlier sensitivity, class imbalance, and specific business requirements can change the best choice.
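These losses are short enough to write out directly; a minimal NumPy sketch (the clipping constant in the cross-entropy is a standard numerical-stability choice, not part of the definition):

```python
import numpy as np

def mse(y, p):
    # Mean Squared Error: quadratic penalty, sensitive to outliers
    return np.mean((y - p) ** 2)

def mae(y, p):
    # Mean Absolute Error: linear penalty, more robust to outliers
    return np.mean(np.abs(y - p))

def huber(y, p, delta=1.0):
    # Quadratic for small errors, linear for large ones
    e = np.abs(y - p)
    return np.mean(np.where(e <= delta, 0.5 * e ** 2, delta * (e - 0.5 * delta)))

def binary_cross_entropy(y, p, eps=1e-12):
    # Heavily penalizes confident wrong predictions
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```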
Q4: Explain gradient descent. How does it work?
Model Answer: Gradient descent is an iterative optimization algorithm that minimizes a loss function by repeatedly updating parameters in the direction of steepest descent. At each step: w_new = w_old - learning_rate * gradient(loss). The gradient points in the direction of increasing loss, so we move in the opposite direction. The learning rate controls step size — too large causes divergence, too small causes slow convergence. Three variants exist: (1) Batch GD computes the gradient over the entire dataset — stable but slow for large data. (2) Stochastic GD (SGD) computes the gradient on one sample — noisy but fast, can escape local minima. (3) Mini-batch GD computes on a small batch (typically 32-256) — balances stability and speed, and is the standard in practice. Convergence is guaranteed for convex problems with appropriate learning rate schedule; for non-convex problems (neural networks), we converge to local minima.
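A minimal batch gradient descent sketch fitting a 1-D linear regression by hand (the synthetic data, learning rate, and iteration count are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 1.0  # true relationship: slope 2, intercept 1

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = X[:, 0] * w + b
    err = pred - y
    grad_w = 2 * np.mean(err * X[:, 0])  # d(MSE)/dw
    grad_b = 2 * np.mean(err)            # d(MSE)/db
    w -= lr * grad_w                     # step opposite the gradient
    b -= lr * grad_b
```

Mini-batch GD would be the same loop but with the gradient computed on a random slice of the data at each step.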
Q5: What is the difference between L1, L2, and ElasticNet regularization?
Model Answer: L1 (Lasso) adds the penalty lambda * sum(|w_i|) to the loss. It drives some weights to exactly zero, performing automatic feature selection. The L1 penalty creates a diamond-shaped constraint region, and optima tend to occur at the corners (sparse solutions). L2 (Ridge) adds lambda * sum(w_i^2). It shrinks all weights toward zero but rarely to exactly zero. It distributes weight across correlated features evenly. ElasticNet combines both: lambda_1 * sum(|w_i|) + lambda_2 * sum(w_i^2). It gets L1's feature selection while handling correlated features better than pure L1 (which arbitrarily picks one of correlated features). Use L1 when you suspect many irrelevant features; L2 when most features contribute; ElasticNet when you have correlated feature groups.
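One way to see the difference concretely is the per-step effect of each penalty on the weights; a sketch using the L1 proximal step (soft-thresholding) versus a plain L2 gradient step:

```python
import numpy as np

def l1_prox(w, lam):
    # Soft-thresholding, the proximal step for the L1 penalty:
    # weights within lam of zero snap to exactly 0 (sparsity)
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_shrink(w, lam, lr=1.0):
    # Gradient step on lambda * sum(w^2): multiplicative shrinkage,
    # weights move toward zero but never land exactly on it
    return w * (1.0 - 2.0 * lr * lam)

w = np.array([0.3, -0.05, 1.0])
sparse_w = l1_prox(w, 0.1)    # the small weight becomes exactly 0
shrunk_w = l2_shrink(w, 0.1)  # every weight shrinks by the same factor
```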
Q6: How does a Support Vector Machine (SVM) work?
Model Answer: SVM finds the hyperplane that maximizes the margin between two classes. The margin is the distance between the hyperplane and the nearest data points from each class (called support vectors). Maximizing the margin leads to better generalization. For linearly separable data, hard-margin SVM finds a perfect separator. For non-separable data, soft-margin SVM introduces slack variables that allow some misclassifications, controlled by the C parameter (high C = less tolerance for errors, potentially overfitting). For nonlinear boundaries, the kernel trick maps data to a higher-dimensional space where classes become linearly separable without explicitly computing the transformation. Common kernels: linear, polynomial, RBF (Gaussian). SVMs are effective in high-dimensional spaces and memory-efficient (only support vectors matter), but scale poorly to very large datasets (O(n^2) to O(n^3) training).
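The margin idea shows up directly in the hinge loss; a minimal sketch with labels in {-1, +1}:

```python
import numpy as np

def hinge_loss(y, score):
    # y in {-1, +1}, score = w.x + b.
    # Points with margin y * score >= 1 incur zero loss;
    # margin violations are penalized linearly.
    return np.maximum(0.0, 1.0 - y * score)
```

A correctly classified point outside the margin contributes nothing, which is why only the support vectors determine the final hyperplane.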
Q7: Explain how decision trees work and their main advantages/disadvantages.
Model Answer: Decision trees recursively partition the feature space by selecting the feature and threshold that best separates the data at each node. For classification, splits are chosen to maximize information gain (decrease in entropy) or Gini impurity reduction. For regression, splits minimize variance of the target in each partition. Advantages: highly interpretable, handles mixed feature types, requires little preprocessing, captures nonlinear relationships and interactions. Disadvantages: prone to overfitting (especially deep trees), unstable (small data changes cause large tree changes), biased toward features with many values, and limited to axis-aligned decision boundaries. Pruning (pre-pruning via max_depth or post-pruning) helps control overfitting. In practice, single decision trees are rarely used alone — ensemble methods like Random Forest and Gradient Boosting address their weaknesses.
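The split criterion can be made concrete; a small sketch of Gini impurity and the impurity reduction of a candidate split:

```python
import numpy as np

def gini(labels):
    # Gini impurity: probability that two random draws disagree on class
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(labels, left_mask):
    # Impurity reduction from splitting `labels` by a boolean mask
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted
```

At each node the tree greedily picks the feature/threshold pair whose mask maximizes this gain.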
Q8: How does Random Forest improve upon a single decision tree?
Model Answer: Random Forest is a bagging ensemble of decision trees with two key sources of randomness: (1) each tree is trained on a bootstrap sample (random subset with replacement) of the training data, and (2) at each split, only a random subset of features is considered (typically sqrt(p) for classification, p/3 for regression). These two sources of randomness decorrelate the individual trees, so their errors cancel out when averaged. This dramatically reduces variance while maintaining low bias. Advantages over single trees: much less prone to overfitting, more stable predictions, provides feature importance estimates, handles high-dimensional data well. Disadvantages: less interpretable than a single tree, higher memory and computation cost, and it may still underperform gradient boosting on many tabular datasets.
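The two sources of randomness are easy to sketch directly (sample and feature counts here are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 100, 16

# (1) Bootstrap sample: draw n rows with replacement for each tree
boot_idx = rng.integers(0, n_samples, size=n_samples)

# (2) Random feature subset at each split:
#     sqrt(p) features is the usual default for classification
k = int(np.sqrt(n_features))
feat_idx = rng.choice(n_features, size=k, replace=False)
```

Each tree sees a different bootstrap sample and, at every split, a different feature subset, which is what decorrelates their errors.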
Q9: What is gradient boosting and how does it differ from Random Forest?
Model Answer: Gradient boosting builds an ensemble of trees sequentially, where each new tree fits the residual errors (negative gradient of the loss) of the current ensemble. It is an additive model: F(x) = F_prev(x) + learning_rate * h(x), where h(x) is the new tree. Key differences from Random Forest: (1) boosting trains trees sequentially (not in parallel), (2) each tree is typically shallow (3-8 levels) vs Random Forest's fully-grown trees, (3) boosting primarily reduces bias while RF primarily reduces variance. XGBoost, LightGBM, and CatBoost are popular implementations that add regularization, efficient histogram-based splitting, and handling of categorical features. Gradient boosting typically achieves higher accuracy on structured/tabular data but is more sensitive to hyperparameters (learning rate, number of trees, tree depth) and more prone to overfitting without proper regularization.
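The additive-model update can be sketched with the simplest possible base learner, a single-split regression stump fit to the current residuals (the toy data and learning rate are made up for the example):

```python
import numpy as np

def fit_stump(x, residuals):
    # Hypothetical 1-split regression tree: pick the threshold that
    # minimizes squared error of the two leaf means
    best = None
    for t in np.unique(x):
        left, right = residuals[x <= t], residuals[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = np.sum((residuals - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.0, 3.0, 3.0])

F = np.full_like(y, y.mean())  # start from the mean prediction
lr = 0.5
for _ in range(20):
    t, lv, rv = fit_stump(x, y - F)        # fit the current residuals
    F = F + lr * np.where(x <= t, lv, rv)  # additive update, damped by lr
```

Each round shrinks the remaining residuals; the learning rate trades off per-round progress against overfitting.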
Q10: What is the kernel trick and why is it useful?
Model Answer: The kernel trick allows algorithms that rely on dot products (like SVM) to operate in a high-dimensional feature space without explicitly computing the coordinates in that space. Instead of transforming x to phi(x) and computing phi(x_i) dot phi(x_j), we compute a kernel function K(x_i, x_j) = phi(x_i) dot phi(x_j) directly. Common kernels: Linear K(x,y) = x dot y, Polynomial K(x,y) = (x dot y + c)^d, RBF/Gaussian K(x,y) = exp(-gamma ||x-y||^2). The RBF kernel maps to an infinite-dimensional space. The trick is useful because: (1) it avoids the computational cost of explicitly computing high-dimensional features, (2) it enables learning nonlinear decision boundaries using linear algorithms, (3) it works even when the explicit feature map is infinite-dimensional (RBF). Mercer's theorem guarantees that any symmetric positive semi-definite function corresponds to an inner product in some feature space and can therefore be used as a kernel.
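The identity K(x, y) = phi(x) dot phi(y) can be checked numerically for the homogeneous degree-2 polynomial kernel in 2-D, where the explicit feature map is small enough to write out:

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for 2-D input:
    # (x . y)^2 equals the dot product of these 3-D features
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

def poly_kernel(x, y):
    # Kernel computed directly in the original input space
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
```

The kernel evaluates one dot product in 2-D instead of building the 3-D features; for higher degrees (or RBF) the explicit map grows huge or infinite while the kernel stays cheap.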
Q11: What is Naive Bayes and what makes it "naive"?
Model Answer: Naive Bayes is a probabilistic classifier based on Bayes' theorem: P(y|x) is proportional to P(y) * P(x|y). The "naive" assumption is that all features are conditionally independent given the class label, meaning P(x|y) = product of P(x_i|y). This assumption is almost always violated in practice, yet Naive Bayes often performs surprisingly well because: (1) classification only requires getting the ranking of P(y|x) right, not the exact probabilities, and (2) the bias from the independence assumption can actually reduce variance. Variants include Gaussian NB (continuous features), Multinomial NB (word counts, great for text), and Bernoulli NB (binary features). Naive Bayes is extremely fast to train, works well with high-dimensional data, and is a strong baseline for text classification. It struggles when feature interactions are crucial to the prediction.
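The independence assumption shows up as a simple sum of per-feature log-probabilities; a minimal Bernoulli NB sketch on a made-up two-word "spam" dataset:

```python
import numpy as np

# Toy word-presence data: columns = ["free", "meeting"], rows = emails
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

def fit_bernoulli_nb(X, y, alpha=1.0):
    # Laplace-smoothed P(x_i = 1 | class) and class priors
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    cond = np.array([(X[y == c].sum(0) + alpha) / ((y == c).sum() + 2 * alpha)
                     for c in classes])
    return classes, priors, cond

def predict(x, classes, priors, cond):
    # "Naive" step: log P(x|y) = sum of independent per-feature terms
    log_lik = (np.log(cond) * x + np.log(1 - cond) * (1 - x)).sum(1)
    return classes[np.argmax(np.log(priors) + log_lik)]

classes, priors, cond = fit_bernoulli_nb(X, y)
```

Working in log space avoids underflow when multiplying many small probabilities.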
Q12: What is the difference between hard and soft margin in SVM?
Model Answer: Hard-margin SVM requires that all training points are correctly classified with no points inside the margin. It only works when data is linearly separable and is extremely sensitive to outliers — a single misplaced point can drastically change the decision boundary. Soft-margin SVM introduces slack variables (xi_i) that allow some points to violate the margin or be misclassified. The objective becomes: minimize (1/2)||w||^2 + C * sum(xi_i). The hyperparameter C controls the tradeoff: large C penalizes misclassifications heavily (approaching hard margin), small C allows more violations (wider margin, more regularization). In practice, soft-margin SVM is almost always used because real data is rarely perfectly separable. Tuning C is equivalent to controlling the bias-variance tradeoff.
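The soft-margin primal objective is straightforward to evaluate once the slack variables are written in hinge form; a sketch (the toy weights and points are made up):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    # Soft-margin primal: (1/2)||w||^2 + C * sum(xi_i),
    # with slack xi_i = max(0, 1 - y_i * (w . x_i + b)) and y in {-1, +1}
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)
    return 0.5 * (w @ w) + C * slack.sum()
```

Points outside the margin contribute zero slack, so only margin violations are traded off against margin width via C.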
Q13: How does KNN work and what are its limitations?
Model Answer: K-Nearest Neighbors (KNN) is a non-parametric, instance-based algorithm. For a new data point, it finds the K closest training examples and predicts by majority vote (classification) or averaging (regression). The distance metric (usually Euclidean, Manhattan, or Minkowski) determines "closeness." Key limitations: (1) Computationally expensive at prediction time — O(n*d) per query without indexing structures. (2) Performance degrades in high dimensions (curse of dimensionality) because distances become less meaningful. (3) Sensitive to irrelevant features and feature scaling — always normalize features before using KNN. (4) No model is learned — the entire training set must be stored. (5) Choosing K matters: small K is noisy and sensitive to outliers, large K smooths too much. K is typically tuned via cross-validation, often choosing odd values to avoid ties in binary classification.
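A brute-force implementation makes the O(n*d) per-query cost visible; a minimal sketch on a made-up 1-D dataset:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Brute-force KNN: compute all n distances per query
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    votes = y_train[nearest]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]             # majority vote

X_train = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
```

With unscaled features the distance would be dominated by whichever feature has the largest range, which is why normalization matters so much here.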
Q14: What is the difference between multiclass and multilabel classification?
Model Answer: Multiclass classification assigns each instance to exactly one of three or more mutually exclusive classes (e.g., classifying an image as cat, dog, or bird). Approaches include: one-vs-rest (train N binary classifiers), one-vs-one (train N*(N-1)/2 classifiers), or native multiclass algorithms (softmax, decision trees). Multilabel classification assigns each instance to zero or more labels simultaneously (e.g., tagging a movie as both "comedy" and "romance"). Approaches: (1) binary relevance (independent binary classifier per label), (2) classifier chains (sequential, modeling label dependencies), (3) neural networks with sigmoid outputs per label. The evaluation metrics differ too: multiclass uses accuracy or macro/micro F1, while multilabel uses Hamming loss, subset accuracy, or per-label metrics. The key question to ask the interviewer: "Are the labels mutually exclusive?"
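The output-layer difference can be shown in a few lines: softmax forces one winner, while independent sigmoids allow any subset of labels (the raw scores below are made up):

```python
import numpy as np

# Hypothetical raw scores for one instance over 4 labels
scores = np.array([2.0, -1.0, 0.5, -3.0])

# Multiclass: softmax + argmax picks exactly one class
exp = np.exp(scores - scores.max())  # subtract max for numerical stability
softmax = exp / exp.sum()
multiclass_pred = int(np.argmax(softmax))

# Multilabel: independent sigmoid per label, each thresholded at 0.5
sigmoid = 1.0 / (1.0 + np.exp(-scores))
multilabel_pred = (sigmoid >= 0.5).astype(int)
```

Softmax probabilities are coupled (they must sum to 1), whereas the sigmoid outputs are independent, which is exactly the mutually-exclusive vs non-exclusive distinction.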
Q15: When would you choose a linear model over a complex one?
Model Answer: Choose a linear model when: (1) Interpretability is critical — regulated industries (healthcare, finance) often require explainable models. (2) Data is limited — simpler models generalize better with few samples (less variance). (3) Features are already well-engineered — domain experts have created meaningful features that have approximately linear relationships with the target. (4) Real-time prediction is needed — linear models are extremely fast at inference. (5) The relationship is approximately linear — many real-world problems are reasonably linear after proper feature engineering. (6) As a strong baseline — always start with a linear model; if it performs well, the added complexity of nonlinear models may not be justified. The practical rule: start simple, add complexity only when it demonstrably improves validation performance enough to justify the reduced interpretability and increased maintenance cost.
Interview Tip: When asked about any algorithm, structure your answer as: (1) What problem does it solve? (2) How does it work (intuition first, then math if asked)? (3) What are its assumptions? (4) Advantages and disadvantages? (5) When would you use it vs alternatives?
Lilly Tech Systems