Intermediate

Model Evaluation & Selection

15 interview questions and model answers on evaluation metrics, cross-validation, hyperparameter tuning, model selection, and handling class imbalance.

Q1: What is the confusion matrix and what information does it provide?

💡
Model Answer: A confusion matrix is a table that summarizes classification predictions vs actual labels. For binary classification, it has four cells: True Positives (TP) — correctly predicted positive, True Negatives (TN) — correctly predicted negative, False Positives (FP) — incorrectly predicted positive (Type I error), and False Negatives (FN) — incorrectly predicted negative (Type II error). From these four values, you can derive the standard threshold-based classification metrics: accuracy = (TP+TN)/(TP+TN+FP+FN), precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = 2*precision*recall/(precision+recall). The confusion matrix reveals the types of errors your model makes, which is far more informative than a single accuracy number, especially with imbalanced datasets.
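A minimal sketch of the four cells and the derived metrics, using scikit-learn on a toy set of illustrative labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive, 0 = negative (illustrative values)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

# scikit-learn orders the binary matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

Note the cell ordering: sklearn's layout ([[TN, FP], [FN, TP]]) trips up many candidates who expect TP in the top-left.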

Q2: Explain precision and recall. When would you prioritize one over the other?

💡
Model Answer: Precision = TP/(TP+FP) answers "Of all positive predictions, how many were actually positive?" Recall = TP/(TP+FN) answers "Of all actual positives, how many did we find?" Prioritize precision when false positives are costly: spam filtering (you do not want legitimate emails marked as spam), content recommendation (irrelevant suggestions erode trust). Prioritize recall when false negatives are costly: cancer screening (missing a cancer case is dangerous), fraud detection (missing fraud is expensive), search engines (users want all relevant results). There is always a tradeoff: increasing precision typically decreases recall and vice versa. You can control this tradeoff by adjusting the classification threshold. The F1 score balances both, but for most real applications, one type of error is more costly than the other.
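The threshold-driven tradeoff can be sketched with hypothetical predicted probabilities (illustrative numbers, sklearn assumed):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities and true labels
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.55, 0.9])

# Raising the threshold trades recall for precision
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On these numbers, threshold 0.3 gives perfect recall with modest precision, while 0.7 flips that, which is exactly the tradeoff to discuss in an interview.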

Q3: What is the F1 score and when is it useful?

💡
Model Answer: The F1 score is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). It ranges from 0 to 1, with 1 being perfect. The harmonic mean (rather than arithmetic mean) ensures that F1 is low when either precision or recall is low — a model must perform well on both to achieve a high F1. It is useful when: (1) you need a single metric that balances precision and recall, (2) the class distribution is imbalanced (accuracy is misleading), (3) you want to compare models on a balanced measure. For multiclass problems, there are three averaging methods: micro (aggregate TP/FP/FN across all classes), macro (average F1 per class, treating all classes equally), and weighted (average F1 per class, weighted by class frequency). Use macro-F1 when all classes are equally important regardless of size.
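The harmonic-vs-arithmetic point is worth demonstrating with a quick computation (illustrative precision/recall values):

```python
# A model with lopsided precision and recall
precision, recall = 0.9, 0.1

arithmetic = (precision + recall) / 2  # 0.5 — hides the weakness
f1 = 2 * precision * recall / (precision + recall)  # 0.18 — exposes it

print(f"arithmetic mean = {arithmetic}, F1 = {f1}")
```

The harmonic mean is dominated by the smaller value, so F1 stays low until both precision and recall are decent.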

Q4: Explain AUC-ROC. What does it measure?

💡
Model Answer: The ROC (Receiver Operating Characteristic) curve plots True Positive Rate (recall) vs False Positive Rate (FP/(FP+TN)) at all classification thresholds. AUC (Area Under the Curve) summarizes this curve as a single number between 0 and 1. An AUC of 0.5 means the model is no better than random guessing; 1.0 means perfect separation. AUC measures the model's ability to rank positive examples higher than negative examples — specifically, it equals the probability that a randomly chosen positive example gets a higher predicted score than a randomly chosen negative example. Advantages: threshold-independent, works across all operating points. Limitations: can be misleading with severe class imbalance (use AUC-PR — Precision-Recall curve — instead), and does not tell you which threshold to use in practice. AUC-ROC is best for comparing models; for deployment, you still need to choose a specific threshold.
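The ranking interpretation can be verified directly on synthetic scores (a sketch, assuming scikit-learn and NumPy; the score distributions are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical scores: positives tend to score higher than negatives
y_true = np.array([0] * 50 + [1] * 50)
y_score = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(1.5, 1.0, 50)])

auc = roc_auc_score(y_true, y_score)

# AUC = P(random positive scores higher than random negative); with
# continuous scores (no ties) the pairwise estimate matches exactly
rank_prob = np.mean([p > n for p in y_score[50:] for n in y_score[:50]])
assert abs(auc - rank_prob) < 1e-6
```

This equivalence (AUC as the Mann-Whitney U statistic normalized by the number of positive-negative pairs) is a common interview follow-up.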

Q5: Why is accuracy misleading for imbalanced datasets?

💡
Model Answer: With imbalanced data (e.g., 99% negative, 1% positive), a model that simply predicts "negative" for everything achieves 99% accuracy while being completely useless — it catches zero positive cases. Accuracy treats all errors equally, but in most imbalanced problems, the minority class is the one you actually care about (fraud, disease, defects). Better metrics for imbalanced data include: precision, recall, F1 (especially macro-F1 or per-class), AUC-PR (Precision-Recall curve, which focuses on the positive class), and balanced accuracy (average of recall for each class). Additionally, when reporting results on imbalanced data, always show the confusion matrix so the types of errors are visible.
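The degenerate "always predict negative" model from the example above can be checked in a few lines (sklearn assumed):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# 99% negative, 1% positive
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)  # always predict "negative"

acc = accuracy_score(y_true, y_pred)  # 0.99 — looks great
rec = recall_score(y_true, y_pred)  # 0.0 — catches zero positives
bal = balanced_accuracy_score(y_true, y_pred)  # 0.5 — random-level
```

Recall and balanced accuracy immediately expose what the 99% accuracy hides.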

Q6: What is cross-validation and what variants exist?

💡
Model Answer: Cross-validation estimates generalization performance by systematically rotating training and validation sets. Variants: (1) K-Fold CV — split into K equal folds, train on K-1, validate on the held-out fold, repeat K times. Standard: K=5 or K=10. (2) Stratified K-Fold — preserves class proportions in each fold; essential for imbalanced datasets. (3) Leave-One-Out (LOO) — K=N, use each sample once as validation. Low bias but high variance and computationally expensive. Good for very small datasets. (4) Repeated K-Fold — run K-fold multiple times with different random splits, average results. Reduces variance of the estimate. (5) Time Series CV — expanding or sliding window to respect temporal order; never use future data to predict the past. (6) Group K-Fold — ensures all samples from the same group (e.g., same patient) are in the same fold, preventing data leakage.
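Why stratification matters for variant (2) can be sketched on synthetic imbalanced labels (sklearn assumed):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 95 + [1] * 5)  # imbalanced: only 5 positives
X = np.zeros((100, 1))  # dummy features

# StratifiedKFold distributes the 5 positives evenly: one per fold.
# A plain KFold on shuffled data could leave some folds with none.
counts = [y[val_idx].sum() for _, val_idx in StratifiedKFold(n_splits=5).split(X, y)]
print(counts)
```

With only five positives, a fold that happens to contain none of them produces an undefined or meaningless recall, which stratification prevents.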

Q7: How do you perform hyperparameter tuning?

💡
Model Answer: Hyperparameter tuning finds the best configuration by evaluating model performance across different settings. Methods: (1) Grid Search — exhaustively evaluates all combinations of specified parameter values. Simple but exponentially expensive as the number of parameters grows. (2) Random Search — samples parameter combinations randomly. Empirically shown to be more efficient than grid search because not all parameters are equally important (Bergstra & Bengio, 2012). (3) Bayesian Optimization (Optuna, Hyperopt) — builds a probabilistic model of the objective function and intelligently selects the next configuration to evaluate. Most sample-efficient but more complex to implement. (4) Successive Halving / Hyperband — gives many configurations a small budget, then repeatedly promotes the best performers with more resources, quickly eliminating poor configurations. All methods use cross-validation to evaluate each configuration. Important: always tune on validation data, report final results on a separate test set to avoid optimistic estimates.
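A compact random-search sketch using scikit-learn and SciPy (synthetic data; the search space for C is an illustrative choice):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Sample C (inverse regularization strength) from a log-uniform
# distribution — continuous distributions are a key advantage of
# random search over a fixed grid
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=5,
    scoring="f1",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Note that best_score_ here is still a cross-validated estimate on the tuning data; a held-out test set is needed for the final report, as the answer stresses.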

Q8: How do you handle class imbalance in a dataset?

💡
Model Answer: Strategies for class imbalance: Data-level: (1) Oversampling the minority class — SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic examples by interpolating between existing minority samples. (2) Undersampling the majority class — faster but loses information. (3) Combination (SMOTE + Tomek links or edited nearest neighbors). Algorithm-level: (4) Class weights — assign higher loss weight to minority class samples (most frameworks support class_weight parameter). (5) Cost-sensitive learning — modify the loss function to penalize minority class errors more heavily. (6) Ensemble methods — BalancedRandomForest, EasyEnsemble, which combine undersampling with ensemble learning. Evaluation-level: (7) Use appropriate metrics (F1, AUC-PR, not accuracy). (8) Use stratified cross-validation. My default approach: start with class weights (simplest), then try SMOTE if needed, and always use stratified CV with appropriate metrics.
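The "start with class weights" default can be sketched on synthetic imbalanced data (sklearn assumed; the 95/5 split and model choice are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalance
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# class_weight="balanced" reweights the loss inversely to class frequency
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

rec_plain = recall_score(y_te, plain.predict(X_te))
rec_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"minority recall: plain={rec_plain:.2f}, weighted={rec_weighted:.2f}")
```

On data like this, the weighted model typically recovers substantially more of the minority class, usually at some cost in precision, which is the tradeoff to surface in the interview.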

Q9: What is the difference between micro, macro, and weighted averaging for metrics?

💡
Model Answer: These averaging methods aggregate per-class metrics into a single number for multiclass problems. Micro-averaging aggregates the TP, FP, and FN across all classes first, then computes the metric. It is dominated by the performance on frequent classes. Micro-F1 equals overall accuracy for multiclass (when every sample gets one label). Macro-averaging computes the metric independently for each class and then takes the unweighted mean. It treats all classes equally regardless of size, making it sensitive to performance on rare classes. Use macro when all classes are equally important. Weighted averaging computes the metric per class and takes a weighted mean using class frequencies. It accounts for class imbalance while giving more weight to larger classes. In practice: use macro-F1 when minority class performance matters, micro-F1 when overall correctness matters, and weighted-F1 as a compromise.
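The three averages diverge visibly when a rare class is predicted poorly — a sketch on illustrative three-class labels (sklearn assumed):

```python
from sklearn.metrics import accuracy_score, f1_score

# Three classes; class 2 is rare and always misclassified
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

micro = f1_score(y_true, y_pred, average="micro")  # dominated by big classes
macro = f1_score(y_true, y_pred, average="macro")  # dragged down by class 2
weighted = f1_score(y_true, y_pred, average="weighted")  # in between

# Micro-F1 equals accuracy in single-label multiclass
assert abs(micro - accuracy_score(y_true, y_pred)) < 1e-9
print(f"micro={micro:.3f}, macro={macro:.3f}, weighted={weighted:.3f}")
```

Here micro-F1 still looks healthy while macro-F1 reveals the total failure on the rare class — the same accuracy-masking effect discussed in Q5.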

Q10: How do you choose between multiple models?

💡
Model Answer: Model selection considers multiple factors beyond raw accuracy: (1) Performance — compare using cross-validated metrics appropriate for your problem (not just accuracy). Use statistical tests (paired t-test, Wilcoxon signed-rank) to determine if differences are significant. (2) Complexity and interpretability — simpler models are preferred if performance is comparable (Occam's razor). Regulated industries may require interpretable models. (3) Training and inference time — a model that is 1% better but 100x slower may not be worth it in production. (4) Data requirements — complex models need more data; with limited data, simpler models may generalize better. (5) Maintenance cost — consider model retraining frequency, monitoring complexity, and debugging difficulty. (6) Robustness — how sensitive is the model to distribution shift, adversarial inputs, or missing features? My framework: start with a simple baseline, add complexity only when it demonstrably improves the metric that matters, factoring in all practical constraints.
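Factor (1), the paired statistical comparison, can be sketched by scoring two candidate models on the same CV folds (sklearn and SciPy assumed; the models and data are illustrative):

```python
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Same cv object → same folds for both models, so the scores are paired
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
scores_b = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="f1")

stat, p_value = wilcoxon(scores_a, scores_b)
print(f"mean F1: {scores_a.mean():.3f} vs {scores_b.mean():.3f}, p={p_value:.3f}")
```

Pairing by fold is the important detail: it controls for fold difficulty, which an unpaired comparison of two score lists would ignore.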

Q11: What is the difference between AUC-ROC and AUC-PR?

💡
Model Answer: AUC-ROC plots TPR vs FPR. It can be overly optimistic on imbalanced datasets because FPR uses TN in the denominator — with many negatives, even many false positives result in a low FPR, making the curve look good. AUC-PR (Precision-Recall) plots precision vs recall. Since precision uses TP+FP (not TN), it is directly affected by false positives and gives a more informative picture when the positive class is rare. Example: with 99% negatives and 1% positives, a model producing 100 FP out of 9,900 negatives has FPR = 1% (looks great in ROC) but precision = TP/(TP+100) (might be poor in PR). Rule of thumb: use AUC-ROC when classes are roughly balanced or when you care about both positive and negative class performance. Use AUC-PR when the positive class is rare and is the class you care about (fraud, disease, defects).
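The optimism of ROC under imbalance can be reproduced on synthetic scores (a sketch, assuming sklearn; the 1% prevalence and score distributions are made up):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
# 1% positives; positives score only slightly higher on average
y_true = np.array([0] * 9900 + [1] * 100)
y_score = np.concatenate([rng.normal(0, 1, 9900), rng.normal(1, 1, 100)])

roc = roc_auc_score(y_true, y_score)  # looks respectable
pr = average_precision_score(y_true, y_score)  # much lower: positives are rare
print(f"AUC-ROC={roc:.3f}, AUC-PR={pr:.3f}")
```

average_precision_score is sklearn's standard summary of the PR curve; the large gap between the two numbers on the same predictions is the whole point of this question.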

Q12: What is a calibrated model and why does calibration matter?

💡
Model Answer: A model is well-calibrated if its predicted probabilities match the true frequencies. For example, among all cases where the model predicts "70% probability of class A," approximately 70% should actually be class A. Calibration matters when you use predicted probabilities for decision-making (not just the argmax prediction): insurance pricing, medical diagnosis thresholds, ranking systems, and any application where you combine predictions with costs. Some models are naturally well-calibrated (logistic regression with sufficient data), while others are not (random forests tend to predict near 0 or 1, SVMs output distances not probabilities, neural networks are often overconfident). Calibration methods include Platt scaling (fit a sigmoid on the outputs) and isotonic regression (non-parametric, needs more data). Evaluate calibration using reliability diagrams (calibration plots) and the Brier score.
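Platt scaling via sklearn's wrapper can be sketched as follows (synthetic data; the random forest is just a convenient example of a typically miscalibrated model):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# method="sigmoid" is Platt scaling; method="isotonic" needs more data
cal = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="sigmoid", cv=5
).fit(X_tr, y_tr)

b_raw = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])
b_cal = brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1])
print(f"Brier raw={b_raw:.4f}, calibrated={b_cal:.4f}")
```

A lower Brier score indicates better-calibrated probabilities; pairing this number with a reliability diagram gives the full picture the answer describes.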

Q13: What is stratified sampling and why is it important?

💡
Model Answer: Stratified sampling ensures that each subset (fold, split) maintains the same class distribution as the original dataset. It is critical for: (1) Imbalanced datasets — random splitting could result in folds with zero minority class samples, making evaluation unreliable. (2) Small datasets — random variation in class proportions across folds adds noise to performance estimates. (3) Cross-validation — stratified K-fold ensures each fold is representative, reducing variance in the performance estimate. In scikit-learn, use StratifiedKFold instead of KFold, and set the stratify parameter in train_test_split. Beyond class labels, you should also stratify by other important attributes (e.g., demographic groups, time periods) when their distribution matters for fair and representative evaluation.
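The stratify parameter in action (a sketch on synthetic 90/10 labels, sklearn assumed):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)  # 90/10 class ratio
X = np.zeros((100, 1))  # dummy features

# stratify=y preserves the 90/10 ratio exactly in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_tr.mean(), y_te.mean())  # both 0.10
```

Without stratify, a 20-sample test split of this data could easily contain zero or four positives purely by chance, distorting every positive-class metric.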

Q14: How do you detect and prevent data leakage during model evaluation?

💡
Model Answer: Detection: (1) suspiciously high validation/test performance (too good to be true), (2) significant performance drop in production vs development, (3) feature importance reveals features that logically should not be predictive. Prevention: (1) Split first, preprocess second — fit transformers (scalers, encoders, imputers) only on training data, then transform validation/test using the training statistics. (2) Use pipelines — scikit-learn Pipeline or ColumnTransformer encapsulate preprocessing and ensure it happens within each CV fold. (3) Time-based splits for temporal data — never use future data in training. (4) Audit features — check each feature for potential target leakage by asking "would this feature be available at prediction time?" (5) Group-aware splits — if samples from the same entity exist in both train and test, information leaks through the entity.
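Prevention point (2) in code — a Pipeline keeps preprocessing inside each CV fold (sklearn assumed, synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# The scaler is re-fit on only the training portion of every fold,
# so no validation-fold statistics leak into preprocessing
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

The leaky anti-pattern is calling StandardScaler().fit_transform(X) on the full dataset before cross-validation: the scaler then sees validation-fold statistics, which is exactly the "split first, preprocess second" rule being violated.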

Q15: What is the learning curve and what does it tell you?

💡
Model Answer: A learning curve plots model performance (y-axis) vs training set size (x-axis) for both training and validation sets. It diagnoses whether a model suffers from high bias or high variance: High bias (underfitting): both training and validation scores are low and converge to a low value. Adding more data will not help — you need a more complex model or better features. High variance (overfitting): training score is high but validation score is much lower, with a large gap. Adding more data will help because the gap narrows with more samples. Good fit: both scores are high and converge with a small gap. Learning curves inform critical decisions: should you collect more data (helps with high variance, not high bias) or invest in feature engineering / model complexity (helps with high bias)? This makes them one of the most practical diagnostic tools in ML.
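sklearn can compute the curve's data points directly (a sketch on synthetic data; the train_sizes grid is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, random_state=0)

# Scores at 5 training-set sizes from 10% to 100% of the data
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train={tr:.3f}, val={va:.3f}")
```

Plotting the two mean-score columns against sizes gives the diagnostic picture: watch whether the train/validation gap narrows (high variance, more data helps) or both curves plateau low (high bias, more data will not help).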

Interview Tip: When discussing metrics, always ask "What is the business cost of each type of error?" This shows you understand that metric choice is a business decision, not just a technical one. The best metric depends on the relative cost of false positives vs false negatives in the specific application.