Intermediate

Model Evaluation

Learn how to properly evaluate ML models: train/test splits, cross-validation, classification and regression metrics, confusion matrices, and the bias-variance tradeoff.

Train/Test Split

The most fundamental evaluation technique: split your data into a training set (used to train the model) and a test set (used to evaluate on unseen data). A common split is 80/20 or 70/30.

For hyperparameter tuning, use a three-way split: train/validation/test (e.g., 60/20/20). Train on the training set, tune on validation, and report final performance on the test set. Never use the test set for any decisions during development.

Python (sklearn)
from sklearn.model_selection import train_test_split

# Simple train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# stratify=y ensures class distribution is preserved
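
The three-way split described above can be sketched by calling train_test_split twice. This is a minimal illustration, not the only way to do it: the synthetic dataset from make_classification and the variable names X_temp, X_val, and y_val are assumptions introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Step 1: hold out 20% of the data as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset,
# which gives a 60/20/20 split overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Tune hyperparameters against X_val and y_val, and touch X_test only once, for the final report.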

Cross-Validation

A single train/test split may not be representative, because the score depends on which samples happen to land in the test set. K-Fold cross-validation splits the data into K roughly equal folds, trains on K-1 folds, and tests on the remaining fold. This is repeated K times (each fold serves as the test set once), and the K scores are averaged.

Python (Cross-Validation)
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
# (for a classifier, an integer cv uses stratified folds without shuffling)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Explicit Stratified K-Fold (preserves class balance, adds seeded shuffling)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='f1_weighted')

Tip: Always use stratified folds for classification. Regular K-Fold can create folds with very different class distributions, especially for imbalanced datasets. StratifiedKFold ensures each fold mirrors the overall class distribution.
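
To see why stratification matters, the sketch below counts positive labels per test fold for plain KFold and for StratifiedKFold. The label array (90 negatives followed by 10 positives) and the helper positives_per_fold are hypothetical and chosen to make the effect obvious. Plain KFold is used without shuffling here, which is its default.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself


def positives_per_fold(splitter):
    """Count positive labels in each test fold (illustrative helper)."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)       # [0, 0, 0, 0, 10]
print("StratifiedKFold:", stratified)  # [2, 2, 2, 2, 2]
```

With plain KFold on sorted labels, four of the five test folds contain no positives at all, so metrics such as recall are undefined or meaningless on them. Shuffling reduces the problem but does not guarantee balanced folds.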

Classification Metrics

Metric | Formula | When to Use
Accuracy | Correct / Total | Balanced classes. Misleading for imbalanced data.
Precision | TP / (TP + FP) | When false positives are costly (spam detection).
Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly (disease detection).
F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Balance between precision and recall. Imbalanced data.
ROC-AUC | Area under ROC curve | Overall classifier quality. Threshold-independent.
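
The formulas in the table can be checked by hand on a small example. The labels below are made up for illustration (8 negatives, 2 positives). The sketch counts TP, FP, FN, and TN directly and compares the results with the corresponding scikit-learn functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 8 negatives, 2 positives (imbalanced)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # 1
fp = sum(t == 0 and p == 1 for t, p in pairs)  # 1
fn = sum(t == 1 and p == 0 for t, p in pairs)  # 1
tn = sum(t == 0 and p == 0 for t, p in pairs)  # 7

accuracy = (tp + tn) / len(pairs)                   # 0.8
precision = tp / (tp + fp)                          # 0.5
recall = tp / (tp + fn)                             # 0.5
f1 = 2 * precision * recall / (precision + recall)  # 0.5

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

Accuracy looks respectable at 0.8, yet the classifier finds only half of the positives. This is the sense in which accuracy is misleading on imbalanced data.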

Confusion Matrix

A confusion matrix shows the breakdown of predictions vs. actual labels:

Python (Confusion Matrix)
from sklearn.metrics import (confusion_matrix,
    classification_report, roc_auc_score)

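# Fit on the training split first: cross_val_score fits clones,
# so `model` itself is still unfitted at this point
model.fit(X_train, y_train)
y_pred = model.predict(X_test)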

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
#              Predicted
#              Neg    Pos
# Actual Neg [ TN     FP ]
# Actual Pos [ FN     TP ]

# Full report
print(classification_report(y_test, y_pred))

# ROC-AUC (for binary classification)
y_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {auc:.4f}")

Regression Metrics

Metric | Description | Interpretation
MSE | Mean Squared Error | Average squared difference. Penalizes large errors heavily.
RMSE | Root Mean Squared Error | Square root of MSE. Same units as the target, which makes it the easiest to report.
MAE | Mean Absolute Error | Average absolute difference. More robust to outliers than MSE.
R-squared | Coefficient of determination | Proportion of variance explained. 1.0 is perfect; 0.0 means the model is no better than predicting the mean; negative values mean it is worse.
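
As a rough illustration of the table, the sketch below computes each metric on four made-up target values. RMSE is taken as the square root of MSE with NumPy instead of relying on a library flag. The arrays and the mean_baseline name are assumptions for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions; errors are -1, 0, 1, 3
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 9) / 4 = 2.75
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 3) / 4 = 1.25
r2 = r2_score(y_true, y_pred)              # 1 - 11 / 20 = 0.45

# Predicting the mean for every sample gives R-squared of exactly 0
mean_baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, mean_baseline)

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
print(f"R2 of mean baseline: {r2_baseline:.3f}")
```

The single error of 3 contributes 9 of the 11 units of squared error, which shows how MSE and RMSE weight large errors more heavily than MAE does.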

Learning Curves

Learning curves plot training and validation performance against the number of training samples. They help diagnose whether your model suffers from:

  • High bias (underfitting): Both training and validation scores are low and converge. The model is too simple. Solution: use a more complex model, add features.
  • High variance (overfitting): Training score is high but validation score is much lower. The model memorizes training data. Solution: more data, regularization, simpler model.
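
One way to obtain the numbers behind a learning curve is scikit-learn's learning_curve function. The sketch below is a minimal example under assumed settings: synthetic data, a logistic regression model, and five training sizes. It prints the mean scores; plotting is left out to keep it short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (an assumption for illustration only)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Evaluate the model at five training-set sizes, each with 5-fold CV
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Each row holds the per-fold scores for one training size
for n, tr, va in zip(
    train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)
):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```

A persistent gap between the two columns points toward high variance, while two low, similar values point toward high bias.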

Bias-Variance Tradeoff

The fundamental tension in machine learning:

  • Bias: Error from overly simplistic models that cannot capture the true pattern. High bias = underfitting.
  • Variance: Error from models that are too sensitive to training data noise. High variance = overfitting.
  • Goal: Find the sweet spot where total expected error is minimized. For squared error, total error decomposes into bias squared + variance + irreducible noise, so reducing one term often increases the other.

Symptom | Diagnosis | Solution
High training error, high test error | Underfitting (high bias) | More complex model, more features, less regularization
Low training error, high test error | Overfitting (high variance) | More data, regularization, simpler model, dropout (for neural networks)
Low training error, low test error | Good fit | Deploy and monitor
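
The symptoms in the table can be reproduced by varying model complexity and comparing training and test error. The sketch below uses decision trees of increasing depth on noisy synthetic data. The dataset, the depths tried, and the errors dictionary are assumptions for illustration, and exact numbers will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (an assumption for illustration only)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

errors = {}
for depth in (1, 4, None):  # None lets the tree grow until it fits the training set
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    errors[depth] = (train_err, test_err)
    print(f"max_depth={depth}: train error={train_err:.3f}, test error={test_err:.3f}")
```

A depth-1 tree typically shows similar, relatively high errors on both sets (high bias). The unrestricted tree drives training error to zero by memorizing the noisy labels, so any remaining test error reflects variance and noise.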

Evaluation checklist:

  1) Use stratified splits for classification.
  2) Use cross-validation, not just a single split.
  3) Choose metrics that match your business problem (precision vs. recall).
  4) Always compare against a simple baseline.
  5) Check for data leakage.
  6) Report confidence intervals or standard deviations.
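
Several checklist items can be combined in a few lines. The sketch below compares a model against a majority-class baseline (DummyClassifier) using stratified cross-validation and reports the mean and standard deviation of each. The imbalanced synthetic dataset, the choice of balanced accuracy as the metric, and the results dictionary are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic data, roughly 90% negatives (illustrative assumption)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

estimators = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, estimator in estimators.items():
    # Balanced accuracy averages recall over classes, so always predicting
    # the majority class scores 0.5 no matter how imbalanced the data is
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="balanced_accuracy")
    results[name] = scores
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

If the model does not clearly beat the baseline by more than the fold-to-fold spread, the apparent improvement may not be meaningful.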