Intermediate

Model Evaluation

Learn how to properly evaluate ML models: train/test splits, cross-validation, classification and regression metrics, confusion matrices, and the bias-variance tradeoff.

Train/Test Split

The most fundamental evaluation technique: split your data into a training set (used to train the model) and a test set (used to evaluate on unseen data). A common split is 80/20 or 70/30.

For hyperparameter tuning, use a three-way split: train/validation/test (e.g., 60/20/20). Train on the training set, tune on validation, and report final performance on the test set. Never use the test set for any decisions during development.
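
A three-way split can be sketched with two calls to train_test_split. This is a minimal sketch rather than a prescribed recipe: the synthetic dataset from make_classification and the sample count of 1000 are assumptions chosen for illustration, standing in for your own X and y.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with your own X, y)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Step 1: hold out 20% of the data as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: split the remaining 80% into 75/25, which gives 60/20 overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The second call uses test_size=0.25 because 25% of the remaining 80% equals 20% of the full dataset.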

Python (sklearn)

from sklearn.model_selection import train_test_split

# Simple train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# stratify=y ensures class distribution is preserved

Cross-Validation

A single train/test split may not be representative, because the score depends on which samples happen to land in the test set. K-Fold cross-validation splits the data into K roughly equal parts (folds), trains on K-1 folds, and tests on the remaining fold. This is repeated K times (each fold serves as the test set once), and the K scores are averaged.

Python (Cross-Validation)

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation (an integer cv with a classifier uses unshuffled StratifiedKFold)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Stratified K-Fold (preserves class balance)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='f1_weighted')

💡

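Use stratified folds for classification. Plain K-Fold can create folds with very different class distributions, especially for imbalanced datasets, while StratifiedKFold keeps each fold close to the overall class distribution. In scikit-learn, passing an integer cv with a classifier already stratifies, but without shuffling; pass a StratifiedKFold object when you want to control shuffling and the random seed.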
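
The difference between the two splitters is easiest to see on a small imbalanced example. The sketch below uses hypothetical labels (90 negatives followed by 10 positives, deliberately sorted) and a helper named positives_per_fold that is introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features do not matter for the split itself


def positives_per_fold(splitter):
    """Count the positive labels that land in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


plain = positives_per_fold(KFold(n_splits=5))            # unshuffled, contiguous folds
strat = positives_per_fold(StratifiedKFold(n_splits=5))  # class-aware folds

print("KFold:          ", plain)  # [0, 0, 0, 0, 10]
print("StratifiedKFold:", strat)  # [2, 2, 2, 2, 2]
```

With sorted labels and no shuffling, plain K-Fold puts every positive sample into the last fold, so four of the five test folds contain no positives at all. Shuffling reduces the problem but does not guarantee balanced folds.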

Classification Metrics

Metric	Formula	When to Use
Accuracy	Correct / Total	Balanced classes. Misleading for imbalanced data.
Precision	TP / (TP + FP)	When false positives are costly (e.g., a spam filter flagging legitimate email).
Recall (Sensitivity)	TP / (TP + FN)	When false negatives are costly (e.g., missing a disease in screening).
F1 Score	2 * (Precision * Recall) / (Precision + Recall)	Harmonic mean of precision and recall. Useful for imbalanced data.
ROC-AUC	Area under ROC curve	Overall ranking quality across all thresholds. Can look optimistic on heavily imbalanced data.
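
The formulas in the table can be checked by hand against scikit-learn. The labels below are hypothetical, chosen so that TP=3, FP=1, FN=2, TN=4; the sketch computes each metric from the counts and compares it with the library function.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels for illustration only
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# Count the four outcomes directly from the label pairs
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")

# The hand-computed values should agree with scikit-learn
sk_values = (
    accuracy_score(y_true, y_pred),
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    f1_score(y_true, y_pred),
)
print(sk_values)
```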

Confusion Matrix

A confusion matrix shows the breakdown of predictions vs. actual labels. In scikit-learn, rows are the actual classes and columns are the predicted classes:

Python (Confusion Matrix)

from sklearn.metrics import (confusion_matrix,
    classification_report, roc_auc_score)

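# Assumes model, X_train, X_test, y_train, y_test from the earlier snippets.
# cross_val_score fits clones, so the model itself must still be fitted here.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)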

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
#              Predicted
#              Neg    Pos
# Actual Neg [ TN     FP ]
# Actual Pos [ FN     TP ]

# Full report
print(classification_report(y_test, y_pred))

# ROC-AUC (for binary classification)
y_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {auc:.4f}")

Regression Metrics

Metric	Description	Interpretation
MSE	Mean Squared Error	Average squared difference. Penalizes large errors heavily.
RMSE	Root Mean Squared Error	Square root of MSE. Same units as the target, which makes it easy to report.
MAE	Mean Absolute Error	Average absolute difference. More robust to outliers than MSE.
R-squared	Coefficient of determination	Proportion of variance explained. 1.0 is perfect; 0.0 means the model is no better than predicting the mean; negative values mean it is worse than that.
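
As a rough illustration of the table, the sketch below computes all four metrics on a tiny set of hypothetical values. RMSE is taken as the square root of MSE with NumPy, which avoids depending on a particular scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # back in the target's units
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
# MSE=2.750 RMSE=1.658 MAE=1.250 R2=0.450
```

The single error of 3 contributes 9 of the 11 units of squared error, which is why MSE and RMSE react more strongly to it than MAE does.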

Learning Curves

Learning curves plot training and validation performance against the number of training samples. They help diagnose whether your model suffers from:

High bias (underfitting): Both training and validation scores are low and converge. The model is too simple. Solution: use a more complex model, add features.
High variance (overfitting): Training score is high but validation score is much lower. The model memorizes training data. Solution: more data, regularization, simpler model.
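
One way to produce the numbers behind a learning curve is scikit-learn's learning_curve function. The sketch below is a minimal example under stated assumptions: the synthetic dataset, the LogisticRegression model, and the five training sizes are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption: replace with your own X, y)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Evaluate the model at 5 training-set sizes, each with 5-fold cross-validation
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Each row holds the per-fold scores for one training size; average over folds
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={int(n):4d}  train={tr:.3f}  val={va:.3f}")
```

A large, persistent gap between the two columns points to high variance; two low scores that sit close together point to high bias.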

Bias-Variance Tradeoff

The fundamental tension in machine learning:

Bias: Error from overly simplistic models that cannot capture the true pattern. High bias = underfitting.
Variance: Error from models that are too sensitive to training data noise. High variance = overfitting.
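Goal: Find the sweet spot where the expected error (bias squared + variance + irreducible noise) is minimized. Reducing one of the first two terms usually increases the other; the noise term cannot be reduced by any model.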
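
For squared-error loss, the tradeoff can be written as a decomposition of the expected test error at a point x. The notation is an assumption introduced here: the data follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the model fitted on a random training set, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```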

Symptom	Diagnosis	Solution
High training error, high test error	Underfitting (high bias)	More complex model, more features, less regularization
Low training error, high test error	Overfitting (high variance)	More data, regularization, simpler model, dropout (for neural networks)
Low training error, low test error	Good fit	Deploy and monitor
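
The symptoms in the table can be reproduced on synthetic data. The sketch below fits decision trees of increasing depth to a noisy sine curve; the data-generating function, the noise level, and the depths are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a sine curve plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# depth 1 is very simple, depth 4 is moderate, None grows the tree fully
results = {}
for depth in (1, 4, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = (
        mean_squared_error(y_train, tree.predict(X_train)),
        mean_squared_error(y_test, tree.predict(X_test)),
    )
    print(f"max_depth={str(depth):>4}  "
          f"train MSE={results[depth][0]:.3f}  test MSE={results[depth][1]:.3f}")
```

The fully grown tree memorizes the training set (training error of zero) yet still makes errors on the test set, which is the overfitting row of the table. The depth-1 tree has a high training error because it is too simple to follow the curve.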

✅

Evaluation checklist: 1) Use stratified splits for classification. 2) Use cross-validation, not just a single split. 3) Choose metrics that match your business problem (precision vs. recall). 4) Always compare against a simple baseline. 5) Check for data leakage. 6) Report confidence intervals or standard deviations.
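
Items 2, 4, and 6 of the checklist can be combined in a few lines. The sketch below is one possible setup, not a prescribed one: the imbalanced synthetic dataset, the DummyClassifier baseline, and the LogisticRegression model are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data, roughly 90% negatives (assumption: use your own X, y)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: always predict the majority class
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=cv, scoring="accuracy"
)
model_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

print(f"Baseline: {baseline_scores.mean():.3f} (+/- {baseline_scores.std():.3f})")
print(f"Model:    {model_scores.mean():.3f} (+/- {model_scores.std():.3f})")
```

On data this imbalanced, the majority-class baseline already reaches high accuracy, so a model's accuracy only means something relative to that number.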
