Model Evaluation
Learn how to properly evaluate ML models: train/test splits, cross-validation, classification and regression metrics, confusion matrices, and the bias-variance tradeoff.
Train/Test Split
The most fundamental evaluation technique: split your data into a training set (used to train the model) and a test set (used to evaluate on unseen data). A common split is 80/20 or 70/30.
For hyperparameter tuning, use a three-way split: train/validation/test (e.g., 60/20/20). Train on the training set, tune on validation, and report final performance on the test set. Never use the test set for any decisions during development.

    from sklearn.model_selection import train_test_split

    # Simple train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # stratify=y ensures class distribution is preserved

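
The three-way split described above can be built from two calls to `train_test_split`. This is a minimal sketch, not the only way to do it: the synthetic dataset from `make_classification` and the intermediate names `X_temp` and `y_temp` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: any feature matrix X and labels y work)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Step 1: hold out 20% of all data as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Tune hyperparameters against `X_val` only, and touch `X_test` once at the very end.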
Cross-Validation
The score from a single train/test split depends on which samples happen to land in the test set, so it may not be representative. K-Fold cross-validation splits the data into K roughly equal folds, trains on K-1 folds, and tests on the remaining fold. This is repeated K times (each fold serves as the test set exactly once), and the K scores are averaged.

    from sklearn.model_selection import cross_val_score, StratifiedKFold
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=100, random_state=42)

    # 5-fold cross-validation (for a classifier, an integer cv is already
    # stratified, but the folds are not shuffled)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

    # Explicit Stratified K-Fold (preserves class balance) with shuffling
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=skf, scoring='f1_weighted')

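
To make the K-Fold procedure concrete, the loop below sketches roughly what happens on each fold: fit a fresh copy of the model on K-1 folds and score it on the held-out fold. The synthetic dataset and the names `fold_model` and `fold_scores` are assumptions for illustration. In practice `cross_val_score` handles this for you.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # clone() gives an unfitted copy, so no state leaks between folds
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    score = fold_model.score(X[test_idx], y[test_idx])  # accuracy on held-out fold
    fold_scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.4f}")

print(f"Mean: {np.mean(fold_scores):.4f} (+/- {np.std(fold_scores):.4f})")
```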
Classification Metrics
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | Correct / Total | Balanced classes. Misleading for imbalanced data. |
| Precision | TP / (TP + FP) | When false positives are costly (spam detection). |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly (disease detection). |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Balance between precision and recall. Useful for imbalanced data. |
| ROC-AUC | Area under ROC curve | Overall classifier quality. Threshold-independent. |
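
As a quick check on the formulas in the table, the sketch below counts TP, FP, FN, and TN by hand for a small set of made-up labels (an assumption for illustration, not real data) and compares the results with scikit-learn's metric functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up binary labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # 3
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # 1
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # 2
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # 4

accuracy = (tp + tn) / len(pairs)                   # 7 / 10 = 0.7
precision = tp / (tp + fp)                          # 3 / 5  = 0.6
recall = tp / (tp + fn)                             # 3 / 4  = 0.75
f1 = 2 * precision * recall / (precision + recall)  # about 0.667

print(f"manual : acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")
print(
    f"sklearn: acc={accuracy_score(y_true, y_pred):.3f} "
    f"prec={precision_score(y_true, y_pred):.3f} "
    f"rec={recall_score(y_true, y_pred):.3f} "
    f"f1={f1_score(y_true, y_pred):.3f}"
)
```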
Confusion Matrix
A confusion matrix shows the breakdown of predictions vs. actual labels:

    from sklearn.metrics import (confusion_matrix, classification_report,
                                 roc_auc_score)

    # Fit on the training split first; cross_val_score does not fit `model` in place
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:")
    print(cm)
    #               Predicted
    #               Neg   Pos
    # Actual Neg  [ TN    FP ]
    # Actual Pos  [ FN    TP ]

    # Full report
    print(classification_report(y_test, y_pred))

    # ROC-AUC (for binary classification)
    y_prob = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_prob)
    print(f"ROC-AUC: {auc:.4f}")

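
The confusion matrix depends on the decision threshold applied to the predicted probabilities, whereas ROC-AUC is computed from the probabilities themselves. The sketch below illustrates this under assumed conditions: a synthetic imbalanced dataset, a `LogisticRegression` model, and three arbitrary thresholds. For binary problems, `.ravel()` unpacks the matrix in the order TN, FP, FN, TP.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 80% negatives and 20% positives (assumption)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

recalls = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    recalls[threshold] = tp / (tp + fn)
    print(f"threshold={threshold}: TN={tn} FP={fp} FN={fn} TP={tp} "
          f"recall={recalls[threshold]:.3f}")

# One number for all thresholds
auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {auc:.4f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases as the threshold drops; the cost is usually more false positives.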
Regression Metrics
| Metric | Description | Interpretation |
|---|---|---|
| MSE | Mean Squared Error | Average squared difference. Penalizes large errors heavily. |
| RMSE | Root Mean Squared Error | Square root of MSE. Same units as the target, which makes it easier to interpret than MSE. |
| MAE | Mean Absolute Error | Average absolute difference. More robust to outliers than MSE. |
| R-squared | Coefficient of determination | Proportion of variance explained. 1.0 is perfect; 0.0 means the model is no better than predicting the mean; negative values mean it is worse than predicting the mean. |
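
The metrics in the table can be computed with scikit-learn as sketched below. The four target values and predictions are made up for illustration. RMSE is taken as the square root of MSE with NumPy so the snippet does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions (assumption, not real data)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # back in the units of the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE:  {mse:.4f}")   # 0.8750
print(f"RMSE: {rmse:.4f}")  # 0.9354
print(f"MAE:  {mae:.4f}")   # 0.7500
print(f"R^2:  {r2:.4f}")    # 0.7241
```

The single largest error (1.5) contributes 2.25 of the 3.5 total squared error but only 1.5 of the 3.0 total absolute error, which is why MSE and RMSE react more strongly to outliers than MAE.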
Learning Curves
Learning curves plot training and validation performance against the number of training samples. They help diagnose whether your model suffers from:
- High bias (underfitting): Both training and validation scores are low and converge. The model is too simple. Solution: use a more complex model, add features.
- High variance (overfitting): Training score is high but validation score is much lower. The model memorizes training data. Solution: more data, regularization, simpler model.
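
The data behind a learning curve can be generated with scikit-learn's `learning_curve`. The sketch below is one possible setup under stated assumptions: a synthetic dataset, a `LogisticRegression` model, and five training-set sizes. It prints the mean scores instead of plotting them, so it needs no plotting library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% to 100% of each training fold
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# Each row is one training size; each column is one CV fold
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={int(n):4d}  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```

A gap that stays large as `n` grows points toward high variance; two low scores that have already converged point toward high bias.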
Bias-Variance Tradeoff
The fundamental tension in machine learning:
- Bias: Error from overly simplistic models that cannot capture the true pattern. High bias = underfitting.
- Variance: Error from models that are too sensitive to training data noise. High variance = overfitting.
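- Goal: Find the sweet spot where total expected error (bias squared + variance + irreducible noise) is minimized. Reducing one of bias or variance typically increases the other.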
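
For squared-error loss, the total error can be written out explicitly. The decomposition below assumes the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and that $\hat{f}$ is a model fit on a randomly drawn training set. These symbols are introduced here for illustration and do not appear elsewhere in this section.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

The noise term does not depend on the model, so only the first two terms can be traded against each other.
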
| Symptom | Diagnosis | Solution |
|---|---|---|
| High training error, high test error | Underfitting (high bias) | More complex model, more features, less regularization |
| Low training error, high test error | Overfitting (high variance) | More data, regularization, simpler model, dropout (for neural networks) |
| Low training error, low test error | Good fit | Deploy and monitor |
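
The symptoms in the table can be reproduced by varying model complexity and comparing training and test error. The sketch below uses decision tree depth as the complexity knob. The synthetic dataset (with 10% label noise via `flip_y`) and the chosen depths are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise so that memorizing it hurts (assumption)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

results = {}
for depth in (1, 3, 5, 10, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    results[depth] = (train_err, test_err)
    print(f"max_depth={str(depth):>4}  train error={train_err:.3f}  "
          f"test error={test_err:.3f}")
```

Very shallow trees tend to show high error on both sets (underfitting), while the unrestricted tree drives training error to zero and leaves a gap to the test error (overfitting).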