Intermediate

Cross-Validation Techniques

K-fold, stratified, time series, and group cross-validation. Part of the AI Model Testing Fundamentals course at AI School by Lilly Tech Systems.

Why Cross-Validation Matters

A single train-test split gives you one estimate of model performance, and that estimate can be misleading. If you happen to get an easy test set, your model looks better than it is. If you get a hard one, it looks worse. Cross-validation solves this by systematically evaluating your model on multiple different splits of the data, giving you a more robust and reliable performance estimate.

Cross-validation is essential for model selection, hyperparameter tuning, and providing honest performance estimates. Without it, you risk overfitting to a specific test set and deploying a model that performs worse than expected in production.

K-Fold Cross-Validation

The most common cross-validation technique splits the data into K equally-sized folds. The model is trained K times, each time using K-1 folds for training and the remaining fold for validation. The final performance metric is the average across all K runs.

from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

Choosing the Right K

The choice of K involves a bias-variance tradeoff. K=5 or K=10 are the most common choices. Smaller K values (like K=2 or K=3) use less training data per fold, introducing more bias. Larger K values (like K=20 or leave-one-out) reduce bias but increase variance and computational cost. For most practical purposes, K=5 or K=10 strikes a good balance.
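The tradeoff above can be observed empirically. This sketch reuses the synthetic dataset and model from the earlier example and compares the mean and spread of scores for a few K values (smaller K → each model trains on less data; larger K → more runs, higher cost):

```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Compare estimate stability and cost across common K choices
for k in (3, 5, 10):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    print(f"K={k:2d}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

In practice the means are usually close while the per-fold standard deviation and runtime change with K, which is why K=5 or K=10 is the usual compromise.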

Stratified K-Fold

Standard K-fold can produce folds with very different class distributions, especially for imbalanced datasets. Stratified K-fold ensures that each fold maintains the same proportion of classes as the original dataset. This is critical for imbalanced classification problems:

from sklearn.model_selection import StratifiedKFold

# Always use stratified K-fold for classification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='f1_weighted')
print(f"Stratified CV F1: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

💡 Best practice: Always use stratified K-fold for classification tasks. Standard K-fold can create folds where minority classes are completely absent, giving unreliable results. Scikit-learn's cross_val_score uses stratified K-fold by default for classifiers.

Time Series Cross-Validation

Standard cross-validation is invalid for time series data because it allows the model to train on future data and predict the past (data leakage). Time series cross-validation uses an expanding or sliding window approach that always trains on past data and tests on future data:

  • Expanding window — Training set grows with each fold, always predicting the next time period
  • Sliding window — Training set has a fixed size that slides forward through time

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: Train size={len(train_idx)}, Test size={len(test_idx)}")
    print(f"  Train indices: {train_idx[0]}-{train_idx[-1]}")
    print(f"  Test indices:  {test_idx[0]}-{test_idx[-1]}")
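The loop above uses the default expanding window. The sliding-window variant can be sketched with TimeSeriesSplit's max_train_size parameter, which caps the training window so it slides forward instead of growing (the data here is a toy array of 100 time-ordered samples, introduced just for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # 100 time-ordered samples

# Fixed-size training window that slides forward through time
tscv = TimeSeriesSplit(n_splits=5, max_train_size=30)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: train {train_idx[0]}-{train_idx[-1]} "
          f"({len(train_idx)} samples), test {test_idx[0]}-{test_idx[-1]}")
```

Every training window contains at most 30 samples, and each test fold still lies strictly after its training window, so there is no leakage from the future.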

Group K-Fold

When your data has natural groupings (e.g., multiple samples from the same patient, user, or session), standard cross-validation can leak information between folds. Group K-fold ensures that all samples from the same group appear in only one fold, preventing this leakage.
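A minimal sketch of this, using a hypothetical dataset of 12 samples drawn from 4 patients (the group labels and data here are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 12 samples, 3 per patient
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # patient IDs

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=groups)):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    # Each patient's samples land entirely in one fold, so the overlap is empty
    print(f"Fold {fold}: test groups={sorted(test_groups)}, "
          f"overlap={train_groups & test_groups}")
```

Because no patient appears in both training and test sets of the same fold, the model cannot score well simply by memorizing patient-specific quirks.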

Nested Cross-Validation

When you use cross-validation for both hyperparameter tuning and performance estimation, you need nested cross-validation. The outer loop estimates model performance, while the inner loop selects the best hyperparameters. This prevents optimistic bias from selecting hyperparameters on the same data used for evaluation.

Common pitfall: Using the same cross-validation loop for both hyperparameter tuning and performance estimation produces overly optimistic results. Always use nested cross-validation when tuning hyperparameters, or use a completely separate held-out test set.
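One common way to sketch nested CV in scikit-learn is to wrap a GridSearchCV (the inner loop) inside cross_val_score (the outer loop); the grid, fold counts, and dataset sizes below are illustrative choices, not prescriptions:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Inner loop: hyperparameter search over an example grid
param_grid = {'max_depth': [3, 5, None]}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=42),
                      param_grid, cv=inner_cv, scoring='accuracy')

# Outer loop: unbiased performance estimate of the whole tuning procedure
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV accuracy: {nested_scores.mean():.4f} "
      f"(+/- {nested_scores.std() * 2:.4f})")
```

Each outer fold runs its own full hyperparameter search on its training portion only, so the outer score estimates the performance of the tuning procedure rather than of one lucky hyperparameter choice.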

Practical Guidelines

  • Tabular classification — Use stratified K-fold with K = 5 or 10
  • Time series — Always use TimeSeriesSplit or a custom temporal split
  • Grouped data — Use GroupKFold
  • Small datasets — Consider repeated K-fold (running K-fold multiple times with different random splits)
  • Hyperparameter tuning — Pair cross-validation with grid search or random search using nested CV
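The repeated K-fold idea mentioned for small datasets can be sketched with scikit-learn's RepeatedStratifiedKFold; the model and dataset here are placeholders chosen to keep the example fast:

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)

# 5-fold CV repeated 3 times with different shuffles -> 15 scores total
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(model, X, y, cv=rskf, scoring='accuracy')
print(f"{len(scores)} scores, mean={scores.mean():.4f}, std={scores.std():.4f}")
```

Averaging over several independent shufflings smooths out the luck of any single partitioning, which matters most when each fold is small.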