Model Selection

Learn cross-validation strategies, hyperparameter tuning with grid and randomized search, and how to choose the right evaluation metric for your problem.

Cross-Validation

Cross-validation provides a robust estimate of model performance by training and evaluating on different subsets of data:

Python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Basic 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Stratified K-Fold (preserves class distribution)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

# Get predictions for each fold (useful for stacking)
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(model, X, y, cv=5)
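The out-of-fold predictions from cross_val_predict can serve as meta-features for a stacked model, since each prediction comes from a model that never saw that row during training. A minimal sketch on synthetic data (the make_classification dataset and the two base models here are illustrative assumptions, not part of the original):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)

# Out-of-fold class-1 probabilities from two base models
base_models = [
    DecisionTreeClassifier(random_state=42),
    LogisticRegression(max_iter=1000),
]
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Train a meta-model on the out-of-fold predictions
meta_model = LogisticRegression().fit(meta_features, y)
print(meta_features.shape)  # one column per base model: (200, 2)
```

Because the meta-features are out-of-fold, the meta-model is not trained on leaked in-sample predictions.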

Grid Search

Python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define parameter grid
param_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["rbf", "linear"],
    "gamma": ["scale", "auto", 0.01, 0.1]
}

# Exhaustive search over all combinations
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(f"Test score: {grid.score(X_test, y_test):.3f}")

Randomized Search

Python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

# Define distributions for random sampling
param_dist = {
    "n_estimators": randint(50, 500),      # integers in [50, 500)
    "max_depth": randint(3, 20),
    "min_samples_split": randint(2, 20),
    "learning_rate": uniform(0.01, 0.3)    # floats in [0.01, 0.31]: uniform(loc, scale) covers [loc, loc + scale]
}

# Sample 100 random combinations (faster than grid search)
random_search = RandomizedSearchCV(
    model, param_dist, n_iter=100, cv=5,
    scoring="accuracy", random_state=42, n_jobs=-1
)
random_search.fit(X_train, y_train)

Evaluation Metrics

| Metric   | Use Case                         | sklearn Scoring               |
|----------|----------------------------------|-------------------------------|
| Accuracy | Balanced classes                 | "accuracy"                    |
| F1 Score | Imbalanced classes               | "f1" or "f1_weighted"         |
| ROC AUC  | Binary classification ranking    | "roc_auc"                     |
| R²       | Regression                       | "r2"                          |
| RMSE     | Regression (interpretable units) | "neg_root_mean_squared_error" |
| Log Loss | Probabilistic classification     | "neg_log_loss"                |

Grid vs Random: With the same computational budget, randomized search usually explores the hyperparameter space more effectively than grid search. Because each parameter is sampled independently, it tries many distinct values for every parameter, rather than spending most evaluations on exhaustive combinations of parameters that barely affect performance.
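The "neg_" scorers in the table follow scikit-learn's convention that higher scores are always better, so error metrics are returned negated; flip the sign to recover the usual value. A short sketch on a synthetic regression problem (the Ridge model and make_regression data are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, noise=10, random_state=42)

# Error metrics are negated so "higher is better" holds for all scorers;
# negate the mean to get RMSE in the target's own units
neg_rmse = cross_val_score(Ridge(), X, y, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse = -neg_rmse.mean()

r2 = cross_val_score(Ridge(), X, y, cv=5, scoring="r2").mean()
print(f"RMSE: {rmse:.2f}, R²: {r2:.3f}")
```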

Next: Pipelines

Learn how to chain preprocessing and modeling steps into reproducible, leak-free ML workflows.
