Intermediate
Model Selection
Learn cross-validation strategies, hyperparameter tuning with grid and randomized search, and choose the right evaluation metrics for your problem.
Cross-Validation
Cross-validation provides a robust estimate of model performance by training and evaluating on different subsets of data:
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Basic 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Stratified K-Fold (preserves class distribution in each fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

# Get out-of-fold predictions (useful for stacking)
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(model, X, y, cv=5)
```
Grid Search
Grid search exhaustively evaluates every combination in a parameter grid using cross-validation:
```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define parameter grid
param_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["rbf", "linear"],
    "gamma": ["scale", "auto", 0.01, 0.1]
}

# Exhaustive search over all combinations
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(f"Test score: {grid.score(X_test, y_test):.3f}")
```
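Beyond `best_params_`, the fitted search object exposes the score of every combination it tried in `cv_results_`. A minimal sketch of inspecting that ranking; the small grid and the `make_classification` dataset here are stand-ins for your own model and data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data just for illustration
X, y = make_classification(n_samples=200, random_state=42)

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}, cv=3)
grid.fit(X, y)

# cv_results_ holds per-combination scores; sort by rank to see the leaderboard
results = pd.DataFrame(grid.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(top.head())
```

Sorting by `rank_test_score` also shows how close the runner-up combinations are, which helps when deciding whether a simpler setting is "good enough".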
Randomized Search
Randomized search draws a fixed number of parameter combinations from distributions, so continuous ranges can be sampled directly instead of discretized:
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

# Define distributions for random sampling
param_dist = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(3, 20),
    "min_samples_split": randint(2, 20),
    "learning_rate": uniform(0.01, 0.3)
}

# Sample 100 random combinations (faster than an exhaustive grid)
random_search = RandomizedSearchCV(
    model, param_dist, n_iter=100, cv=5,
    scoring="accuracy", random_state=42, n_jobs=-1
)
random_search.fit(X_train, y_train)
```
Evaluation Metrics
| Metric | Use Case | sklearn Scoring |
|---|---|---|
| Accuracy | Balanced classes | "accuracy" |
| F1 Score | Imbalanced classes | "f1" or "f1_weighted" |
| ROC AUC | Binary classification ranking | "roc_auc" |
| R² | Regression | "r2" |
| RMSE | Regression (interpretable units) | "neg_root_mean_squared_error" |
| Log Loss | Probabilistic classification | "neg_log_loss" |
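Note the `neg_` prefix on the error-based scorers in the table: scikit-learn's scoring API always treats larger values as better, so error metrics are negated. A short sketch of recovering RMSE in the target's units; `make_regression` here is just a placeholder dataset:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data for illustration
X, y = make_regression(n_samples=100, noise=10, random_state=0)

# "neg_root_mean_squared_error" returns negated RMSE so larger = better;
# flip the sign to report the error in the target's own units
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
rmse = -scores.mean()
print(f"RMSE: {rmse:.2f}")
```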
Grid vs Random: With the same computational budget, randomized search usually finds hyperparameters as good as or better than grid search. Because each trial samples every parameter independently, it tries many distinct values of the parameters that matter, whereas a grid spends most of its budget repeating the same few values across combinations of parameters that don't.
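The budget difference is easy to see without fitting anything, using `ParameterGrid` and `ParameterSampler` to enumerate what each search would run. The grid below mirrors the SVC grid above; the continuous distributions are illustrative:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler
from scipy.stats import uniform

# Grid search must run every combination: 4 * 2 * 4 = 32 fits (times cv folds)
param_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["rbf", "linear"],
    "gamma": ["scale", "auto", 0.01, 0.1]
}
print(len(ParameterGrid(param_grid)))

# Randomized search runs a fixed budget, sampling continuous ranges directly
param_dist = {"C": uniform(0.1, 100), "gamma": uniform(1e-3, 1)}
samples = list(ParameterSampler(param_dist, n_iter=10, random_state=0))
print(len(samples))
```

Adding one more value to any grid axis multiplies the total fit count, while the randomized budget (`n_iter`) stays whatever you set it to.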
Next: Pipelines
Learn how to chain preprocessing and modeling steps into reproducible, leak-free ML workflows.