Ensemble Methods (7)
Combining multiple models for superior performance
Ensemble methods combine multiple base learners to produce a model that is often more accurate and robust than any of its individual members. The key insight: a group of weak learners, each only slightly better than chance, can form a strong learner, provided their errors are not strongly correlated. Ensembles frequently win ML competitions on tabular data and are widely used in production ML systems.
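The "weak learners form a strong learner" claim can be checked with a little arithmetic. The sketch below assumes an idealized setting that real ensembles only approximate: every model is correct with the same probability p, and the models' errors are fully independent. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models independent
    classifiers, each correct with probability p, votes correctly."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# Use odd ensemble sizes so that ties cannot occur.
for n in (1, 3, 11, 101):
    print(f"{n:>3} models at p=0.6 -> {majority_vote_accuracy(n, 0.6):.3f}")
```

With three models at p = 0.6 the majority is right with probability 0.648, and the figure keeps rising as models are added. In practice errors are correlated, so the gain is smaller than this idealized calculation suggests.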
Ensemble Strategies Overview
| Strategy | How It Works | Reduces | Examples |
|---|---|---|---|
| Bagging | Train models on bootstrap samples in parallel, aggregate by voting/averaging | Variance | Random Forest, Bagging Classifier |
| Boosting | Train models sequentially, each correcting previous errors | Bias | AdaBoost, Gradient Boosting, XGBoost |
| Stacking | Train a meta-learner on predictions of base models | Both | Stacking Classifier/Regressor |
| Voting | Combine predictions of different model types by voting/averaging | Variance | Voting Classifier/Regressor |
1. Bagging (Bootstrap Aggregating)
Description: Creates multiple bootstrap samples (random sampling with replacement) from the training data, trains a separate base model on each sample, then combines predictions by majority voting (classification) or averaging (regression). Averaging over many models reduces variance and, with it, overfitting.
When to Use
- When your base model has high variance (e.g., deep decision trees)
- When you want to reduce overfitting without simplifying the base model
- When you can afford the computational cost of training multiple models
How It Works
- Create T bootstrap samples by randomly sampling n points with replacement
- Train one base model on each bootstrap sample
- Combine predictions: vote (classification) or average (regression)
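- Out-of-bag (OOB) samples, the points left out of a given bootstrap sample (about 37% on average when drawing n of n points), can be used for validation without a separate hold-out set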
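The steps above can be sketched by hand. This is a minimal illustration rather than a replacement for a library implementation; the ensemble size `T = 25`, the variable names, and the choice of unpruned decision trees are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n, T = len(X_tr), 25  # T is odd, so a binary majority vote cannot tie
votes = np.zeros((T, len(X_te)), dtype=int)
oob_fractions = []

for t in range(T):
    # Step 1: bootstrap sample of n row indices, drawn with replacement
    idx = rng.integers(0, n, size=n)
    # Rows that were never drawn are "out-of-bag" for this model
    oob_fractions.append(1 - len(np.unique(idx)) / n)
    # Step 2: train one base model on the bootstrap sample
    tree = DecisionTreeClassifier(random_state=t).fit(X_tr[idx], y_tr[idx])
    votes[t] = tree.predict(X_te)

# Step 3: majority vote across the T models (labels are 0/1)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_acc = np.mean(bagged_pred == y_te)

print(f"Bagged accuracy: {bagged_acc:.4f}")
print(f"Mean OOB fraction: {np.mean(oob_fractions):.3f}")
```

The mean OOB fraction should come out close to 1/e, which is why each model has roughly a third of the training data available as a free validation set.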
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Bagging with Decision Trees
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,    # 80% of data per bootstrap sample
    max_features=0.8,   # 80% of features per model
    bootstrap=True,
    oob_score=True,     # Use out-of-bag samples for scoring
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)
print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
print(f"OOB Score: {model.oob_score_:.4f}")
2. Boosting (General Concept)
Description: A family of algorithms that sequentially train weak learners, with each new learner focusing on the mistakes of the previous ensemble. The key idea is to convert weak learners (slightly better than random) into a strong learner. Different boosting variants differ in how they weight errors and combine predictions.
When to Use
- When you need to reduce bias (underfitting)
- When you have a weak base learner you want to improve
- When accuracy is more important than interpretability
- For structured/tabular data competitions
Boosting vs. Bagging
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Reduces | Variance | Bias (and variance) |
| Base learners | Complex, low-bias (deep trees) | Weak, high-bias (stumps/shallow trees) |
| Overfitting risk | Lower | Higher (needs tuning) |
| Weighting | Equal weight | Weighted by performance |
# General boosting example with HistGradientBoosting (sklearn's fast boosting)
from sklearn.ensemble import HistGradientBoostingClassifier
model = HistGradientBoostingClassifier(
    max_iter=200,
    learning_rate=0.1,
    max_depth=5,
    min_samples_leaf=20,
    random_state=42
)
model.fit(X_train, y_train)
print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
3. Random Forest
Description: A specific type of bagging that uses decision trees as base learners and adds an extra layer of randomness: at each split, only a random subset of features is considered. This decorrelates the trees, making the ensemble more robust than plain bagging of trees.
When to Use
- General-purpose classification and regression
- When you need feature importance rankings
- When you want a model that works well with minimal tuning
- When you need robustness to outliers and noise
Key Innovations Over Bagging
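- Feature randomness: Each split considers only a random subset of the p features; common choices are sqrt(p) for classification and p/3 for regression (note that scikit-learn's RandomForestRegressor considers all features by default)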
- Tree decorrelation: Trees become more independent, improving variance reduction
- OOB error: Built-in validation using out-of-bag samples
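One way to see the decorrelation effect is to measure how similar the individual trees' predictions are, with and without feature subsampling. The sketch below is an illustrative experiment, not a standard diagnostic: the helper `mean_tree_correlation` is hypothetical, and the average pairwise Pearson correlation of per-tree test predictions is just one possible proxy for tree similarity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

def mean_tree_correlation(max_features):
    """Average pairwise correlation between the trees' test predictions."""
    forest = RandomForestClassifier(
        n_estimators=50, max_features=max_features, random_state=42
    ).fit(X_tr, y_tr)
    # One row of 0/1 predictions per tree
    preds = np.array([tree.predict(X_te) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
    return off_diagonal.mean()

# max_features=None uses every feature at each split (plain bagged trees)
corr_bagging = mean_tree_correlation(None)
# max_features='sqrt' is the Random Forest setting for classification
corr_forest = mean_tree_correlation("sqrt")

print(f"Bagged trees  (all features):  {corr_bagging:.3f}")
print(f"Random Forest (sqrt features): {corr_forest:.3f}")
```

The Random Forest figure is typically the lower of the two, although the size of the gap depends on the dataset and the random seed.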
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,        # Grow full trees
    max_features='sqrt',   # sqrt(n_features) per split
    min_samples_leaf=2,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)
print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
print(f"OOB Score: {model.oob_score_:.4f}")
# Feature importance
import numpy as np
feature_names = load_breast_cancer().feature_names
importances = model.feature_importances_
top_idx = np.argsort(importances)[-5:][::-1]
for i in top_idx:
    print(f" {feature_names[i]}: {importances[i]:.4f}")
4. Gradient Boosting
Description: Builds the ensemble by fitting each new tree to the negative gradient of the loss function with respect to the current ensemble's predictions (the pseudo-residuals, which equal the ordinary residuals for squared-error loss). This approach, gradient descent in function space, is more general than AdaBoost and supports any differentiable loss function.
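The residual-fitting loop can be written out in a few lines for the squared-error case, where the negative gradient is simply the residual. This is a minimal sketch under stated assumptions: the synthetic sine data, the learning rate of 0.1, the 100 rounds, and the depth-2 trees are all choices made for the illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (an assumption for this example)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 100

# Start from a constant model: the mean minimizes squared error
pred = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(n_rounds):
    # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Take a small step in the direction the tree suggests (shrinkage)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - pred) ** 2))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

For other losses only the `residuals` line changes: it becomes the negative gradient of that loss evaluated at the current predictions.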
When to Use
- When maximum predictive accuracy is the goal
- Structured/tabular data
- When you can invest time in hyperparameter tuning
- Competitions (often the winning approach)
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,   # Shrinkage: smaller = more robust
    max_depth=3,          # Shallow trees as weak learners
    subsample=0.8,        # Stochastic gradient boosting
    min_samples_leaf=10,
    max_features='sqrt',
    random_state=42
)
model.fit(X_train, y_train)
print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
# Staged prediction (show performance vs n_estimators)
from sklearn.metrics import accuracy_score
staged_scores = [accuracy_score(y_test, pred)
                 for pred in model.staged_predict(X_test)]
print(f"Best at iteration {np.argmax(staged_scores)+1}: {max(staged_scores):.4f}")
5. AdaBoost (Adaptive Boosting)
Description: The first widely adopted boosting algorithm. Assigns equal weights to all training samples initially. After each weak learner is trained, it increases weights on misclassified samples and decreases weights on correctly classified ones. The final prediction is a weighted vote of all weak learners, with more accurate learners receiving larger weights.
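The reweighting scheme described above can be sketched from scratch for the binary case. This follows the classic discrete AdaBoost update with labels mapped to -1/+1; the 20 rounds, the variable names, and evaluating on the training data only are simplifications for the illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y01 = load_breast_cancer(return_X_y=True)
y = np.where(y01 == 1, 1, -1)  # AdaBoost's update is simplest with -1/+1 labels

n_rounds = 20
w = np.full(len(y), 1 / len(y))  # start with equal sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error of this stump (clipped to keep the log finite)
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
    # Learner weight: larger when the weighted error is smaller
    alpha = 0.5 * np.log((1 - err) / err)

    # y * pred is +1 when correct and -1 when wrong, so misclassified
    # samples are scaled up and correctly classified ones scaled down
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_acc = np.mean(np.where(F >= 0, 1, -1) == y)
print(f"Training accuracy after {n_rounds} rounds: {train_acc:.4f}")
```

Because the weights concentrate on the samples that remain hard to classify, mislabeled points and outliers can end up dominating later rounds.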
When to Use
- When you have a simple base learner to boost
- Binary classification problems
- When you want a simple boosting method with few hyperparameters and your labels are reasonably clean (AdaBoost is sensitive to label noise and outliers)
- Face detection (historically important: Viola-Jones algorithm)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Decision stumps
    n_estimators=200,
    learning_rate=0.1,
    algorithm='SAMME',  # Deprecated in recent scikit-learn releases; omit this argument there
    random_state=42
)
model.fit(X_train, y_train)
print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
# Show how each estimator contributes
print(f"Estimator weights (first 10): {model.estimator_weights_[:10].round(3)}")
print(f"Estimator errors (first 10): {model.estimator_errors_[:10].round(3)}")
6. Stacking (Stacked Generalization)
Description: Uses predictions from multiple diverse base models as input features for a meta-learner (second-level model). The meta-learner learns how best to combine the base models' predictions. To avoid data leakage, the predictions used to train the meta-learner are generated out-of-fold with cross-validation, so each one comes from a base model that never saw that sample during training.
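The out-of-fold mechanism can be made explicit with `cross_val_predict`. This is a simplified sketch of the idea, not a description of how any particular library implements stacking internally; the two base models and the variable names are choices made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

base = [
    DecisionTreeClassifier(max_depth=5, random_state=42),
    KNeighborsClassifier(n_neighbors=5),
]

# Level-0: out-of-fold probabilities. Each training row is predicted by a
# model fitted on the other folds, so no row is scored by a model that saw it.
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base
])

# Level-1: the meta-learner is trained on those out-of-fold predictions
meta = LogisticRegression().fit(oof, y_tr)

# At prediction time the base models are refit on all training data
test_features = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base
])
stack_acc = meta.score(test_features, y_te)
print(f"Manual stacking accuracy: {stack_acc:.4f}")
```

Training the meta-learner on in-sample predictions instead would reward whichever base model overfits most, which is the leakage the cross-validation step prevents.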
When to Use
- When you want to combine fundamentally different model types
- When individual models have complementary strengths
- Competition settings where marginal improvements matter
- When you have enough data for reliable cross-validation
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
# Define base models (level-0)
base_models = [
    ('dt', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('svc', SVC(kernel='rbf', probability=True, random_state=42)),
]
# Define meta-learner (level-1)
meta_learner = LogisticRegression(max_iter=1000)
model = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_learner,
    cv=5,                  # Cross-validation folds for base predictions
    stack_method='auto',   # Use predict_proba if available
    n_jobs=-1
)
model.fit(X_train, y_train)
print(f"Stacking Accuracy: {model.score(X_test, y_test):.4f}")
# Compare with individual models
# (StackingClassifier fits clones, so the originals must be fitted here)
for name, est in base_models:
    est.fit(X_train, y_train)
    print(f" {name} alone: {est.score(X_test, y_test):.4f}")
7. Voting Classifier
Description: Combines predictions from multiple different model types. Hard voting uses majority vote on predicted classes. Soft voting averages predicted probabilities, which is often more accurate when the models produce well-calibrated probabilities. Unlike stacking, voting does not train a meta-learner; it simply aggregates.
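A small numeric example shows how hard and soft voting can disagree on the same sample. The three probabilities below are made-up values for illustration, as are the weights.

```python
import numpy as np

# Predicted probability of class 1 from three hypothetical models
proba_class1 = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts one vote for its predicted class
hard_votes = (proba_class1 >= 0.5).astype(int)             # [0, 0, 1]
hard_pred = int(hard_votes.sum() > len(hard_votes) / 2)    # class 0 wins 2 to 1

# Soft voting: average the probabilities, then threshold
soft_mean = proba_class1.mean()                            # 0.60
soft_pred = int(soft_mean >= 0.5)                          # class 1

# Weighted soft voting: the second model counts twice
weights = np.array([1, 2, 1])
weighted_mean = np.average(proba_class1, weights=weights)  # 0.5625
weighted_pred = int(weighted_mean >= 0.5)                  # class 1

print(f"hard={hard_pred}, soft={soft_pred}, weighted soft={weighted_pred}")
```

Two models lean weakly towards class 0 while one is confident in class 1. Hard voting discards that confidence; soft voting keeps it, which is why it depends on the probabilities being trustworthy.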
When to Use
- When you have several well-tuned but different models
- Quick ensemble without the complexity of stacking
- When models make different types of errors (low correlation in predictions)
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
# Hard Voting
model_hard = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('svc', SVC(kernel='rbf', random_state=42)),
    ],
    voting='hard',
    n_jobs=-1
)
model_hard.fit(X_train, y_train)
print(f"Hard Voting Accuracy: {model_hard.score(X_test, y_test):.4f}")
# Soft Voting (requires probability support)
model_soft = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('svc', SVC(kernel='rbf', probability=True, random_state=42)),
    ],
    voting='soft',
    weights=[1, 2, 1],   # Give more weight to Random Forest
    n_jobs=-1
)
model_soft.fit(X_train, y_train)
print(f"Soft Voting Accuracy: {model_soft.score(X_test, y_test):.4f}")