
Ensemble Methods (7)

Combining multiple models for superior performance

Ensemble methods combine multiple base learners to produce a model that is typically more accurate and robust than any of its individual members. The key insight: a group of weak learners, each only slightly better than chance, can be combined into a strong learner. Ensembles frequently win ML competitions and power many production ML systems.
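
The "weak learners form a strong learner" idea can be illustrated with a small calculation. The sketch below assumes T classifiers that each have accuracy p and make their errors independently of one another. This is an idealization: real models are correlated, so actual gains are smaller. The function name majority_vote_accuracy is illustrative only.

```python
from math import comb

def majority_vote_accuracy(p, T):
    """Probability that a majority of T independent voters is correct,
    when each voter is correct with probability p (T should be odd)."""
    majority = T // 2 + 1
    return sum(comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(majority, T + 1))

# Each voter is only slightly better than chance (p = 0.6)
for T in (1, 11, 101):
    print(f"T={T:>3}: majority-vote accuracy = {majority_vote_accuracy(0.6, T):.3f}")
```

Under the independence assumption the majority-vote accuracy rises steadily with T, which is why the strategies below work to keep base models diverse.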

Ensemble Strategies Overview

Strategy	How It Works	Reduces	Examples
Bagging	Train models on bootstrap samples in parallel, aggregate by voting/averaging	Variance	Random Forest, Bagging Classifier
Boosting	Train models sequentially, each correcting previous errors	Bias	AdaBoost, Gradient Boosting, XGBoost
Stacking	Train a meta-learner on predictions of base models	Bias and variance	Stacking Classifier/Regressor
Voting	Combine predictions of different model types by voting/averaging	Variance	Voting Classifier/Regressor

1. Bagging (Bootstrap Aggregating)

Description: Creates multiple bootstrap samples (random sampling with replacement) from the training data, trains a separate base model on each sample, then combines predictions by majority voting (classification) or averaging (regression). Reduces variance and overfitting.

When to Use

When your base model has high variance (e.g., deep decision trees)
When you want to reduce overfitting without simplifying the base model
When you can afford the computational cost of training multiple models

How It Works

Create T bootstrap samples by randomly sampling n points with replacement
Train one base model on each bootstrap sample
Combine predictions: vote (classification) or average (regression)
Out-of-bag (OOB) samples, the points left out of a given bootstrap sample, can be used to estimate generalization error without a separate validation set
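
The sampling step above can be sketched with the standard library alone. This is a minimal illustration, not how any library implements bagging; the helper name bootstrap_indices and the sizes n and T are assumptions chosen for the example. It shows that each bootstrap sample leaves a sizable share of points out of bag, roughly (1 - 1/n)^n, which approaches 1/e (about 0.368) for large n.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(0)
n, T = 1000, 50          # n training points, T bootstrap samples

oob_fractions = []
for _ in range(T):
    in_bag, out_of_bag = bootstrap_indices(n, rng)
    oob_fractions.append(len(out_of_bag) / n)

mean_oob = sum(oob_fractions) / T
print(f"Mean OOB fraction over {T} bootstrap samples: {mean_oob:.3f}")
```

Because every point is out of bag for some of the models, it can be scored by just those models, which is what the OOB estimate does.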

from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging with Decision Trees
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,         # 80% of data per bootstrap sample
    max_features=0.8,        # 80% of features per model
    bootstrap=True,
    oob_score=True,          # Use out-of-bag samples for scoring
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)

print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
print(f"OOB Score: {model.oob_score_:.4f}")

2. Boosting (General Concept)

Description: A family of algorithms that sequentially train weak learners, with each new learner focusing on the mistakes of the previous ensemble. The key idea is to convert weak learners (slightly better than random) into a strong learner. Different boosting variants differ in how they weight errors and combine predictions.

When to Use

When you need to reduce bias (underfitting)
When you have a weak base learner you want to improve
When accuracy is more important than interpretability
For structured/tabular data competitions

Boosting vs. Bagging

Aspect	Bagging	Boosting
Training	Parallel	Sequential
Reduces	Variance	Bias (and variance)
Base learners	Low-bias, high-variance (deep trees)	Weak, high-bias (stumps/shallow trees)
Overfitting risk	Lower	Higher (needs tuning)
Weighting	Equal weight	Weighted by performance

# General boosting example with HistGradientBoosting (sklearn's fast boosting)
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    max_iter=200,
    learning_rate=0.1,
    max_depth=5,
    min_samples_leaf=20,
    random_state=42
)
model.fit(X_train, y_train)

print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")

3. Random Forest

Description: A specific type of bagging that uses decision trees as base learners and adds an extra layer of randomness: at each split, only a random subset of features is considered. This decorrelates the trees, making the ensemble more robust than plain bagging of trees.

When to Use

General-purpose classification and regression
When you need feature importance rankings
When you want a model that works well with minimal tuning
When you need robustness to outliers and noise

Key Innovations Over Bagging

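Feature randomness: Each split considers only a random subset of the p features. Common heuristics are sqrt(p) for classification and p/3 for regression (note that scikit-learn's RandomForestRegressor considers all features by default)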
Tree decorrelation: Trees become more independent, improving variance reduction
OOB error: Built-in validation using out-of-bag samples
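
Why decorrelation matters can be seen from a standard result: for T identically distributed predictors with variance sigma^2 and pairwise correlation rho, the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / T. The sketch below evaluates that formula for a few values. The function name ensemble_variance and the chosen rho values are illustrative assumptions, not measurements from a real forest.

```python
def ensemble_variance(sigma2, rho, T):
    """Variance of the mean of T predictors, each with variance sigma2
    and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / T

for rho in (0.0, 0.3, 0.7):
    row = ", ".join(f"T={T}: {ensemble_variance(1.0, rho, T):.3f}"
                    for T in (1, 10, 100, 1000))
    print(f"rho={rho}: {row}")
```

Adding trees only shrinks the second term. The first term, rho * sigma^2, is a floor that can be lowered only by making the trees less correlated, which is what per-split feature subsampling aims to do.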

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,          # Grow full trees
    max_features='sqrt',     # sqrt(n_features) per split
    min_samples_leaf=2,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)

print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
print(f"OOB Score: {model.oob_score_:.4f}")

# Feature importance
import numpy as np
feature_names = load_breast_cancer().feature_names
importances = model.feature_importances_
top_idx = np.argsort(importances)[-5:][::-1]
for i in top_idx:
    print(f"  {feature_names[i]}: {importances[i]:.4f}")

4. Gradient Boosting

Description: Builds the ensemble by fitting each new tree to the negative gradient of the loss function with respect to the current ensemble's predictions (the pseudo-residuals, which equal the ordinary residuals for squared-error loss). This amounts to gradient descent in function space; it is more general than AdaBoost and supports any differentiable loss function.
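
The fitting loop can be sketched from scratch for the simplest case. The example below assumes squared-error loss (so the negative gradient is just the residual y - F(x)), one-dimensional inputs, and depth-1 regression stumps. The helper names fit_stump and mse, the toy sine data, and the learning rate are all illustrative choices rather than anything prescribed by a library.

```python
import math

def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on 1-D inputs by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:      # keeps both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy regression problem: y = sin(x) on [0, 3.9]
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]

learning_rate = 0.1
preds = [sum(ys) / len(ys)] * len(xs)           # F_0: constant prediction
initial_mse = mse(ys, preds)

mse_history = []
for _ in range(100):
    # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    # Take a small step in function space: F_m = F_{m-1} + lr * h_m
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    mse_history.append(mse(ys, preds))

print(f"Training MSE: {initial_mse:.4f} -> {mse_history[-1]:.4f}")
```

For other losses, only the residuals line changes: it would compute the negative gradient of that loss instead.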

When to Use

When maximum predictive accuracy is the goal
Structured/tabular data
When you can invest time in hyperparameter tuning
Competitions (often the winning approach)

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,       # Shrinkage: smaller values usually generalize better but need more trees
    max_depth=3,              # Shallow trees as weak learners
    subsample=0.8,            # Stochastic gradient boosting
    min_samples_leaf=10,
    max_features='sqrt',
    random_state=42
)
model.fit(X_train, y_train)

print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")

# Staged prediction (show performance vs n_estimators)
import numpy as np
from sklearn.metrics import accuracy_score
staged_scores = [accuracy_score(y_test, pred)
                 for pred in model.staged_predict(X_test)]
# Note: picking the best iteration on the test set is optimistic; use a validation split in practice
print(f"Best at iteration {np.argmax(staged_scores)+1}: {max(staged_scores):.4f}")

5. AdaBoost (Adaptive Boosting)

Description: One of the earliest and most influential boosting algorithms. It assigns equal weights to all training samples initially. After each weak learner is trained, it increases the weights of misclassified samples and decreases the weights of correctly classified ones, so the next learner concentrates on the hard cases. The final prediction is a weighted vote of all weak learners, with more accurate learners receiving larger weights.
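
One round of the reweighting can be worked through by hand. The sketch below follows the textbook discrete AdaBoost update for labels in {-1, +1}, with learner weight alpha = 0.5 * ln((1 - err) / err). Library implementations may scale alpha differently, and the function name adaboost_round and the toy labels are illustrative assumptions.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost for labels in {-1, +1}.
    Assumes the weights sum to 1 and the weighted error is strictly between 0 and 1."""
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    alpha = 0.5 * math.log((1 - err) / err)
    # y * h is +1 for correct predictions (weight shrinks) and -1 for mistakes (weight grows)
    new_w = [w * math.exp(-alpha * y * h)
             for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, -1, 1]   # 3 of 10 misclassified
weights = [0.1] * 10                             # equal initial weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
misclassified_mass = sum(w for w, y, h in zip(new_weights, y_true, y_pred) if y != h)

print(f"alpha = {alpha:.4f}")
print(f"weight on misclassified samples after update: {misclassified_mass:.4f}")
```

After the update the misclassified samples carry half of the total weight, so the learner just trained is no better than chance on the reweighted data and the next learner has to find something new.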

When to Use

When you have a simple base learner to boost
Binary classification problems
When you want a simple boosting method with few hyperparameters (note that AdaBoost is sensitive to label noise and outliers, so it is not immune to overfitting)
Face detection (historically important: Viola-Jones algorithm)

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Decision stumps
    n_estimators=200,
    learning_rate=0.1,
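    algorithm='SAMME',       # Discrete AdaBoost; this argument is deprecated in newer scikit-learn versions, so omit it if a FutureWarning appears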
    random_state=42
)
model.fit(X_train, y_train)

print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")

# Show how each estimator contributes
print(f"Estimator weights (first 10): {model.estimator_weights_[:10].round(3)}")
print(f"Estimator errors (first 10): {model.estimator_errors_[:10].round(3)}")

6. Stacking (Stacked Generalization)

Description: Uses predictions from multiple diverse base models as input features for a meta-learner (second-level model), which learns how best to combine them. The meta-learner is trained on out-of-fold predictions produced by cross-validation, so it never sees predictions that a base model made on its own training data; this avoids data leakage.
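
The leakage point can be made concrete with a deliberately extreme base learner. The sketch below uses a 1-nearest-neighbor model that memorizes its training data, on toy labels with no learnable pattern. The helper names kfold_indices, fit_1nn, and out_of_fold_predictions are hypothetical, and the contiguous fold layout is a simplification of what a library would do.

```python
def kfold_indices(n, k):
    """Yield (train_idx, holdout_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, holdout
        start += size

def fit_1nn(X_train, y_train):
    """Memorizing base learner: predict the label of the nearest training point."""
    def predict(x):
        nearest = min(range(len(X_train)), key=lambda j: abs(X_train[j] - x))
        return y_train[nearest]
    return predict

def out_of_fold_predictions(fit, X, y, k=5):
    """Predict each point with a model that was not trained on it."""
    oof = [None] * len(X)
    for train, holdout in kfold_indices(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in holdout:
            oof[i] = predict(X[i])
    return oof

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

X = list(range(20))
y = [i % 2 for i in X]            # alternating labels: nothing generalizable

predict_all = fit_1nn(X, y)
in_sample = [predict_all(x) for x in X]
oof = out_of_fold_predictions(fit_1nn, X, y, k=5)

in_sample_acc = accuracy(y, in_sample)
oof_acc = accuracy(y, oof)
print(f"In-sample accuracy:   {in_sample_acc:.2f}")
print(f"Out-of-fold accuracy: {oof_acc:.2f}")
```

A meta-learner trained on the in-sample column would conclude that this base model is perfect. Training it on the out-of-fold column gives it an honest picture of how the base model behaves on unseen data.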

When to Use

When you want to combine fundamentally different model types
When individual models have complementary strengths
Competition settings where marginal improvements matter
When you have enough data for reliable cross-validation

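from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Define base models (level-0)
# KNN and SVC are distance-based, so each is wrapped in a scaling pipeline
base_models = [
    ('dt', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ('svc', make_pipeline(StandardScaler(), SVC(kernel='rbf', probability=True, random_state=42))),
]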

# Define meta-learner (level-1)
meta_learner = LogisticRegression(max_iter=1000)

model = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_learner,
    cv=5,                    # Cross-validation folds for base predictions
    stack_method='auto',     # Use predict_proba if available
    n_jobs=-1
)
model.fit(X_train, y_train)

print(f"Stacking Accuracy: {model.score(X_test, y_test):.4f}")

# Compare with individual models
for name, est in base_models:
    est.fit(X_train, y_train)
    print(f"  {name} alone: {est.score(X_test, y_test):.4f}")

7. Voting Classifier

Description: Combines predictions from multiple different model types. Hard voting takes a majority vote on predicted classes. Soft voting averages predicted probabilities, which often works better when the base models produce well-calibrated probabilities. Unlike stacking, voting does not train a meta-learner; it simply aggregates.
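
The two voting rules can disagree on the same inputs, as the small binary example below shows. The probabilities are made up for illustration, and hard_vote and soft_vote are hypothetical helper names, not library functions.

```python
def hard_vote(probas, threshold=0.5):
    """Majority vote over each model's thresholded class decision (binary case)."""
    votes = [1 if p >= threshold else 0 for p in probas]
    return 1 if sum(votes) > len(votes) / 2 else 0

def soft_vote(probas, weights=None, threshold=0.5):
    """Threshold the (optionally weighted) mean of predicted probabilities."""
    weights = weights or [1.0] * len(probas)
    avg = sum(w * p for w, p in zip(weights, probas)) / sum(weights)
    return (1 if avg >= threshold else 0), avg

# P(class 1) from three models: two are mildly against, one is strongly in favor
probas = [0.45, 0.45, 0.90]

hard_label = hard_vote(probas)
soft_label, soft_avg = soft_vote(probas)

print(f"Hard vote: class {hard_label}")
print(f"Soft vote: class {soft_label} (mean probability {soft_avg:.2f})")
```

Hard voting discards how confident each model is, so the two lukewarm votes win. Soft voting lets the single confident model outweigh them, which helps only if that confidence is trustworthy.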

When to Use

When you have several well-tuned but different models
Quick ensemble without the complexity of stacking
When models make different types of errors (low correlation in predictions)

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hard Voting
# Logistic regression and SVC are scale-sensitive, so both get a StandardScaler
model_hard = VotingClassifier(
    estimators=[
        ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('svc', make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=42))),
    ],
    voting='hard',
    n_jobs=-1
)
model_hard.fit(X_train, y_train)
print(f"Hard Voting Accuracy: {model_hard.score(X_test, y_test):.4f}")

# Soft Voting (every estimator must implement predict_proba)
model_soft = VotingClassifier(
    estimators=[
        ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('svc', make_pipeline(StandardScaler(), SVC(kernel='rbf', probability=True, random_state=42))),
    ],
    voting='soft',
    weights=[1, 2, 1],      # Give more weight to Random Forest
    n_jobs=-1
)
model_soft.fit(X_train, y_train)
print(f"Soft Voting Accuracy: {model_soft.score(X_test, y_test):.4f}")
