Ensemble Methods (7)

Combining multiple models for superior performance

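Ensemble methods combine multiple base learners to produce a model that is typically more accurate and robust than any of its members. The key insight: a group of weak learners, each only slightly better than chance, can form a strong learner, provided their errors are not strongly correlated. Ensembles frequently win ML competitions, particularly on tabular data, and are widely used in production ML systems.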
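
The "weak learners form a strong learner" idea can be checked with a little arithmetic. The sketch below assumes an idealized setting that real ensembles only approximate: every model is correct with the same probability p, and the models err independently. The helper name majority_vote_accuracy is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent voters,
    each correct with probability p, picks the right class (binary case)."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# Accuracy of the vote grows with ensemble size when p > 0.5
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions, five models that are each 60% accurate already vote their way to about 68%. Correlated errors shrink that gain, which is why the strategies below all try to make base models differ from one another.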

Ensemble Strategies Overview

| Strategy | How It Works | Reduces | Examples |
|----------|--------------|---------|----------|
| Bagging | Train models on bootstrap samples in parallel, aggregate by voting/averaging | Variance | Random Forest, Bagging Classifier |
| Boosting | Train models sequentially, each correcting previous errors | Bias | AdaBoost, Gradient Boosting, XGBoost |
| Stacking | Train a meta-learner on predictions of base models | Both | Stacking Classifier/Regressor |
| Voting | Combine predictions of different model types by voting/averaging | Variance | Voting Classifier/Regressor |

1. Bagging (Bootstrap Aggregating)

Description: Creates multiple bootstrap samples (random sampling with replacement) from the training data, trains a separate base model on each sample, then combines predictions by majority voting (classification) or averaging (regression). Reduces variance and overfitting.

When to Use

  • When your base model has high variance (e.g., deep decision trees)
  • When you want to reduce overfitting without simplifying the base model
  • When you can afford the computational cost of training multiple models

How It Works

  1. Create T bootstrap samples by randomly sampling n points with replacement
  2. Train one base model on each bootstrap sample
  3. Combine predictions: vote (classification) or average (regression)
  4. Optionally, score each model on its out-of-bag (OOB) samples, the points left out of its bootstrap sample, to estimate generalization error without a separate validation set
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging with Decision Trees
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.8,         # 80% of data per bootstrap sample
    max_features=0.8,        # 80% of features per model
    bootstrap=True,
    oob_score=True,          # Use out-of-bag samples for scoring
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)

print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
print(f"OOB Score: {model.oob_score_:.4f}")
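
The four steps under "How It Works" can also be written out by hand, which makes the bootstrap and out-of-bag bookkeeping visible. This is a minimal sketch, not how BaggingClassifier is implemented internally. The variable names (X_tr, oob_votes, and so on) and the choice of T = 25 trees are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n, T = len(X_tr), 25                 # T is odd, so a binary vote cannot tie
trees = []
oob_votes = np.zeros((n, 2))         # class votes from trees that never saw the row

for t in range(T):
    idx = rng.integers(0, n, size=n)                     # step 1: sample n rows with replacement
    tree = DecisionTreeClassifier(random_state=t)
    tree.fit(X_tr[idx], y_tr[idx])                       # step 2: one model per bootstrap sample
    trees.append(tree)
    oob = np.setdiff1d(np.arange(n), idx)                # rows not drawn for this tree
    oob_votes[oob, tree.predict(X_tr[oob])] += 1         # step 4: record out-of-bag votes

# step 3: majority vote over all trees on the held-out test set
votes = np.stack([tree.predict(X_te) for tree in trees])  # shape (T, n_test)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
test_acc = (y_pred == y_te).mean()

# OOB accuracy over rows that were out-of-bag at least once (vote ties go to class 0)
covered = oob_votes.sum(axis=1) > 0
oob_acc = (oob_votes[covered].argmax(axis=1) == y_tr[covered]).mean()

print(f"Manual bagging test accuracy: {test_acc:.4f}")
print(f"Manual OOB accuracy: {oob_acc:.4f}")
```

Each bootstrap sample leaves out roughly a third of the rows, so with 25 trees nearly every training row gets several out-of-bag votes.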

2. Boosting (General Concept)

Description: A family of algorithms that sequentially train weak learners, with each new learner focusing on the mistakes of the previous ensemble. The key idea is to convert weak learners (slightly better than random) into a strong learner. Different boosting variants differ in how they weight errors and combine predictions.

When to Use

  • When you need to reduce bias (underfitting)
  • When you have a weak base learner you want to improve
  • When accuracy is more important than interpretability
  • For structured/tabular data competitions

Boosting vs. Bagging

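| Aspect | Bagging | Boosting |
|--------|---------|----------|
| Training | Parallel | Sequential |
| Reduces | Variance | Bias (and variance) |
| Base learners | Low-bias, high-variance (deep trees) | Weak, high-bias (stumps/shallow trees) |
| Overfitting risk | Lower | Higher (needs tuning) |
| Weighting | Equal weight | Weighted by performance (AdaBoost) or scaled by a learning rate |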
# General boosting example with HistGradientBoosting (sklearn's fast boosting)
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    max_iter=200,
    learning_rate=0.1,
    max_depth=5,
    min_samples_leaf=20,
    random_state=42
)
model.fit(X_train, y_train)

print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")

3. Random Forest

Description: A specific type of bagging that uses decision trees as base learners and adds an extra layer of randomness: at each split, only a random subset of features is considered. This decorrelates the trees, making the ensemble more robust than plain bagging of trees.

When to Use

  • General-purpose classification and regression
  • When you need feature importance rankings
  • When you want a model that works well with minimal tuning
  • When you need robustness to outliers and noise

Key Innovations Over Bagging

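  • Feature randomness: Each split considers only a random subset of the p features, commonly sqrt(p) for classification and p/3 for regression (scikit-learn's RandomForestRegressor defaults to all features, so set max_features explicitly to get this behavior)
  • Tree decorrelation: Trees become less correlated with one another, so averaging them reduces variance more
  • OOB error: As with plain bagging, out-of-bag samples provide a built-in validation estimate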
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,          # Grow full trees
    max_features='sqrt',     # sqrt(n_features) per split
    min_samples_leaf=2,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
model.fit(X_train, y_train)

print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
print(f"OOB Score: {model.oob_score_:.4f}")

# Feature importance
import numpy as np
feature_names = load_breast_cancer().feature_names
importances = model.feature_importances_
top_idx = np.argsort(importances)[-5:][::-1]
for i in top_idx:
    print(f"  {feature_names[i]}: {importances[i]:.4f}")
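
Why decorrelating trees matters can be seen from the standard formula for the variance of an average of identically distributed predictors. The sketch below assumes every tree has the same variance sigma2 and every pair of trees has the same correlation rho, which is an idealization. The function name ensemble_variance is hypothetical.

```python
def ensemble_variance(sigma2: float, rho: float, n_trees: int) -> float:
    """Variance of the mean of n_trees predictors, each with variance sigma2
    and pairwise correlation rho: rho*sigma2 + (1 - rho)*sigma2/n_trees."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees

# Adding trees only shrinks the second term; the rho*sigma2 floor remains.
for rho in (0.9, 0.5, 0.1):
    print(f"rho={rho}: variance with 300 trees = {ensemble_variance(1.0, rho, 300):.4f}")
```

Adding more trees cannot push the variance below rho * sigma2, so lowering rho through per-split feature subsampling is what lets a random forest improve on plain bagged trees.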

4. Gradient Boosting

Description: Builds the ensemble sequentially by fitting each new tree to the negative gradient of the loss function with respect to the current ensemble's predictions. These targets are called pseudo-residuals; for squared-error loss they are exactly the ordinary residuals. Viewing boosting as gradient descent in function space makes it more general than AdaBoost: it supports any differentiable loss function.

When to Use

  • When maximum predictive accuracy is the goal
  • Structured/tabular data
  • When you can invest time in hyperparameter tuning
  • Competitions (often the winning approach)
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,       # Shrinkage: smaller = more robust
    max_depth=3,              # Shallow trees as weak learners
    subsample=0.8,            # Stochastic gradient boosting
    min_samples_leaf=10,
    max_features='sqrt',
    random_state=42
)
model.fit(X_train, y_train)

print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")

# Staged prediction (show performance vs n_estimators)
import numpy as np
from sklearn.metrics import accuracy_score
staged_scores = [accuracy_score(y_test, pred)
                 for pred in model.staged_predict(X_test)]
# Note: choosing the best iteration on the test set is optimistic; use a validation set for real model selection
print(f"Best at iteration {np.argmax(staged_scores)+1}: {max(staged_scores):.4f}")
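
The residual-fitting loop described above is short enough to write by hand for squared-error loss. This is a minimal sketch on synthetic data, not GradientBoostingClassifier's actual implementation. The toy sine dataset, the names X_demo and y_demo, and the settings (100 rounds, depth-2 trees, learning rate 0.1) are all assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (synthetic, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate, n_rounds = 0.1, 100
pred = np.full_like(y_demo, y_demo.mean())      # initial model: a constant
mse_history = [np.mean((y_demo - pred) ** 2)]
stages = []

for _ in range(n_rounds):
    residuals = y_demo - pred                   # negative gradient of 0.5*(y - F)^2 w.r.t. F
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                 # weak learner fits the pseudo-residuals
    pred += learning_rate * tree.predict(X_demo)  # shrunken step in function space
    stages.append(tree)
    mse_history.append(np.mean((y_demo - pred) ** 2))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

For other losses only the residuals line changes: it becomes the negative gradient of that loss at the current predictions. The training error falls every round, which is why held-out data is needed to choose the number of rounds.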

5. AdaBoost (Adaptive Boosting)

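Description: The first widely used boosting algorithm, introduced by Freund and Schapire. It starts with equal weights on all training samples. After each weak learner is trained, it increases the weights of misclassified samples and decreases the weights of correctly classified ones, so the next learner concentrates on the hard cases. The final prediction is a weighted vote of all weak learners, with more accurate learners receiving larger weights.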

When to Use

  • When you have a simple base learner to boost
  • Binary classification problems
  • When you want a boosting method with few hyperparameters to tune (note that AdaBoost is sensitive to noisy labels and outliers)
  • Face detection (historically important: Viola-Jones algorithm)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Decision stumps
    n_estimators=200,
    learning_rate=0.1,
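    algorithm='SAMME',       # Version-dependent: recent scikit-learn releases deprecate this parameter; drop it if you see a warning or error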
    random_state=42
)
model.fit(X_train, y_train)

print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")

# Show how each estimator contributes
print(f"Estimator weights (first 10): {model.estimator_weights_[:10].round(3)}")
print(f"Estimator errors (first 10): {model.estimator_errors_[:10].round(3)}")
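
A single round of the reweighting described above can be traced on ten samples. The sketch uses the textbook binary formulation with labels in {-1, +1}. scikit-learn's SAMME implementation scales the estimator weight differently, so the numbers will not match estimator_weights_ exactly. The predictions in y_pred are a made-up stump output.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])
y_pred = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, 1])  # hypothetical stump: 3 mistakes
w = np.full(len(y_true), 1 / len(y_true))                # start with equal weights

miss = y_pred != y_true
err = w[miss].sum()                          # weighted error of this weak learner
alpha = 0.5 * np.log((1 - err) / err)        # learner's say in the final vote

# Misclassified samples (y*h = -1) are scaled up, correct ones (y*h = +1) scaled down
w_new = w * np.exp(-alpha * y_true * y_pred)
w_new /= w_new.sum()                         # renormalize to a distribution

print(f"weighted error: {err:.3f}, alpha: {alpha:.3f}")
print(f"weight on misclassified samples after update: {w_new[miss].sum():.3f}")
```

After the update, the misclassified samples carry exactly half of the total weight. The next weak learner therefore cannot do well by repeating the same mistakes.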

6. Stacking (Stacked Generalization)

Description: Uses predictions from multiple diverse base models as input features for a meta-learner (second-level model), which learns how best to combine them. The meta-learner is trained on out-of-fold predictions generated by cross-validation, so it never sees a prediction that a base model made on its own training data. This avoids data leakage.

When to Use

  • When you want to combine fundamentally different model types
  • When individual models have complementary strengths
  • Competition settings where marginal improvements matter
  • When you have enough data for reliable cross-validation
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Define base models (level-0)
base_models = [
    ('dt', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('svc', SVC(kernel='rbf', probability=True, random_state=42)),
]

# Define meta-learner (level-1)
meta_learner = LogisticRegression(max_iter=1000)

model = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_learner,
    cv=5,                    # Cross-validation folds for base predictions
    stack_method='auto',     # Use predict_proba if available
    n_jobs=-1
)
model.fit(X_train, y_train)

print(f"Stacking Accuracy: {model.score(X_test, y_test):.4f}")

# Compare with individual models
for name, est in base_models:
    est.fit(X_train, y_train)
    print(f"  {name} alone: {est.score(X_test, y_test):.4f}")
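
The out-of-fold mechanism that StackingClassifier handles internally can be sketched with cross_val_predict. This is a simplified illustration under assumptions, not the library's exact procedure: it uses only two of the base models and feeds just the positive-class probability to the meta-learner. The names level0, oof, and test_feats are made up.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

level0 = [
    DecisionTreeClassifier(max_depth=5, random_state=42),
    KNeighborsClassifier(n_neighbors=5),
]

# Out-of-fold probabilities: each training row is predicted by a model
# fitted on the other folds, so the meta-features carry no label leakage.
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in level0
])
meta = LogisticRegression(max_iter=1000).fit(oof, y_tr)

# At prediction time the base models are refit on the full training set
test_feats = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in level0
])
stack_acc = meta.score(test_feats, y_te)
print(f"Manual stacking accuracy: {stack_acc:.4f}")
```

Training the meta-learner on in-sample predictions instead would reward whichever base model memorizes the training set best, typically the unpruned tree.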

7. Voting Classifier

Description: Combines predictions from multiple different model types. Hard voting takes a majority vote on predicted classes. Soft voting averages predicted probabilities, which often works better when the models are reasonably well calibrated. Unlike stacking, voting does not train a meta-learner; it simply aggregates.

When to Use

  • When you have several well-tuned but different models
  • Quick ensemble without the complexity of stacking
  • When models make different types of errors (low correlation in predictions)
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Hard Voting
model_hard = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('svc', SVC(kernel='rbf', random_state=42)),
    ],
    voting='hard',
    n_jobs=-1
)
model_hard.fit(X_train, y_train)
print(f"Hard Voting Accuracy: {model_hard.score(X_test, y_test):.4f}")

# Soft Voting (requires probability support)
model_soft = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('svc', SVC(kernel='rbf', probability=True, random_state=42)),
    ],
    voting='soft',
    weights=[1, 2, 1],      # Give more weight to Random Forest
    n_jobs=-1
)
model_soft.fit(X_train, y_train)
print(f"Soft Voting Accuracy: {model_soft.score(X_test, y_test):.4f}")
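
Hard and soft voting can disagree on the same inputs, which a small hand-computed example shows. The probabilities below are invented for illustration: three hypothetical models score one sample over two classes.

```python
import numpy as np

# Rows: models, columns: P(class 0), P(class 1) for a single sample
probas = np.array([
    [0.45, 0.55],   # model A: weakly prefers class 1
    [0.40, 0.60],   # model B: weakly prefers class 1
    [0.95, 0.05],   # model C: strongly prefers class 0
])

# Hard voting: each model casts one vote for its top class
hard_votes = probas.argmax(axis=1)
hard_pred = np.bincount(hard_votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the top class
soft_pred = probas.mean(axis=0).argmax()

# Weighted soft voting, mirroring weights=[1, 2, 1]
weights = np.array([1, 2, 1])
weighted_pred = np.average(probas, axis=0, weights=weights).argmax()

print(f"hard: {hard_pred}, soft: {soft_pred}, weighted soft: {weighted_pred}")
```

Hard voting picks class 1 because two models lean that way. Soft voting picks class 0 because model C's confidence outweighs the two weak preferences, and this is only an advantage if that confidence is trustworthy.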