Intermediate

Auto-sklearn

Auto-sklearn wraps scikit-learn with Bayesian optimization and meta-learning to automatically search for a strong ML pipeline for your data within a fixed time budget.

What is Auto-sklearn?

Auto-sklearn is an automated machine learning toolkit built on top of scikit-learn. It searches over 15 classifiers, 14 feature preprocessing methods, and their hyperparameters using three key innovations:

  • Meta-learning: Uses knowledge from 140+ previous datasets to warm-start the search. It identifies which algorithms work well on datasets similar to yours.
  • Bayesian optimization (SMAC): Sequentially selects pipeline configurations to evaluate, learning from past evaluations to focus on promising regions.
  • Automatic ensemble construction: Combines the top-performing models into an ensemble for better generalization.

Python - Auto-sklearn Classification
import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
# Fix the seed and stratify so the split is reproducible and class-balanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Auto-sklearn searches preprocessing, models, and hyperparameters within the budget
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # 5 minutes total search budget
    per_run_time_limit=30,        # At most 30 seconds per pipeline evaluation
    n_jobs=-1,                    # Use all CPU cores
    ensemble_size=20,             # Up to 20 ensemble members (0.15+ prefers ensemble_kwargs={"ensemble_size": 20})
    memory_limit=4096,            # 4 GB memory limit per job, not in total
)

automl.fit(X_train, y_train)
predictions = automl.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.4f}")

# Inspect the models found
print(automl.leaderboard())   # Ranked table of the evaluated pipelines in the ensemble
print(automl.show_models())   # Details of each ensemble member
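To make the meta-learning warm start more concrete, here is a minimal sketch in plain Python. It is not Auto-sklearn's implementation: the three meta-features, the Euclidean distance, the `KNOWLEDGE_BASE` entries, and the configuration dictionaries are all hypothetical and chosen only for illustration. The idea it shows is the lookup itself: describe the new dataset with a few numbers, find the most similar stored datasets, and try their best-known configurations first.

```python
import math

def meta_features(X, y):
    """Describe a dataset with a tiny, illustrative meta-feature vector."""
    n_samples, n_features = len(X), len(X[0])
    return [math.log(n_samples), math.log(n_features), float(len(set(y)))]

def warm_start_configs(X, y, knowledge_base, k=2):
    """Return the k most similar stored datasets and their best-known configs."""
    target = meta_features(X, y)
    ranked = sorted(
        knowledge_base,
        key=lambda name: math.dist(target, knowledge_base[name]["meta"]),
    )
    nearest = ranked[:k]
    return nearest, [knowledge_base[name]["best_config"] for name in nearest]

# Hypothetical knowledge base: meta-features plus the best configuration
# found earlier on each stored dataset (names and values are made up).
KNOWLEDGE_BASE = {
    "small_binary": {
        "meta": [math.log(500), math.log(30), 2.0],
        "best_config": {"classifier": "random_forest", "max_features": 0.5},
    },
    "wide_multiclass": {
        "meta": [math.log(2000), math.log(500), 10.0],
        "best_config": {"classifier": "linear_svc", "C": 1.0},
    },
    "large_binary": {
        "meta": [math.log(100000), math.log(20), 2.0],
        "best_config": {"classifier": "gradient_boosting", "learning_rate": 0.1},
    },
}

# A stand-in dataset with the same shape as the breast cancer data
# (569 rows, 30 features, 2 classes); the values themselves are irrelevant here.
X = [[0.0] * 30 for _ in range(569)]
y = [i % 2 for i in range(569)]

nearest, initial_configs = warm_start_configs(X, y, KNOWLEDGE_BASE, k=2)
print("Most similar datasets:", nearest)
print("Configurations to try first:", initial_configs)
```

In a real system the meta-feature vector is much richer and is normalized before distances are computed, so treat the numbers above as placeholders.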

How Auto-sklearn Works

  1. Meta-Learning Initialization

    Computes dataset meta-features (number of features, classes, skewness, etc.) and identifies similar datasets from its knowledge base. Starts search with configurations that worked well on similar data.

  2. Bayesian Optimization (SMAC)

    Uses Sequential Model-based Algorithm Configuration to efficiently explore the space of preprocessors, algorithms, and hyperparameters.

  3. Pipeline Construction

    Each candidate pipeline includes: data preprocessing → feature preprocessing → classifier, all with tuned hyperparameters.

  4. Ensemble Selection

    After the search budget is exhausted, builds an ensemble using greedy forward selection from the best models found.
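The greedy forward selection in step 4 can be sketched in a few lines of plain Python. This is a simplified illustration, not Auto-sklearn's code: the model names and validation probabilities are made up, the task is assumed to be binary, and accuracy is assumed as the metric. At each step the model whose addition gives the best averaged validation score is added; a model may be picked more than once, which gives it a larger weight.

```python
def accuracy(probs, y_true):
    """Accuracy of positive-class probabilities thresholded at 0.5."""
    preds = [1 if p >= 0.5 else 0 for p in probs]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

def average(prob_lists):
    """Element-wise mean of several probability lists."""
    return [sum(col) / len(prob_lists) for col in zip(*prob_lists)]

def greedy_ensemble_selection(val_probs, y_val, ensemble_size=3):
    """Greedy forward selection with replacement over validation predictions."""
    chosen = []
    final_score = 0.0
    for _ in range(ensemble_size):
        best_name, best_score = None, -1.0
        for name in sorted(val_probs):  # sorted for deterministic tie-breaking
            candidate = [val_probs[m] for m in chosen] + [val_probs[name]]
            score = accuracy(average(candidate), y_val)
            if score > best_score:
                best_name, best_score = name, score
        chosen.append(best_name)
        final_score = best_score
    # A model's weight is the fraction of ensemble slots it occupies
    weights = {name: chosen.count(name) / len(chosen) for name in set(chosen)}
    return weights, final_score

# Hypothetical validation labels and per-model positive-class probabilities
y_val = [1, 1, 1, 0, 0, 0]
val_probs = {
    "model_a": [0.9, 0.8, 0.4, 0.1, 0.2, 0.3],
    "model_b": [0.6, 0.4, 0.9, 0.3, 0.6, 0.2],
    "model_c": [0.3, 0.6, 0.7, 0.7, 0.4, 0.6],
}

weights, score = greedy_ensemble_selection(val_probs, y_val, ensemble_size=3)
print("Ensemble weights:", weights)
print("Validation accuracy:", score)
```

In this toy data no single model classifies every validation sample correctly, but averaging model_a with model_b does, which is the effect ensembling relies on.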

Auto-sklearn 2.0 Improvements

  • Portfolio-based initialization: Improved warm-starting strategy using a portfolio of complementary configurations.
  • Early stopping: Can allocate compute with successive halving, which evaluates pipelines on a small budget first and stops those that show poor initial performance, saving compute.
  • Better ensembles: Improved ensemble construction leads to more robust predictions.
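The early-stopping idea can be illustrated with successive halving, sketched below in plain Python. This is an assumption-laden toy, not Auto-sklearn's implementation: the `evaluate` function, the `quality` field, and the configuration names are hypothetical stand-ins for training a pipeline on a given budget. Each round scores all surviving configurations on a small budget, keeps the best fraction, and multiplies the budget for the survivors.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the top 1/eta of configurations per round, growing the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of configurations evaluated) per round
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        history.append((budget, len(survivors)))
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], history

def evaluate(config, budget):
    """Toy learning curve: the score approaches the config's quality as budget grows."""
    return config["quality"] * budget / (budget + 1)

# Hypothetical candidate configurations with a hidden 'quality'
configs = [{"name": f"cfg{i}", "quality": 0.50 + 0.05 * i} for i in range(8)]

winner, history = successive_halving(configs, evaluate, min_budget=1, eta=2)
print("Winner:", winner["name"])
print("Rounds (budget, configs evaluated):", history)
```

Only the final two configurations ever receive the largest budget, which is where the compute savings come from. The risk, in general, is that a configuration that learns slowly is discarded too early.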

Limitations

  • Linux only: Auto-sklearn officially supports Linux. Windows is not supported natively (use WSL, Docker, or a virtual machine), and macOS installations may work but are not officially supported.
  • Tabular data only: Not designed for images, text, or time series. Use specialized tools for those.
  • Scikit-learn ecosystem: Limited to algorithms available in scikit-learn. Does not include XGBoost, LightGBM, or deep learning models.
  • Memory intensive: Training many models simultaneously requires significant RAM.

Key takeaway: Auto-sklearn is an excellent choice for tabular classification and regression tasks within the scikit-learn ecosystem. Its meta-learning capabilities give it a head start, and automatic ensembling often produces strong results with minimal configuration.