Auto-sklearn
Auto-sklearn wraps scikit-learn with Bayesian optimization and meta-learning to automatically find the best ML pipeline for your data.
What is Auto-sklearn?
Auto-sklearn is an automated machine learning toolkit built on top of scikit-learn. It searches over 15 classifiers, 14 feature preprocessing methods, and their hyperparameters by combining three components:
- Meta-learning: Uses knowledge from 140+ previous datasets to warm-start the search. It identifies which algorithms work well on datasets similar to yours.
- Bayesian optimization (SMAC): Sequentially selects pipeline configurations to evaluate, learning from past evaluations to focus on promising regions.
- Automatic ensemble construction: Combines the top-performing models into an ensemble for better generalization.
```python
import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Fix the seed so the train/test split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Auto-sklearn handles model selection and tuning automatically
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # 5 minutes total
    per_run_time_limit=30,        # 30 seconds per model
    n_jobs=-1,                    # Use all CPU cores
    # Ensemble of size 20; recent releases may expect
    # ensemble_kwargs={"ensemble_size": 20} instead
    ensemble_size=20,
    memory_limit=4096,            # 4 GB memory limit (value is in MB)
)

automl.fit(X_train, y_train)
predictions = automl.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.4f}")

# Inspect the models found
print(automl.leaderboard())
print(automl.show_models())
```
How Auto-sklearn Works
Meta-Learning Initialization
Auto-sklearn computes meta-features of your dataset (number of features, number of classes, skewness, and so on), finds the most similar datasets in its knowledge base, and starts the search with configurations that performed well on them.
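The warm-start idea can be sketched in a few lines of plain Python. This is only an illustration, not Auto-sklearn's internal code: the three meta-features, the Euclidean distance, and the `KNOWLEDGE_BASE` entries (dataset names and "best" configurations) are all made up for the example.

```python
import math

def meta_features(X, y):
    """Return a tiny, illustrative meta-feature vector for a dataset."""
    n_samples = len(X)
    n_features = len(X[0])
    n_classes = len(set(y))
    return [math.log(n_samples), math.log(n_features), float(n_classes)]

# Hypothetical knowledge base: meta-features of previously seen datasets
# plus the configuration that worked best on each of them.
KNOWLEDGE_BASE = {
    "small_binary": {
        "meta": [math.log(500), math.log(20), 2.0],
        "best_config": {"classifier": "random_forest"},
    },
    "wide_multiclass": {
        "meta": [math.log(2000), math.log(500), 10.0],
        "best_config": {"classifier": "linear_svm"},
    },
    "large_binary": {
        "meta": [math.log(100000), math.log(30), 2.0],
        "best_config": {"classifier": "gradient_boosting"},
    },
}

def nearest_datasets(target, knowledge_base, k=2):
    """Rank known datasets by distance to the target in meta-feature space."""
    ranked = sorted(
        knowledge_base,
        key=lambda name: math.dist(target, knowledge_base[name]["meta"]),
    )
    return ranked[:k]

def warm_start_configs(X, y, knowledge_base=KNOWLEDGE_BASE, k=2):
    """Return the best-known configurations of the k most similar datasets."""
    target = meta_features(X, y)
    names = nearest_datasets(target, knowledge_base, k)
    return [knowledge_base[name]["best_config"] for name in names]

# Usage: a 600 x 25 binary dataset is closest to the "small_binary" entry.
X_demo = [[0.0] * 25 for _ in range(600)]
y_demo = [i % 2 for i in range(600)]
initial_configs = warm_start_configs(X_demo, y_demo)
```

The configurations returned here would be evaluated first, before the optimizer starts proposing its own candidates.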
Bayesian Optimization (SMAC)
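Uses Sequential Model-based Algorithm Configuration (SMAC) to explore the space of preprocessors, algorithms, and hyperparameters efficiently. A surrogate model fitted to past evaluations predicts which configurations are worth trying next, so expensive training runs go to the most promising candidates.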
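To make the loop concrete, here is a toy sequential model-based optimizer over a single hyperparameter. It is a deliberately simplified stand-in and not SMAC itself: SMAC uses a random-forest surrogate and an expected-improvement criterion, while this sketch uses a nearest-neighbor average and a distance-based exploration bonus. The `objective` function is a made-up placeholder for "validation loss of a pipeline".

```python
import random

def objective(x):
    """Placeholder for the validation loss of the pipeline configured by x."""
    return (x - 0.3) ** 2

def surrogate(x, history, k=3):
    """Predict the loss at x as the mean loss of the k nearest evaluated points."""
    nearest = sorted(history, key=lambda h: abs(h[0] - x))[:k]
    return sum(loss for _, loss in nearest) / len(nearest)

def acquisition(x, history, kappa=0.5):
    """Lower is better: predicted loss minus a bonus for unexplored regions."""
    distance_to_nearest = min(abs(hx - x) for hx, _ in history)
    return surrogate(x, history) - kappa * distance_to_nearest

def optimize(n_init=3, n_iter=20, n_candidates=200, seed=0):
    rng = random.Random(seed)
    # Start from a few random evaluations.
    history = [(x, objective(x)) for x in (rng.random() for _ in range(n_init))]
    for _ in range(n_iter):
        # Score cheap candidates with the surrogate, evaluate only the best one.
        candidates = [rng.random() for _ in range(n_candidates)]
        x_next = min(candidates, key=lambda x: acquisition(x, history))
        history.append((x_next, objective(x_next)))
    # Return the best (x, loss) pair seen so far.
    return min(history, key=lambda h: h[1])

best_x, best_loss = optimize()
```

The key property carries over to the real thing: only one expensive evaluation happens per iteration, and the choice of where to evaluate is driven by everything learned so far.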
Pipeline Construction
Each candidate pipeline includes: data preprocessing → feature preprocessing → classifier, all with tuned hyperparameters.
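One way to picture a candidate is as a sample from a conditional search space: choosing a component activates only that component's hyperparameters. The sketch below uses a small, made-up `SEARCH_SPACE` with hypothetical component names and ranges; Auto-sklearn's real space is much larger and is defined differently.

```python
import random

# Hypothetical, heavily reduced search space for illustration only.
SEARCH_SPACE = {
    "data_preprocessing": {
        "imputation": ["mean", "median", "most_frequent"],
        "rescaling": ["none", "standardize", "minmax"],
    },
    "feature_preprocessing": {
        "none": {},
        "pca": {"keep_variance": (0.5, 0.9999)},
        "select_percentile": {"percentile": (1.0, 99.0)},
    },
    "classifier": {
        "random_forest": {"max_features": (0.1, 1.0)},
        "k_nearest_neighbors": {"n_neighbors": (1, 100)},
    },
}

def sample_range(rng, bounds):
    """Sample an int for integer bounds, otherwise a float."""
    low, high = bounds
    if isinstance(low, int) and isinstance(high, int):
        return rng.randint(low, high)
    return rng.uniform(low, high)

def sample_component(rng, choices):
    """Pick one component and sample only the hyperparameters it owns."""
    name = rng.choice(sorted(choices))
    params = {key: sample_range(rng, b) for key, b in choices[name].items()}
    return {"name": name, "params": params}

def sample_pipeline(rng):
    """Draw one pipeline: data preprocessing, feature preprocessing, classifier."""
    data_steps = SEARCH_SPACE["data_preprocessing"]
    return {
        "data_preprocessing": {
            step: rng.choice(options) for step, options in data_steps.items()
        },
        "feature_preprocessing": sample_component(
            rng, SEARCH_SPACE["feature_preprocessing"]
        ),
        "classifier": sample_component(rng, SEARCH_SPACE["classifier"]),
    }

config = sample_pipeline(random.Random(0))
```

Random sampling is used here only to show the structure of a configuration; in Auto-sklearn the optimizer, not a random draw, decides which configuration to try next.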
Ensemble Selection
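After the search budget is exhausted, builds an ensemble from the best models found using greedy forward selection: starting from an empty ensemble, it repeatedly adds the model that most improves validation performance. A model can be added more than once, which gives it a larger weight.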
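Greedy forward selection is short enough to write out. The sketch below assumes binary classification, accuracy as the metric, and that each model's validation predictions are available as a list of probabilities; the function names and the toy predictions are illustrative, not Auto-sklearn's API.

```python
def accuracy(probabilities, labels):
    """Accuracy of thresholded probabilities against 0/1 labels."""
    predictions = [1 if p >= 0.5 else 0 for p in probabilities]
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

def average(prediction_sets):
    """Average several models' probabilities sample by sample."""
    return [sum(column) / len(column) for column in zip(*prediction_sets)]

def greedy_ensemble(model_predictions, labels, ensemble_size=5):
    """Greedy forward selection with replacement.

    model_predictions maps a model name to its P(class=1) on validation data.
    Returns a dict of model name -> ensemble weight.
    """
    chosen = []
    for _ in range(ensemble_size):
        # Add the model whose inclusion gives the best averaged prediction.
        best_name = max(
            sorted(model_predictions),
            key=lambda name: accuracy(
                average([model_predictions[m] for m in chosen + [name]]),
                labels,
            ),
        )
        chosen.append(best_name)
    # A model picked several times gets a proportionally larger weight.
    return {name: chosen.count(name) / len(chosen) for name in set(chosen)}

# Toy validation set: "a" and "b" each make one different mistake.
labels = [1, 0, 1, 0]
model_predictions = {
    "a": [0.9, 0.1, 0.4, 0.2],
    "b": [0.6, 0.6, 0.9, 0.4],
    "c": [0.2, 0.8, 0.3, 0.7],
}
weights = greedy_ensemble(model_predictions, labels, ensemble_size=2)
```

In this toy case the two models with complementary errors are combined, and the model that is wrong everywhere is left out.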
Auto-sklearn 2.0 Improvements
- Portfolio-based initialization: Improved warm-starting strategy using a portfolio of complementary configurations.
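- Early stopping: Stops training models that show poor initial performance, saving compute. The Auto-sklearn 2.0 paper does this with successive halving, which gives every candidate a small budget first and lets only the best continue.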
- Better ensembles: Improved ensemble construction leads to more robust predictions.
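One common scheme for the early-stopping idea above is successive halving. The sketch below is a generic toy version, not Auto-sklearn 2.0's implementation: the `toy_loss` function, the `asymptote` field, and the budget units are assumptions made for the example.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round and grow the budget by eta.

    evaluate(config, budget) must return a loss (lower is better).
    Returns the winning config and the total budget spent.
    """
    survivors = list(configs)
    budget = min_budget
    spent = 0
    while len(survivors) > 1:
        spent += len(survivors) * budget
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[: max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0], spent

def toy_loss(config, budget):
    """Made-up learning curve: the loss approaches an asymptote as budget grows."""
    return config["asymptote"] + 1.0 / budget

# Eight hypothetical candidates that differ only in their final loss.
candidates = [{"name": f"cfg{i}", "asymptote": 0.1 * (i + 1)} for i in range(8)]
winner, total_budget = successive_halving(candidates, toy_loss)
```

With eight candidates the loop spends 24 budget units in total, compared with 32 if every candidate were trained at the final budget of 4. The saving grows with the number of candidates, at the risk of discarding a model that starts slowly but would have ended up best.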
Limitations
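- Linux only: Auto-sklearn officially supports only Linux. On Windows it does not run natively, so use WSL, Docker, or a virtual machine. macOS is not officially supported, and installation there may or may not work.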
- Tabular data only: Not designed for images, text, or time series. Use specialized tools for those.
- Scikit-learn ecosystem: Limited to algorithms available in scikit-learn. Does not include XGBoost, LightGBM, or deep learning models.
- Memory intensive: Training many models simultaneously requires significant RAM.