Intermediate

Pipelines

Build end-to-end ML workflows that chain preprocessing and modeling steps, prevent data leakage, and simplify deployment with Pipeline, ColumnTransformer, and FeatureUnion.

Why Pipelines?

Pipelines solve three critical problems in ML workflows: preventing data leakage during cross-validation, ensuring reproducibility, and simplifying deployment by packaging all transformations with the model.
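The leakage problem is worth seeing concretely. A minimal sketch on synthetic data: scaling the full dataset before cross-validation lets the scaler "see" the test folds, while putting the scaler inside a pipeline re-fits it on each training fold only (data, model, and fold count here are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Leaky: the scaler is fit on ALL rows, including future CV test folds
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(SVC(), X_leaky, y, cv=5)

# Safe: the scaler is re-fit on each training fold inside the pipeline
pipe = make_pipeline(StandardScaler(), SVC())
safe_scores = cross_val_score(pipe, X, y, cv=5)

print(leaky_scores.mean(), safe_scores.mean())
```

On easy synthetic data the two means may be close, but on real datasets the leaky version tends to overestimate performance.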

Basic Pipeline

Python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Explicit Pipeline with named steps
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SVC(kernel="rbf"))
])

# Shorthand with make_pipeline (auto-names steps)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Use like any estimator
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)

# Cross-validate the entire pipeline (no leakage!)
scores = cross_val_score(pipe, X, y, cv=5)
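Once fitted, the individual steps remain accessible by name, which is useful for inspecting learned parameters or reconfiguring the pipeline. A small self-contained sketch (the step names `scaler` and `classifier` match the explicit Pipeline above; the synthetic data is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("classifier", SVC())])
pipe.fit(X, y)

# Fitted steps are reachable by name or index
scaler = pipe.named_steps["scaler"]   # same object as pipe["scaler"] or pipe[0]
print(scaler.mean_.shape)             # per-feature means learned during fit

# set_params uses the same step__param convention as grid search
pipe.set_params(classifier__C=10.0)
```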

ColumnTransformer

Apply different transformations to different columns — essential for real-world datasets with mixed types:

Python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ["age", "income", "balance"]
categorical_features = ["city", "occupation"]

preprocessor = ColumnTransformer([
    ("num", make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler()
    ), numeric_features),
    ("cat", make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore")
    ), categorical_features)
])

# Full pipeline: preprocess + model
full_pipe = make_pipeline(preprocessor, RandomForestClassifier())
full_pipe.fit(X_train, y_train)

Hyperparameter Tuning with Pipelines

Python
from sklearn.model_selection import GridSearchCV

# Access nested parameters with double underscore
param_grid = {
    "columntransformer__num__simpleimputer__strategy": ["mean", "median"],
    "randomforestclassifier__n_estimators": [100, 200],
    "randomforestclassifier__max_depth": [5, 10, None]
}

grid = GridSearchCV(full_pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
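After fitting, the search exposes the winning configuration via `best_params_`, `best_score_`, and a refit `best_estimator_`. A runnable sketch using a deliberately simpler pipeline and synthetic data so it stands alone (the `svc__C` name follows `make_pipeline`'s auto-naming):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)

print(grid.best_params_)           # e.g. {'svc__C': 1}
print(round(grid.best_score_, 3))  # mean CV accuracy of the best setting
best_pipe = grid.best_estimator_   # a refit Pipeline, ready to predict
```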

FeatureUnion

Combine multiple feature extraction pipelines in parallel:

Python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler

# Combine different feature extraction methods
features = FeatureUnion([
    ("pca", PCA(n_components=10)),
    ("kbest", SelectKBest(k=5))
])

pipe = make_pipeline(StandardScaler(), features, LogisticRegression())

Pipeline Benefits: 1) No data leakage in cross-validation. 2) A single object to serialize for deployment. 3) Reproducible from raw data to prediction. 4) Easy to swap components.
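Benefit 2 in practice: because the fitted pipeline is one object, the whole preprocess-plus-predict path can be persisted with `joblib` and reloaded at serving time. A minimal sketch (file path and model choice are arbitrary; `joblib` ships as a scikit-learn dependency):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# One artifact captures preprocessing and model together
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipe, path)

# At serving time: load and predict on raw, unscaled features
loaded = joblib.load(path)
preds = loaded.predict(X[:5])
```

Note that loading a pickled pipeline requires the same scikit-learn version (or a compatible one) that saved it.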

Next: Advanced Features

Learn to build custom estimators, use ensemble methods, and leverage scikit-learn's advanced capabilities.