Intermediate
Pipelines
Build end-to-end ML workflows that chain preprocessing and modeling steps, prevent data leakage, and simplify deployment with Pipeline, ColumnTransformer, and FeatureUnion.
Why Pipelines?
Pipelines solve three critical problems in ML workflows: preventing data leakage during cross-validation, ensuring reproducibility, and simplifying deployment by packaging all transformations with the model.
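To make the leakage point concrete, here is a minimal sketch on hypothetical synthetic data: fitting a scaler on the full dataset before cross-validation lets each test fold influence the scaling statistics, while a Pipeline refits the scaler on each training fold only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

# Leaky: the scaler sees every row, including each fold's test data
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(SVC(), X_scaled, y, cv=5)

# Safe: the Pipeline refits the scaler inside each training fold
pipe = make_pipeline(StandardScaler(), SVC())
safe_scores = cross_val_score(pipe, X, y, cv=5)
```

On this toy data the two score arrays may look similar; the difference matters most when the preprocessing step learns strong statistics from few samples.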
Basic Pipeline
Python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Explicit Pipeline with named steps
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SVC(kernel="rbf"))
])

# Shorthand with make_pipeline (auto-names steps)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Use like any estimator
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)

# Cross-validate the entire pipeline (no leakage!)
scores = cross_val_score(pipe, X, y, cv=5)
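After fitting, individual steps remain accessible, which is handy for inspecting learned statistics. A short sketch (synthetic data is hypothetical): `make_pipeline` names each step after its lowercased class name, and `named_steps` looks them up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(loc=3.0, scale=2.0, size=(80, 4))
y = (X.sum(axis=1) > 12).astype(int)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
pipe.fit(X, y)

# make_pipeline auto-names steps after the lowercased class name
scaler = pipe.named_steps["standardscaler"]
print(scaler.mean_)  # per-feature means learned during fit

# Steps are also reachable by position via pipe.steps
last_step_name = pipe.steps[1][0]  # "svc"
```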
ColumnTransformer
Apply different transformations to different columns — essential for real-world datasets with mixed types:
Python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

numeric_features = ["age", "income", "balance"]
categorical_features = ["city", "occupation"]

preprocessor = ColumnTransformer([
    ("num", make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler()
    ), numeric_features),
    ("cat", make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore")
    ), categorical_features)
])

# Full pipeline: preprocess + model
full_pipe = make_pipeline(preprocessor, RandomForestClassifier())
full_pipe.fit(X_train, y_train)
Hyperparameter Tuning with Pipelines
Python
from sklearn.model_selection import GridSearchCV

# Access nested parameters with double underscore
param_grid = {
    "columntransformer__num__simpleimputer__strategy": ["mean", "median"],
    "randomforestclassifier__n_estimators": [100, 200],
    "randomforestclassifier__max_depth": [5, 10, None]
}
grid = GridSearchCV(full_pipe, param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
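A common stumbling block is guessing the double-underscore names. Rather than guessing, every valid key can be listed with `get_params()`; a minimal sketch on a small standalone pipeline:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC())

# Every key returned here is a valid param_grid entry
keys = sorted(pipe.get_params().keys())
svc_keys = [k for k in keys if k.startswith("svc__")]
print(svc_keys[:3])  # e.g. parameters of the SVC step
```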
FeatureUnion
Combine multiple feature extraction pipelines in parallel:
Python
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression

# Combine different feature extraction methods
features = FeatureUnion([
    ("pca", PCA(n_components=10)),
    ("kbest", SelectKBest(k=5))
])
pipe = make_pipeline(StandardScaler(), features, LogisticRegression())
Pipeline Benefits: 1) No data leakage in cross-validation. 2) Single object to serialize for deployment. 3) Reproducible from raw data to prediction. 4) Easy to swap components.
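The "single object to serialize" benefit can be sketched with `joblib` (scikit-learn's usual persistence tool); the synthetic data and file name here are hypothetical. Dumping the fitted pipeline stores the scaler's statistics and the model's weights in one artifact, so the deployment side needs no separate preprocessing code.

```python
import os
import tempfile
import numpy as np
import joblib
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# One file holds scaler statistics and model weights together
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)
    same_predictions = np.array_equal(pipe.predict(X), restored.predict(X))
```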
Next: Advanced Features
Learn to build custom estimators, use ensemble methods, and leverage scikit-learn's advanced capabilities.
Lilly Tech Systems