ML Pipeline API
Master the Spark ML Pipeline architecture — transformers, estimators, pipelines, evaluators, and the design patterns that make ML workflows reproducible and scalable.
Core Concepts
The Spark ML Pipeline API is built on four key abstractions. Understanding the difference between these is one of the most heavily tested topics on the exam.
Transformer
Takes a DataFrame and produces a new DataFrame with added/modified columns. Has a .transform(df) method. Example: VectorAssembler, StringIndexerModel, fitted StandardScaler.
Estimator
Takes a DataFrame and produces a Transformer (a fitted model). Has a .fit(df) method. Example: LogisticRegression, StringIndexer, StandardScaler (unfitted).
Pipeline
A sequence of stages (transformers and estimators). When you call .fit(), it processes each stage in order, producing a PipelineModel.
Evaluator
Measures model performance. Takes a DataFrame with predictions and returns a metric. Example: BinaryClassificationEvaluator, RegressionEvaluator.
Building a Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
# Stage 1: Estimator - learns category mappings
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
# Stage 2: Transformer - assembles features
assembler = VectorAssembler(
inputCols=["age", "salary", "categoryIndex"],
outputCol="features"
)
# Stage 3: Estimator - trains the model
lr = LogisticRegression(featuresCol="features", labelCol="label")
# Build pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])
# Fit pipeline (runs all stages)
model = pipeline.fit(train_df)
# Transform new data (applies all fitted stages)
predictions = model.transform(test_df)
pipeline.fit() is called, Estimator stages call .fit() and are replaced with their resulting Transformers. Transformer stages are passed through unchanged. The result is a PipelineModel where all stages are Transformers.Evaluators
from pyspark.ml.evaluation import (
BinaryClassificationEvaluator,
MulticlassClassificationEvaluator,
RegressionEvaluator
)
# Binary classification
binary_eval = BinaryClassificationEvaluator(
rawPredictionCol="rawPrediction",
labelCol="label",
metricName="areaUnderROC" # or "areaUnderPR"
)
auc = binary_eval.evaluate(predictions)
# Multiclass classification
multi_eval = MulticlassClassificationEvaluator(
predictionCol="prediction",
labelCol="label",
metricName="accuracy" # or "f1", "weightedPrecision", "weightedRecall"
)
# Regression
reg_eval = RegressionEvaluator(
predictionCol="prediction",
labelCol="label",
metricName="rmse" # or "mse", "r2", "mae"
)
Saving and Loading Pipelines
# Save fitted pipeline model
model.save("/path/to/model")
# Load pipeline model
from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load("/path/to/model")
# Use loaded model for predictions
predictions = loaded_model.transform(new_data)
Practice Questions
Question 1
A) Transformers are faster than Estimators
B) Estimators have a fit() method that learns from data and produces a Transformer; Transformers have a transform() method that applies learned parameters
C) Transformers handle numerical data; Estimators handle categorical data
D) There is no difference; they are interchangeable
Answer: B — This is the fundamental distinction. Estimators learn from data (fit) and produce Transformers. Transformers apply learned parameters to data (transform). For example, StandardScaler (Estimator) fits on data to learn mean/std, producing a StandardScalerModel (Transformer) that applies the scaling.
Question 2
pipeline.fit(train_df), what type of object is returned?A) A Pipeline
B) A PipelineModel (where all Estimator stages have been replaced with fitted Transformers)
C) A DataFrame with predictions
D) An Evaluator
Answer: B —
pipeline.fit() returns a PipelineModel. During fitting, each Estimator stage is fit on the data and replaced with its resulting Transformer. The PipelineModel contains only Transformers and can be used to transform new data.
Question 3
A) BinaryClassificationEvaluator with metricName="rmse"
B) RegressionEvaluator with metricName="rmse"
C) MulticlassClassificationEvaluator with metricName="rmse"
D) ClusteringEvaluator with metricName="rmse"
Answer: B — RegressionEvaluator supports metrics like rmse, mse, r2, and mae. BinaryClassificationEvaluator is for binary classification (areaUnderROC, areaUnderPR). MulticlassClassificationEvaluator is for accuracy, f1, etc.
Question 4
A) Estimator — it learns from data
B) Transformer — it directly transforms data without learning
C) Evaluator — it measures performance
D) Pipeline — it chains multiple stages
Answer: B — VectorAssembler is a Transformer. It simply combines specified input columns into a single feature vector column. It does not learn anything from the data (no fit step needed). It has only a transform() method.
Lilly Tech Systems