Intermediate

ML Pipeline API

Master the Spark ML Pipeline architecture — transformers, estimators, pipelines, evaluators, and the design patterns that make ML workflows reproducible and scalable.

Core Concepts

The Spark ML Pipeline API is built on four key abstractions. Understanding the differences between them is one of the most heavily tested topics on the exam.

Transformer

Takes a DataFrame and produces a new DataFrame with added or modified columns. Has a .transform(df) method. Examples: VectorAssembler, StringIndexerModel, StandardScalerModel (a fitted StandardScaler).

Estimator

Takes a DataFrame and produces a Transformer (a fitted model). Has a .fit(df) method. Examples: LogisticRegression, StringIndexer, StandardScaler (before fitting).

Pipeline

A sequence of stages (transformers and estimators). When you call .fit(), it processes each stage in order, producing a PipelineModel.

Evaluator

Measures model performance. Takes a DataFrame with predictions and returns a single metric value. Examples: BinaryClassificationEvaluator, RegressionEvaluator.

💡
Key distinction: An Estimator learns from data (fit) and produces a Transformer. A Transformer applies learned parameters to data (transform). StringIndexer is an Estimator. StringIndexerModel is the Transformer it produces after fitting.

Building a Pipeline

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Stage 1: Estimator - learns category mappings
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")

# Stage 2: Transformer - assembles features
assembler = VectorAssembler(
    inputCols=["age", "salary", "categoryIndex"],
    outputCol="features"
)

# Stage 3: Estimator - trains the model
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Build pipeline
pipeline = Pipeline(stages=[indexer, assembler, lr])

# Fit pipeline (runs all stages)
model = pipeline.fit(train_df)

# Transform new data (applies all fitted stages)
predictions = model.transform(test_df)

Exam trap: When pipeline.fit() is called, each Estimator stage is fit on the data and replaced with the resulting Transformer; Transformer stages pass through unchanged. The result is a PipelineModel in which every stage is a Transformer.

Evaluators

from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator,
    MulticlassClassificationEvaluator,
    RegressionEvaluator
)

# Binary classification
binary_eval = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction",
    labelCol="label",
    metricName="areaUnderROC"  # or "areaUnderPR"
)
auc = binary_eval.evaluate(predictions)

# Multiclass classification
multi_eval = MulticlassClassificationEvaluator(
    predictionCol="prediction",
    labelCol="label",
    metricName="accuracy"  # or "f1", "weightedPrecision", "weightedRecall"
)

# Regression
reg_eval = RegressionEvaluator(
    predictionCol="prediction",
    labelCol="label",
    metricName="rmse"  # or "mse", "r2", "mae"
)

Saving and Loading Pipelines

# Save fitted pipeline model
model.save("/path/to/model")

# Load pipeline model
from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load("/path/to/model")

# Use loaded model for predictions
predictions = loaded_model.transform(new_data)

Practice Questions

Question 1

What is the difference between a Transformer and an Estimator in Spark ML?

A) Transformers are faster than Estimators
B) Estimators have a fit() method that learns from data and produces a Transformer; Transformers have a transform() method that applies learned parameters
C) Transformers handle numerical data; Estimators handle categorical data
D) There is no difference; they are interchangeable

Answer: B — This is the fundamental distinction. Estimators learn from data (fit) and produce Transformers. Transformers apply learned parameters to data (transform). For example, StandardScaler (Estimator) fits on data to learn mean/std, producing a StandardScalerModel (Transformer) that applies the scaling.

Question 2

After calling pipeline.fit(train_df), what type of object is returned?

A) A Pipeline
B) A PipelineModel (where all Estimator stages have been replaced with fitted Transformers)
C) A DataFrame with predictions
D) An Evaluator

Answer: B — pipeline.fit() returns a PipelineModel. During fitting, each Estimator stage is fit on the data and replaced with its resulting Transformer. The PipelineModel contains only Transformers and can be used to transform new data.

Question 3

Which Evaluator class and metric should you use to measure RMSE for a regression model?

A) BinaryClassificationEvaluator with metricName="rmse"
B) RegressionEvaluator with metricName="rmse"
C) MulticlassClassificationEvaluator with metricName="rmse"
D) ClusteringEvaluator with metricName="rmse"

Answer: B — RegressionEvaluator supports metrics like rmse, mse, r2, and mae. BinaryClassificationEvaluator is for binary classification (areaUnderROC, areaUnderPR). MulticlassClassificationEvaluator is for accuracy, f1, etc.

Question 4

Is VectorAssembler a Transformer or an Estimator?

A) Estimator — it learns from data
B) Transformer — it directly transforms data without learning
C) Evaluator — it measures performance
D) Pipeline — it chains multiple stages

Answer: B — VectorAssembler is a Transformer. It simply combines specified input columns into a single feature vector column. It does not learn anything from the data (no fit step needed). It has only a transform() method.