Intermediate

Feature Engineering

Master Spark's feature transformation toolkit — VectorAssembler, StringIndexer, OneHotEncoder, Bucketizer, StandardScaler, and Imputer. Know which class to use for each scenario.

Feature Transformation Classes

Spark ML provides a rich set of feature transformation classes. The exam frequently tests which class to use for a given scenario and whether each is a Transformer or Estimator.

VectorAssembler (Transformer)

Combines multiple columns into a single feature vector column. Required because Spark ML algorithms expect a single features column containing a vector.

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["age", "salary", "years_experience"],
    outputCol="features"
)
df_assembled = assembler.transform(df)
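To make the result concrete, here is a minimal plain-Python sketch of what assembling one row amounts to. It is not Spark's implementation; the helper name assemble_row and the sample row are hypothetical, and a real VectorAssembler produces a Spark vector rather than a Python list.

```python
from typing import List, Mapping, Sequence


def assemble_row(row: Mapping[str, object], input_cols: Sequence[str]) -> List[float]:
    """Concatenate the selected columns of one row into a flat feature list."""
    features: List[float] = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector-like columns are flattened into the output in order
            features.extend(float(v) for v in value)
        else:
            # Numeric and boolean scalars each contribute one entry
            features.append(float(value))
    return features


# Hypothetical row; "name" is ignored because it is not listed in input_cols
row = {"age": 42, "salary": 85000.0, "years_experience": 12, "name": "Ada"}
features = assemble_row(row, ["age", "salary", "years_experience"])
print(features)  # [42.0, 85000.0, 12.0]
```

The order of the output entries follows the order of the input column list, which matters when you later interpret model coefficients.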

StringIndexer (Estimator)

Converts a string column to a numeric index column. By default, indices are assigned in descending order of frequency, so the most frequent string gets index 0.

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="city", outputCol="cityIndex")
model = indexer.fit(df)  # Learns the mapping
df_indexed = model.transform(df)  # Applies the mapping
# Example result if "New York" is the most frequent value: "New York" -> 0.0, "London" -> 1.0, "Tokyo" -> 2.0

💡

Exam tip: StringIndexer is an Estimator because it must learn the string-to-index mapping from data. After fitting, it produces a StringIndexerModel (Transformer). The handleInvalid parameter controls behavior for unseen strings: "error" (default), "skip", or "keep".
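The fit/transform split and the handleInvalid options can be illustrated with a small plain-Python sketch. This is an approximation for study purposes, not Spark's code: the function names are hypothetical, and breaking frequency ties alphabetically is an assumption about the default ordering.

```python
from collections import Counter
from typing import Dict, Iterable, List


def fit_string_index(values: Iterable[str]) -> Dict[str, float]:
    """Learn a label -> index mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def apply_string_index(
    values: Iterable[str], mapping: Dict[str, float], handle_invalid: str = "error"
) -> List[float]:
    """Apply a learned mapping, mimicking the three handleInvalid modes."""
    out: List[float] = []
    for v in values:
        if v in mapping:
            out.append(mapping[v])
        elif handle_invalid == "keep":
            # Unseen labels share one extra index equal to the number of known labels
            out.append(float(len(mapping)))
        elif handle_invalid == "skip":
            # Rows with unseen labels are dropped
            continue
        else:
            raise ValueError(f"Unseen label: {v!r}")
    return out


train = ["New York", "London", "New York", "Tokyo", "London", "New York"]
mapping = fit_string_index(train)
kept = apply_string_index(["Tokyo", "Paris"], mapping, handle_invalid="keep")
skipped = apply_string_index(["Tokyo", "Paris"], mapping, handle_invalid="skip")
print(mapping)  # {'New York': 0.0, 'London': 1.0, 'Tokyo': 2.0}
print(kept)     # [2.0, 3.0]
print(skipped)  # [2.0]
```

The mapping is learned once from the training data and then reused unchanged, which is why the fitted model is a Transformer.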

OneHotEncoder (Estimator)

Converts a numeric index column to a sparse binary vector. Typically used after StringIndexer for categorical features in linear models.

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCol="cityIndex", outputCol="cityVec")
model = encoder.fit(df_indexed)
df_encoded = model.transform(df_indexed)
# With the default dropLast=True and 3 categories: cityIndex 0 -> [1, 0], 1 -> [0, 1], 2 -> [0, 0]

⚠

Common exam trap: OneHotEncoder uses n-1 encoding by default (dropLast=True). For 3 categories, it produces a vector of length 2, not 3. This avoids the dummy variable trap in linear models. The exam tests this.
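The effect of dropLast is easiest to see on a tiny example. The following plain-Python sketch uses a hypothetical helper, one_hot, and returns dense lists for readability, whereas Spark returns sparse vectors.

```python
from typing import List


def one_hot(index: int, num_categories: int, drop_last: bool = True) -> List[float]:
    """Encode a category index as a binary vector, optionally dropping the last slot."""
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} out of range for {num_categories} categories")
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        # The last category has no slot when drop_last is True, so it stays all zeros
        vec[index] = 1.0
    return vec


print(one_hot(0, 3))                   # [1.0, 0.0]
print(one_hot(2, 3))                   # [0.0, 0.0]
print(one_hot(2, 3, drop_last=False))  # [0.0, 0.0, 1.0]
```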

Bucketizer (Transformer)

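Converts continuous values into discrete bins based on user-specified boundaries (splits). Each bin includes its lower bound and excludes its upper bound, except the last bin, which includes both. Values outside the outermost splits cause an error, so use float("-inf") and float("inf") as the outer splits when the range is unbounded.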

from pyspark.ml.feature import Bucketizer

bucketizer = Bucketizer(
    splits=[0, 18, 35, 55, 100],  # Age bins: [0, 18), [18, 35), [35, 55), [55, 100] (last bin includes 100)
    inputCol="age",
    outputCol="ageBucket"
)
df_bucketed = bucketizer.transform(df)
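The boundary rules can be checked with a short plain-Python sketch. The helper bucketize is hypothetical and only imitates the documented behavior for ordinary numeric values; NaN handling is left out.

```python
from bisect import bisect_right
from typing import Sequence


def bucketize(value: float, splits: Sequence[float]) -> float:
    """Return the bucket index for value, given strictly increasing splits."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range of the splits")
    if value == splits[-1]:
        # The last bucket also includes its upper bound
        return float(len(splits) - 2)
    # Every other bucket is [lower, upper)
    return float(bisect_right(splits, value) - 1)


splits = [0, 18, 35, 55, 100]
buckets = [bucketize(v, splits) for v in (0, 17.9, 18, 54, 100)]
print(buckets)  # [0.0, 0.0, 1.0, 2.0, 3.0]
```

Five split points define four buckets, so the bucket indices run from 0 to 3.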

StandardScaler (Estimator)

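Standardizes each feature by subtracting the mean and dividing by the standard deviation (z-score normalization). By default only the scaling is applied (withStd=True, withMean=False), so centering must be enabled explicitly.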

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(
    inputCol="features",
    outputCol="scaledFeatures",
    withStd=True,
    withMean=True
)
scaler_model = scaler.fit(df_assembled)  # Learns mean and std from the assembled "features" column
df_scaled = scaler_model.transform(df_assembled)
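As a rough illustration of what is learned and what is applied, here is a plain-Python sketch for a single feature. The function names are hypothetical, the sample standard deviation (n - 1 denominator) is used, and the column is assumed not to be constant, so the standard deviation is non-zero.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def fit_standard_scaler(values: Sequence[float]) -> Tuple[float, float]:
    """Learn the mean and sample standard deviation of one feature."""
    return mean(values), stdev(values)


def standardize(
    values: Sequence[float],
    mu: float,
    sigma: float,
    with_mean: bool = True,
    with_std: bool = True,
) -> List[float]:
    """Apply the learned statistics; assumes sigma is non-zero."""
    out: List[float] = []
    for v in values:
        if with_mean:
            v = v - mu      # center on the learned mean
        if with_std:
            v = v / sigma   # scale to unit standard deviation
        out.append(v)
    return out


salaries = [40000.0, 50000.0, 60000.0]
mu, sigma = fit_standard_scaler(salaries)
scaled = standardize(salaries, mu, sigma)
scaled_only = standardize(salaries, mu, sigma, with_mean=False)
print(scaled)       # [-1.0, 0.0, 1.0]
print(scaled_only)  # [4.0, 5.0, 6.0]
```

Leaving centering off keeps zero entries at zero, which preserves sparsity in sparse feature vectors.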

Imputer (Estimator)

Replaces missing values (null or NaN) in numeric columns with the column's mean, median, or mode.

from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["age", "salary"],
    outputCols=["age_imputed", "salary_imputed"],
    strategy="median"  # default is "mean"; "mode" is also accepted in Spark 3.1+
)
imputer_model = imputer.fit(df)
df_imputed = imputer_model.transform(df)
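The following plain-Python sketch shows the idea for one column. The function names are hypothetical, and the median here is exact, whereas Spark computes an approximate median, so results can differ on large or even-sized columns.

```python
import math
from statistics import mean, median
from typing import List, Optional, Sequence


def is_missing(v: Optional[float]) -> bool:
    """Treat both None (null) and NaN as missing."""
    return v is None or math.isnan(v)


def fit_imputer(values: Sequence[Optional[float]], strategy: str = "mean") -> float:
    """Learn the fill value from the non-missing entries."""
    present = [v for v in values if not is_missing(v)]
    if strategy == "mean":
        return mean(present)
    if strategy == "median":
        return median(present)
    raise ValueError(f"Unsupported strategy in this sketch: {strategy!r}")


def impute(values: Sequence[Optional[float]], fill: float) -> List[float]:
    """Replace missing entries with the learned fill value."""
    return [fill if is_missing(v) else v for v in values]


salaries = [50000.0, float("nan"), 70000.0, None, 60000.0]
fill = fit_imputer(salaries, strategy="median")
imputed = impute(salaries, fill)
print(fill)     # 60000.0
print(imputed)  # [50000.0, 60000.0, 70000.0, 60000.0, 60000.0]
```

Because the fill value is learned during fit, the same value is reused when the model later transforms test data.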

Transformer vs. Estimator Cheat Sheet

Transformers (no fit needed)

VectorAssembler
Bucketizer
SQLTransformer
Tokenizer
Binarizer

Estimators (fit required)

StringIndexer
OneHotEncoder
StandardScaler
MinMaxScaler
Imputer

Practice Questions

Question 1

A dataset has a "color" column with values "red", "blue", and "green". You need to convert this to a numeric representation for a logistic regression model. Which TWO Spark ML classes should you use in sequence?

A) VectorAssembler then StandardScaler
B) StringIndexer then OneHotEncoder
C) Bucketizer then VectorAssembler
D) Imputer then StringIndexer

Answer: B — First, StringIndexer converts strings to numeric indices ("red"->0, "blue"->1, "green"->2). Then, OneHotEncoder converts the index to a sparse binary vector. This two-step process is the standard pattern for encoding categorical features in Spark ML.

Question 2

OneHotEncoder with dropLast=True encodes 4 categories. How many dimensions does the output vector have?

A) 4
B) 3
C) 5
D) 2

Answer: B — With dropLast=True (the default), OneHotEncoder produces n-1 dimensions for n categories. For 4 categories, the output vector has 3 dimensions. The last category is represented by all zeros, avoiding multicollinearity in linear models.

Question 3

Which Spark ML class is used to combine multiple numeric columns into a single feature vector column?

A) StringIndexer
B) OneHotEncoder
C) VectorAssembler
D) StandardScaler

Answer: C — VectorAssembler combines multiple columns (numeric, boolean, or vector) into a single vector column. Spark ML algorithms require a single features column, so VectorAssembler is used in nearly every ML pipeline.

Question 4

A dataset has missing salary values (NaN). Which Spark ML class should you use to fill missing values with the column median?

A) VectorAssembler
B) Bucketizer
C) Imputer with strategy="median"
D) StringIndexer

Answer: C — Imputer replaces NaN values with the computed statistic (mean, median, or mode). Setting strategy="median" fills missing values with the column's median. VectorAssembler and Bucketizer do not handle missing values. StringIndexer is for categorical data.

Question 5

Is Bucketizer a Transformer or an Estimator?

A) Estimator — it learns bucket boundaries from data
B) Transformer — it applies user-specified boundaries directly without learning
C) Evaluator — it measures binning quality
D) Pipeline — it chains multiple stages

Answer: B — Bucketizer is a Transformer because the bucket boundaries (splits) are specified by the user, not learned from data. It directly transforms continuous values into discrete bins using the provided boundaries. No fit step is needed.
