Feature Engineering
Master Spark's feature transformation toolkit — VectorAssembler, StringIndexer, OneHotEncoder, Bucketizer, StandardScaler, and Imputer. Know which class to use for each scenario.
Feature Transformation Classes
Spark ML provides a rich set of feature transformation classes. The exam frequently tests which class to use for a given scenario and whether each is a Transformer or Estimator.
VectorAssembler (Transformer)
Combines multiple columns into a single feature vector column. Required because Spark ML algorithms expect a single features column containing a vector.
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=["age", "salary", "years_experience"],
    outputCol="features"
)
df_assembled = assembler.transform(df)
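To make the per-row result concrete, here is a plain-Python sketch of the concatenation VectorAssembler performs. It is not Spark code: `assemble` is a hypothetical helper, and a Python list stands in for Spark's vector type.

```python
# Plain-Python illustration only; `assemble` is a hypothetical helper, not a Spark API.
def assemble(row, input_cols):
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # A vector-like column is flattened into the output in place.
            features.extend(float(v) for v in value)
        else:
            # Numeric and boolean columns each contribute one element.
            features.append(float(value))
    return features

row = {"age": 30, "salary": 85000.0, "years_experience": 7}
features = assemble(row, ["age", "salary", "years_experience"])
print(features)  # [30.0, 85000.0, 7.0]
```

The element order in the output follows the order of `inputCols`, which matters when you later interpret model coefficients.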
StringIndexer (Estimator)
Converts a string column to a numeric index column. Maps each distinct string to an index ordered by descending frequency by default, so the most frequent label gets 0.0.
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="city", outputCol="cityIndex")
model = indexer.fit(df) # Learns the mapping
df_indexed = model.transform(df) # Applies the mapping
# Example, assuming "New York" is the most frequent label and "Tokyo" the least:
# "New York" -> 0.0, "London" -> 1.0, "Tokyo" -> 2.0
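The fit step is what makes StringIndexer an Estimator: it has to count labels before it can assign indices. A minimal plain-Python sketch of that frequency-based mapping follows. `fit_string_index` is a hypothetical helper, and the alphabetical tie-break reflects the documented Spark 3.x behavior; treat it as an assumption for older versions.

```python
from collections import Counter

# Plain-Python illustration only; `fit_string_index` is a hypothetical helper.
def fit_string_index(values):
    counts = Counter(values)
    # Most frequent label first; equal counts are ordered alphabetically (assumed tie-break).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

cities = ["New York", "London", "New York", "Tokyo", "London", "New York"]
mapping = fit_string_index(cities)          # the "fit" step learns the mapping
indexed = [mapping[c] for c in cities]      # the "transform" step applies it
print(mapping)  # {'New York': 0.0, 'London': 1.0, 'Tokyo': 2.0}
```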
OneHotEncoder (Estimator)
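Converts a numeric index column to a sparse binary vector. Typically used after StringIndexer for categorical features in linear models. In Spark 3.0 and later it is an Estimator; Spark 2.x offered the same behavior under the name OneHotEncoderEstimator.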
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder(inputCol="cityIndex", outputCol="cityVec")
model = encoder.fit(df_indexed)
df_encoded = model.transform(df_indexed)
# With the default dropLast=True and 3 categories (shown as dense vectors):
# cityIndex 0 -> [1, 0], cityIndex 1 -> [0, 1], cityIndex 2 -> [0, 0]
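The effect of dropLast is easy to get wrong on the exam, so here is a plain-Python sketch of the encoding rule. `one_hot` is a hypothetical helper and it returns dense lists, whereas Spark emits sparse vectors.

```python
# Plain-Python illustration only; `one_hot` is a hypothetical helper, not a Spark API.
def one_hot(index, num_categories, drop_last=True):
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} is outside [0, {num_categories})")
    # With drop_last=True the vector has n-1 slots and the last category is all zeros.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

encoded = [one_hot(i, 3) for i in range(3)]
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```

Passing `drop_last=False` in this sketch yields the full three-element vectors, which corresponds to setting dropLast=False on the encoder.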
Bucketizer (Transformer)
Converts continuous values into discrete bins based on specified boundaries.
from pyspark.ml.feature import Bucketizer
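bucketizer = Bucketizer(
    # Age bins: [0, 18), [18, 35), [35, 55), [55, 100]; only the last bin includes its upper bound.
    # Values outside 0..100 raise an error by default; use float("-inf") / float("inf") as outer splits to cover all values.
    splits=[0, 18, 35, 55, 100],
    inputCol="age",
    outputCol="ageBucket"
)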
df_bucketed = bucketizer.transform(df)
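The boundary rules can be sketched in plain Python. `bucketize` is a hypothetical helper that mimics the default behavior as described above: each bin is closed on the left and open on the right, the last bin also includes its upper bound, and out-of-range values are an error.

```python
from bisect import bisect_right

# Plain-Python illustration only; `bucketize` is a hypothetical helper, not a Spark API.
def bucketize(value, splits):
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside [{splits[0]}, {splits[-1]}]")
    if value == splits[-1]:
        # The final bucket includes its upper bound.
        return float(len(splits) - 2)
    # Count how many split points are <= value, then shift to a zero-based bucket index.
    return float(bisect_right(splits, value) - 1)

splits = [0, 18, 35, 55, 100]
buckets = [bucketize(age, splits) for age in [5, 18, 34.9, 55, 100]]
print(buckets)  # [0.0, 1.0, 1.0, 3.0, 3.0]
```

Nothing here is learned from data, which is why Bucketizer needs no fit step.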
StandardScaler (Estimator)
Standardizes features by subtracting the mean and dividing by the standard deviation (z-score normalization). Note that withMean defaults to False, so centering must be enabled explicitly.
from pyspark.ml.feature import StandardScaler
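scaler = StandardScaler(
    inputCol="features",
    outputCol="scaledFeatures",
    withStd=True,   # divide by the standard deviation (default)
    withMean=True   # subtract the mean (off by default)
)
# Fit on the assembled DataFrame, since it is the one that has the "features" column
scaler_model = scaler.fit(df_assembled)  # Learns mean and std
df_scaled = scaler_model.transform(df_assembled)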
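As a worked example of the arithmetic, the sketch below standardizes one column in plain Python. The helper names are hypothetical, the column is assumed to have a nonzero standard deviation, and `statistics.stdev` is used because it is the sample (n-1) standard deviation.

```python
from statistics import mean, stdev

# Plain-Python illustration only; both helpers are hypothetical, not Spark APIs.
def fit_standard_scaler(column):
    # "fit": learn the statistics from the training data.
    return mean(column), stdev(column)

def scale(column, mu, sigma, with_mean=True, with_std=True):
    # "transform": apply the learned statistics. Assumes sigma > 0.
    out = []
    for x in column:
        if with_mean:
            x = x - mu
        if with_std:
            x = x / sigma
        out.append(x)
    return out

salaries = [40000.0, 50000.0, 60000.0]
mu, sigma = fit_standard_scaler(salaries)   # 50000.0, 10000.0
scaled = scale(salaries, mu, sigma)
print(scaled)  # [-1.0, 0.0, 1.0]
```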
Imputer (Estimator)
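Replaces missing values (null, or NaN by default) in numeric columns with the column's mean, median, or mode. The statistic is computed from the non-missing values only.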
from pyspark.ml.feature import Imputer
imputer = Imputer(
    inputCols=["age", "salary"],
    outputCols=["age_imputed", "salary_imputed"],
    strategy="median"  # or "mean" (the default), or "mode" (Spark 3.1+)
)
imputer_model = imputer.fit(df)
df_imputed = imputer_model.transform(df)
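A plain-Python sketch of median imputation shows why a fit step is needed: the fill value has to be computed from the observed data first. The helpers are hypothetical, and this sketch uses an exact median, whereas Spark computes the median approximately, so results on large columns may differ slightly.

```python
import math
from statistics import median

# Plain-Python illustration only; these helpers are hypothetical, not Spark APIs.
def is_missing(x):
    return x is None or (isinstance(x, float) and math.isnan(x))

def fit_imputer(column):
    # "fit": compute the statistic after filtering out missing values.
    observed = [x for x in column if not is_missing(x)]
    return median(observed)

def impute(column, fill):
    # "transform": replace every missing entry with the learned value.
    return [fill if is_missing(x) else x for x in column]

ages = [25.0, float("nan"), 40.0, None, 31.0]
fill = fit_imputer(ages)
imputed = impute(ages, fill)
print(fill, imputed)  # 31.0 [25.0, 31.0, 40.0, 31.0, 31.0]
```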
Transformer vs. Estimator Cheat Sheet
Transformers (no fit needed)
- VectorAssembler
- Bucketizer
- SQLTransformer
- Tokenizer
- Binarizer
Estimators (fit required)
- StringIndexer
- OneHotEncoder
- StandardScaler
- MinMaxScaler
- Imputer
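The split between the two lists comes down to whether a class must look at the data before it can transform it. The toy classes below sketch that contract in plain Python. They are hypothetical stand-ins named after Binarizer and MinMaxScaler, not the Spark implementations, and the scaler assumes the training data has more than one distinct value.

```python
# Plain-Python illustration only; all three classes are hypothetical stand-ins.
class ToyBinarizer:
    """Transformer: the threshold comes from the user, so there is nothing to learn."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, values):
        return [1.0 if v > self.threshold else 0.0 for v in values]

class ToyMinMaxScalerModel:
    """The fitted model is itself a Transformer that carries the learned min and max."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def transform(self, values):
        span = self.hi - self.lo  # assumes hi > lo
        return [(v - self.lo) / span for v in values]

class ToyMinMaxScaler:
    """Estimator: fit() scans the data and returns a model."""
    def fit(self, values):
        return ToyMinMaxScalerModel(min(values), max(values))

train = [10.0, 20.0, 30.0]
binarized = ToyBinarizer(threshold=15.0).transform(train)   # no fit needed
model = ToyMinMaxScaler().fit(train)                        # fit required
rescaled = model.transform([10.0, 25.0, 30.0])
print(binarized, rescaled)  # [0.0, 1.0, 1.0] [0.0, 0.75, 1.0]
```

A quick test for any class on the exam: if its parameters fully determine the output for a single row, it is a Transformer; if it needs a statistic from the whole column, it is an Estimator.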
Practice Questions
Question 1
A) VectorAssembler then StandardScaler
B) StringIndexer then OneHotEncoder
C) Bucketizer then VectorAssembler
D) Imputer then StringIndexer
Answer: B — First, StringIndexer converts strings to numeric indices ("red"->0, "blue"->1, "green"->2). Then, OneHotEncoder converts the index to a sparse binary vector. This two-step process is the standard pattern for encoding categorical features in Spark ML.
Question 2
A) 4
B) 3
C) 5
D) 2
Answer: B — With dropLast=True (the default), OneHotEncoder produces n-1 dimensions for n categories. For 4 categories, the output vector has 3 dimensions. The last category is represented by all zeros, avoiding multicollinearity in linear models.
Question 3
A) StringIndexer
B) OneHotEncoder
C) VectorAssembler
D) StandardScaler
Answer: C — VectorAssembler combines multiple columns (numeric, boolean, or vector) into a single vector column. Spark ML algorithms require a single features column, so VectorAssembler is used in nearly every ML pipeline.
Question 4
A) VectorAssembler
B) Bucketizer
C) Imputer with strategy="median"
D) StringIndexer
Answer: C — Imputer replaces NaN values with the computed statistic (mean, median, or mode). Setting strategy="median" fills missing values with the column's median. VectorAssembler and Bucketizer do not handle missing values. StringIndexer is for categorical data.
Question 5
A) Estimator — it learns bucket boundaries from data
B) Transformer — it applies user-specified boundaries directly without learning
C) Evaluator — it measures binning quality
D) Pipeline — it chains multiple stages
Answer: B — Bucketizer is a Transformer because the bucket boundaries (splits) are specified by the user, not learned from data. It directly transforms continuous values into discrete bins using the provided boundaries. No fit step is needed.