Feature Engineering
Master Databricks Feature Store, Spark ML pipelines, and data preparation techniques — covering approximately 20% of the Databricks ML Professional exam.
Databricks Feature Store
The Databricks Feature Store is a centralized repository for storing, discovering, and sharing features across ML projects. It is a critical exam topic because it solves the feature consistency problem — ensuring training and serving use the same feature computation logic.
Key Concepts
- Feature Table — A Delta table managed by the Feature Store with a primary key and optional timestamp key for point-in-time lookups
- Feature Lookup — Joining feature tables with training data using `FeatureLookup` objects at training time
- Online Store — A low-latency store (e.g., Amazon DynamoDB, Azure Cosmos DB) synced from the offline Feature Store for real-time serving
- Feature Freshness — How recently the features were computed; critical for time-sensitive models
Creating a Feature Table
The exam tests your knowledge of the Feature Store API. Here is the standard pattern:
```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Create a feature table from a Spark DataFrame
fs.create_table(
    name="ml_features.customer_features",
    primary_keys=["customer_id"],
    timestamp_keys=["event_date"],  # for point-in-time lookups
    df=customer_features_df,
    description="Customer behavioral features for churn prediction"
)
```
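Updating an existing table is done with `fs.write_table()`, whose `mode` can be `"merge"` (upsert) or `"overwrite"`. A minimal plain-Python sketch of the two modes — the dict stands in for a Delta feature table keyed on its primary key, and `merge_write`/`overwrite_write` are illustrative helpers, not Feature Store APIs:

```python
def merge_write(table, updates):
    """mode="merge": upsert -- update matching primary keys, insert new ones."""
    merged = dict(table)
    merged.update(updates)
    return merged

def overwrite_write(table, updates):
    """mode="overwrite": the new data fully replaces the existing table."""
    return dict(updates)

existing = {"c1": {"total_purchases": 5}, "c2": {"total_purchases": 9}}
updates = {"c2": {"total_purchases": 12}, "c3": {"total_purchases": 1}}

print(merge_write(existing, updates))      # keeps c1, updates c2, adds c3
print(overwrite_write(existing, updates))  # only c2 and c3 survive
```

Merge preserves rows whose keys are absent from the update; overwrite discards them — which is exactly why the exam likes to contrast the two.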
Remember the distinction: `create_table()` creates a new feature table, while `write_table()` updates an existing one. The `mode` parameter in `write_table()` can be `"merge"` (upsert) or `"overwrite"`. The exam frequently tests this distinction.

Training with Feature Lookups
Instead of manually joining features, the Feature Store handles lookups automatically:
```python
from databricks.feature_store import FeatureLookup

feature_lookups = [
    FeatureLookup(
        table_name="ml_features.customer_features",
        feature_names=["total_purchases", "avg_session_duration"],
        lookup_key="customer_id"
    ),
    FeatureLookup(
        table_name="ml_features.product_features",
        feature_names=["category_popularity", "price_percentile"],
        lookup_key="product_id"
    )
]

training_set = fs.create_training_set(
    df=labels_df,
    feature_lookups=feature_lookups,
    label="churned"
)
training_df = training_set.load_df()
```
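Under the hood, the lookup is a left join from the labels DataFrame onto each feature table, so labels with no matching key end up with null feature values. A plain-Python sketch of that semantics (table contents and names are illustrative):

```python
feature_table = {
    "c1": {"total_purchases": 5, "avg_session_duration": 3.2},
    "c2": {"total_purchases": 9, "avg_session_duration": 1.1},
}
labels = [("c1", 0), ("c2", 1), ("c3", 0)]  # c3 has no feature row

training_rows = []
for customer_id, churned in labels:
    feats = feature_table.get(customer_id, {})  # left join: missing -> empty
    training_rows.append({
        "customer_id": customer_id,
        "total_purchases": feats.get("total_purchases"),  # None if missing
        "avg_session_duration": feats.get("avg_session_duration"),
        "churned": churned,
    })

print(training_rows[2])  # c3 keeps its label but has None features
```

Note that the unmatched row is kept, not dropped — handling the resulting nulls (imputation or filtering) is left to the data scientist.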
Publishing to Online Store
For real-time serving, features must be published to an online store:
```python
# Publish feature table to online store
fs.publish_table(
    name="ml_features.customer_features",
    online_store=online_store_spec,
    filter_condition="event_date >= '2026-01-01'"
)
```
Note that training always reads features from the offline feature tables via `create_training_set()`, NOT the online store; the online store is used only for real-time inference.

Spark ML Pipelines
Spark ML pipelines provide a standardized way to chain feature transformations and model training into a single, reproducible workflow.
Pipeline Components
- Transformer — Takes a DataFrame and produces a new DataFrame (e.g., `VectorAssembler`, or a fitted `StringIndexerModel`)
- Estimator — An algorithm that can be `fit()` on a DataFrame to produce a Transformer (e.g., `StringIndexer`, or `StandardScaler` before fitting)
- Pipeline — A sequence of Transformers and Estimators chained together
- PipelineModel — A fitted Pipeline that can `transform()` new data
```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

# Define pipeline stages
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
assembler = VectorAssembler(
    inputCols=["category_idx", "amount", "frequency"],
    outputCol="features_raw"
)
scaler = StandardScaler(inputCol="features_raw", outputCol="features")

pipeline = Pipeline(stages=[indexer, assembler, scaler])

# Fit the pipeline
pipeline_model = pipeline.fit(train_df)

# Transform new data
transformed_df = pipeline_model.transform(test_df)
```
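The Estimator/Transformer contract behind this API can be sketched in plain Python — a toy scaler working on lists, not the Spark implementation, with illustrative class names:

```python
class StandardScalerEstimator:
    """Estimator: fit() learns statistics and returns a Transformer."""
    def fit(self, values):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return StandardScalerModel(mean, var ** 0.5)

class StandardScalerModel:
    """Transformer: transform() applies the learned statistics to new data."""
    def __init__(self, mean, std):
        self.mean, self.std = mean, std
    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

# fit() on the Estimator returns a fitted Transformer (the "Model")
model = StandardScalerEstimator().fit([1.0, 2.0, 3.0])
print(model.transform([2.0]))  # [0.0] -- the training mean maps to zero
```

The same shape holds one level up: a `Pipeline` plays the Estimator role and its `fit()` returns a `PipelineModel` in the Transformer role.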
A Pipeline is an Estimator. When you call `pipeline.fit(df)`, it returns a PipelineModel, which is a Transformer. This distinction appears frequently in exam questions about the ML pipeline API.

Data Preparation Techniques
Handling Missing Values
```python
# Imputation with Spark
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["age", "income"],
    outputCols=["age_imputed", "income_imputed"],
    strategy="median"  # "mean", "median", or "mode"
)

# Imputer is an Estimator: fit() learns the statistics, transform() fills nulls
imputed_df = imputer.fit(train_df).transform(train_df)
```
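The arithmetic behind the median strategy, sketched in plain Python (Spark's `Imputer` computes the statistic over non-null values only, then fills the nulls with it; the helper name is illustrative):

```python
def impute_median(values):
    """Fill None entries with the median of the non-null values."""
    observed = sorted(v for v in values if v is not None)
    mid = len(observed) // 2
    median = (observed[mid] if len(observed) % 2
              else (observed[mid - 1] + observed[mid]) / 2)
    return [median if v is None else v for v in values]

print(impute_median([34, None, 29, 51, None]))  # [34, 34, 29, 51, 34]
```

Median is the common choice for skewed numeric columns like income, since the mean is pulled by outliers.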
Encoding Categorical Variables
- StringIndexer — Maps string values to numeric indices (frequency-based ordering)
- OneHotEncoder — Converts indexed values to binary vectors (use for low-cardinality features)
- Feature Hashing — Maps high-cardinality features to fixed-size vectors using `FeatureHasher`
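The hashing trick behind `FeatureHasher` can be sketched in plain Python. Spark uses MurmurHash3 internally; this sketch substitutes the stdlib's `zlib.crc32` (an assumption for portability) purely to illustrate mapping arbitrarily many categories into a fixed-size vector:

```python
import zlib

def hash_feature(value, num_features=8):
    """Map a category string to a fixed-size indicator vector via hashing."""
    vec = [0] * num_features
    idx = zlib.crc32(value.encode()) % num_features  # deterministic bucket
    vec[idx] = 1
    return vec

# 50,000 distinct categories would still produce num_features-dim vectors
print(hash_feature("electronics"))
```

The trade-off: no dictionary of categories needs to be stored or fitted, but distinct categories can collide into the same bucket, so `num_features` should be sized generously relative to the cardinality.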
Feature Scaling
- StandardScaler — Standardizes features to zero mean and unit variance (use with linear models, neural nets)
- MinMaxScaler — Scales features to [0, 1] range (use when you need bounded values)
- MaxAbsScaler — Scales by dividing by max absolute value (preserves sparsity)
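The three scalers differ only in the statistic they normalize by; a plain-Python sketch of the arithmetic on a single column (helper names are illustrative, not the Spark API):

```python
def standard_scale(xs):
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]        # zero mean, unit variance

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]    # bounded to [0, 1]

def max_abs_scale(xs):
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]                   # zeros stay zero (sparsity)

xs = [-2.0, 0.0, 2.0]
print(min_max_scale(xs))   # [0.0, 0.5, 1.0]
print(max_abs_scale(xs))   # [-1.0, 0.0, 1.0]
```

Note how `max_abs_scale` leaves the zero entry at exactly zero, which is why it is the safe choice for sparse vectors, whereas standard scaling (with mean centering) would destroy sparsity.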
Point-in-Time Lookups
For time-series and event-driven features, point-in-time correctness prevents data leakage:
```python
# Feature table with timestamp key enables point-in-time lookups
fs.create_table(
    name="ml_features.user_daily_stats",
    primary_keys=["user_id"],
    timestamp_keys=["date"],  # ensures no future data leaks into training
    df=daily_stats_df
)

# At training time, the Feature Store automatically performs
# as-of joins using the timestamp key
```
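The as-of join semantics can be sketched in plain Python for a single user — feature rows are `(date, value)` pairs, and the lookup picks the most recent row at or before the label's event time (names and data are illustrative):

```python
def as_of_lookup(feature_rows, event_date):
    """Return the latest feature value with date <= event_date, else None."""
    eligible = [(d, v) for d, v in feature_rows if d <= event_date]
    return max(eligible)[1] if eligible else None

# ISO date strings compare correctly as plain strings
user_daily_stats = [("2026-01-01", 3), ("2026-01-05", 7), ("2026-01-09", 2)]

print(as_of_lookup(user_daily_stats, "2026-01-06"))  # 7 (ignores future row)
print(as_of_lookup(user_daily_stats, "2025-12-31"))  # None (no feature yet)
```

The 2026-01-09 row is never joined to a 2026-01-06 label — that exclusion of future rows is precisely what prevents leakage.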
When a feature table defines the `timestamp_keys` parameter, the Feature Store will only join features that were available at or before the event time. Without this, you risk using future data during training.

Practice Questions
Question 1 — Feature Store
A team has an existing feature table created with `primary_keys=["user_id"]` and `timestamp_keys=["event_date"]`. They need to update the table with new daily features. Which approach is correct?
A) `fs.create_table(name="...", df=new_features_df)`
B) `fs.write_table(name="...", df=new_features_df, mode="merge")`
C) `fs.write_table(name="...", df=new_features_df, mode="overwrite")`
D) `fs.update_table(name="...", df=new_features_df)`
Answer: B — `write_table()` with `mode="merge"` performs an upsert based on the primary key, adding new rows and updating existing ones. Option A would fail because the table already exists. Option C would delete all existing data. Option D does not exist in the API.
Question 2 — Feature Lookup
When calling `fs.create_training_set()` with feature lookups, what happens if a row in the labels DataFrame has no matching entry in the feature table?
A) The row is dropped from the training set
B) The feature columns are filled with null values
C) An error is raised
D) The row is duplicated with default values
Answer: B — The Feature Store performs a left join from the labels DataFrame to the feature tables. If no match is found, the feature columns are populated with null values. It is the data scientist's responsibility to handle these nulls (e.g., with imputation or filtering).
Question 3 — Spark ML Pipeline
What type of object is returned by calling `pipeline.fit(train_df)`?
A) Pipeline
B) PipelineModel
C) LogisticRegressionModel
D) Transformer
Answer: B — Calling `fit()` on a Pipeline (which is an Estimator) returns a PipelineModel (which is a Transformer). The PipelineModel contains the fitted versions of each stage. While a PipelineModel is technically a Transformer, the most specific correct answer is PipelineModel.
Question 4 — Point-in-Time
A feature table is created with `timestamp_keys=["calc_date"]`. Training labels have an `event_date` column. When creating a training set, the Feature Store joins features where:
A) `calc_date = event_date`
B) `calc_date <= event_date` (most recent before or on event date)
C) `calc_date >= event_date` (first available after event date)
D) `calc_date` is ignored and the latest features are always used
Answer: B — Point-in-time lookups use an as-of join where the Feature Store selects the most recent feature row with `calc_date <= event_date`. This prevents data leakage by ensuring only features available at or before the event time are used for training.
Question 5 — Data Preparation
A categorical feature has 50,000 unique values. Which encoding approach is most scalable for training in Spark?
A) StringIndexer followed by OneHotEncoder
B) FeatureHasher with a fixed number of output features
C) Convert to pandas and use scikit-learn LabelEncoder
D) Store each category as a separate binary column manually
Answer: B — With 50,000 unique values, OneHotEncoder would create an extremely high-dimensional sparse vector (option A). FeatureHasher maps high-cardinality features to a fixed-size vector using hashing, which is memory-efficient and scalable. Option C breaks Spark parallelism, and option D is impractical at this cardinality.
Lilly Tech Systems