Intermediate

Feature Engineering

Master Databricks Feature Store, Spark ML pipelines, and data preparation techniques — covering approximately 20% of the Databricks ML Professional exam.

Databricks Feature Store

The Databricks Feature Store is a centralized repository for storing, discovering, and sharing features across ML projects. It is a critical exam topic because it solves the feature consistency problem — ensuring training and serving use the same feature computation logic.

Key Concepts

  • Feature Table — A Delta table managed by the Feature Store with a primary key and optional timestamp key for point-in-time lookups
  • Feature Lookup — Joining feature tables with training data using FeatureLookup objects at training time
  • Online Store — A low-latency store (e.g., Amazon DynamoDB, Azure Cosmos DB) synced from the offline Feature Store for real-time serving
  • Feature Freshness — How recently the features were computed; critical for time-sensitive models

Creating a Feature Table

The exam tests your knowledge of the Feature Store API. Here is the standard pattern:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Create a feature table from a Spark DataFrame
fs.create_table(
    name="ml_features.customer_features",
    primary_keys=["customer_id"],
    timestamp_keys=["event_date"],  # for point-in-time lookups
    df=customer_features_df,
    description="Customer behavioral features for churn prediction"
)
💡 Exam tip: Know the difference between create_table() (creates new) and write_table() (updates existing). The mode parameter in write_table() can be "merge" (upsert) or "overwrite". The exam frequently tests this distinction.
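The merge-versus-overwrite distinction can be sketched in plain Python (this is an illustration of the semantics, not the Feature Store API — the function and data names are invented), treating a feature table as a dict keyed by its primary key:

```python
def write_table(table, rows, mode, pk="customer_id"):
    """Mimic write_table semantics: "merge" upserts by primary key,
    "overwrite" replaces the whole table."""
    if mode == "overwrite":
        return {row[pk]: row for row in rows}
    if mode == "merge":
        merged = dict(table)                            # keep existing rows
        merged.update({row[pk]: row for row in rows})   # upsert new/changed rows
        return merged
    raise ValueError(f"unknown mode: {mode}")

existing = {"c1": {"customer_id": "c1", "total_purchases": 5}}
update = [{"customer_id": "c1", "total_purchases": 6},
          {"customer_id": "c2", "total_purchases": 1}]

# merge: c1 is updated, c2 is added, nothing is lost
assert write_table(existing, update, "merge") == {
    "c1": {"customer_id": "c1", "total_purchases": 6},
    "c2": {"customer_id": "c2", "total_purchases": 1},
}
# overwrite: only the new rows survive
assert write_table(existing, update, "overwrite").keys() == {"c1", "c2"}
```

The key point for the exam: merge never discards rows absent from the update, while overwrite replaces the entire table contents.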

Training with Feature Lookups

Instead of manually joining features, the Feature Store handles lookups automatically:

from databricks.feature_store import FeatureLookup

feature_lookups = [
    FeatureLookup(
        table_name="ml_features.customer_features",
        feature_names=["total_purchases", "avg_session_duration"],
        lookup_key="customer_id"
    ),
    FeatureLookup(
        table_name="ml_features.product_features",
        feature_names=["category_popularity", "price_percentile"],
        lookup_key="product_id"
    )
]

training_set = fs.create_training_set(
    df=labels_df,
    feature_lookups=feature_lookups,
    label="churned"
)

training_df = training_set.load_df()
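Under the hood, create_training_set() performs a left join from the labels DataFrame to each feature table: every label row is kept, and rows with no matching feature entry get null feature values. A plain-Python sketch of that behavior (illustrative only — not the Feature Store API, and the column names are invented):

```python
def left_join_features(labels, feature_table, key):
    """Left-join each label row against a feature table keyed on `key`;
    missing features become None rather than dropping the row."""
    joined = []
    for row in labels:
        features = feature_table.get(row[key], {})
        joined.append({**row, "total_purchases": features.get("total_purchases")})
    return joined

labels = [{"customer_id": "c1", "churned": 0},
          {"customer_id": "c9", "churned": 1}]   # c9 has no feature row
features = {"c1": {"total_purchases": 5}}

result = left_join_features(labels, features, "customer_id")
assert result[0]["total_purchases"] == 5
assert result[1]["total_purchases"] is None      # null-filled, row kept
```

Handling those nulls (imputation or filtering) is the data scientist's responsibility before training.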

Publishing to Online Store

For real-time serving, features must be published to an online store:

# Publish feature table to online store
fs.publish_table(
    name="ml_features.customer_features",
    online_store=online_store_spec,
    filter_condition="event_date >= '2026-01-01'"
)
Common exam trap: The online store is for serving (low-latency lookups), not training. Training always uses the offline Feature Store (Delta tables). If an exam question asks about training data, the answer involves create_training_set(), NOT the online store.

Spark ML Pipelines

Spark ML pipelines provide a standardized way to chain feature transformations and model training into a single, reproducible workflow.

Pipeline Components

  • Transformer — Takes a DataFrame and produces a new DataFrame via transform() (e.g., VectorAssembler, or a fitted StringIndexerModel)
  • Estimator — An algorithm that can be fit() on a DataFrame to produce a Transformer (e.g., StringIndexer and StandardScaler before fitting)
  • Pipeline — A sequence of Transformers and Estimators chained together
  • PipelineModel — A fitted Pipeline that can transform() new data

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

# Define pipeline stages
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
assembler = VectorAssembler(
    inputCols=["category_idx", "amount", "frequency"],
    outputCol="features_raw"
)
scaler = StandardScaler(inputCol="features_raw", outputCol="features")

pipeline = Pipeline(stages=[indexer, assembler, scaler])

# Fit the pipeline
pipeline_model = pipeline.fit(train_df)

# Transform new data
transformed_df = pipeline_model.transform(test_df)
💡 Exam tip: A Pipeline is an Estimator. When you call pipeline.fit(df), it returns a PipelineModel, which is a Transformer. This distinction appears frequently in exam questions about the ML pipeline API.
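The Estimator/Transformer contract can be mimicked in a few lines of plain Python (these are illustrative stand-ins, not Spark classes): an Estimator's fit() learns statistics from data and returns a fitted Transformer.

```python
class Transformer:
    def transform(self, data):
        raise NotImplementedError

class Estimator:
    def fit(self, data):
        raise NotImplementedError

class MeanScaler(Estimator):
    """Estimator: fit() learns the mean, then returns a fitted Transformer."""
    def fit(self, data):
        return MeanScalerModel(sum(data) / len(data))

class MeanScalerModel(Transformer):
    """Transformer produced by MeanScaler.fit(): centers data on the learned mean."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, data):
        return [x - self.mean for x in data]

scaler_model = MeanScaler().fit([1.0, 2.0, 3.0])   # fit() on an Estimator...
assert isinstance(scaler_model, Transformer)       # ...returns a Transformer
assert scaler_model.transform([4.0]) == [2.0]      # mean of 2.0 was learned
```

This is exactly the relationship the exam probes: Pipeline plays the role of MeanScaler (an Estimator), and PipelineModel plays the role of MeanScalerModel (a Transformer).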

Data Preparation Techniques

Handling Missing Values

# Imputation with Spark
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["age", "income"],
    outputCols=["age_imputed", "income_imputed"],
    strategy="median"  # "mean", "median", or "mode"
)

# Imputer is an Estimator: fit() learns the statistics, transform() fills nulls
imputed_df = imputer.fit(train_df).transform(train_df)

Encoding Categorical Variables

  • StringIndexer — Maps string values to numeric indices (frequency-based ordering)
  • OneHotEncoder — Converts indexed values to binary vectors (use for low-cardinality features)
  • Feature Hashing — Maps high-cardinality features to fixed-size vectors using FeatureHasher
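The idea behind feature hashing is easy to see in plain Python (a conceptual sketch, not Spark's FeatureHasher — the hash function here is zlib.crc32 as a stand-in for Spark's internal hash): each category is hashed into one of a fixed number of buckets, so even 50,000 distinct values map into a fixed-size vector, accepting occasional collisions as the trade-off.

```python
import zlib

def hash_feature(category, num_features=16):
    """Map a category string to a fixed-size indicator vector via hashing."""
    vec = [0] * num_features
    bucket = zlib.crc32(category.encode()) % num_features  # stand-in hash
    vec[bucket] += 1
    return vec

v = hash_feature("electronics")
assert len(v) == 16 and sum(v) == 1   # fixed dimensionality, one bucket set
```

Contrast this with one-hot encoding, where the vector length grows with the number of distinct categories.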

Feature Scaling

  • StandardScaler — Standardizes features to zero mean and unit variance (use with linear models, neural nets)
  • MinMaxScaler — Scales features to [0, 1] range (use when you need bounded values)
  • MaxAbsScaler — Scales by dividing by max absolute value (preserves sparsity)
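The three scaling formulas above can be written out in plain Python (an illustration of the math, not Spark code; note that Spark's StandardScaler defaults to withMean=False and uses the sample standard deviation, while this sketch centers and uses the population form):

```python
def standard_scale(xs):
    """Zero mean, unit variance (population std)."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

def min_max_scale(xs):
    """Scale to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def max_abs_scale(xs):
    """Divide by the max absolute value; zeros stay zero (preserves sparsity)."""
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]

xs = [-2.0, 0.0, 2.0]
assert min_max_scale(xs) == [0.0, 0.5, 1.0]
assert max_abs_scale(xs) == [-1.0, 0.0, 1.0]   # the 0.0 stays 0.0
assert abs(sum(standard_scale(xs))) < 1e-9     # centered on zero
```

The sparsity point is why MaxAbsScaler is preferred for sparse feature vectors: it never shifts zeros, whereas centering with StandardScaler would densify them.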

Point-in-Time Lookups

For time-series and event-driven features, point-in-time correctness prevents data leakage:

# Feature table with timestamp key enables point-in-time lookups
fs.create_table(
    name="ml_features.user_daily_stats",
    primary_keys=["user_id"],
    timestamp_keys=["date"],  # ensures no future data leaks into training
    df=daily_stats_df
)

# At training time, the Feature Store automatically performs
# as-of joins using the timestamp key
Data leakage alert: The exam frequently tests whether you understand point-in-time correctness. If a feature table was created with timestamp_keys, the Feature Store will only join feature values that were available at or before the event time. Without this, you risk training on future data.
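The as-of join that the Feature Store performs can be sketched in plain Python (illustrative logic only, not the actual implementation; the field names are invented): for each label event, pick the most recent feature row whose timestamp is at or before the event time.

```python
def as_of_lookup(feature_rows, event_date):
    """Return the most recent feature row with date <= event_date, else None."""
    eligible = [r for r in feature_rows if r["date"] <= event_date]
    return max(eligible, key=lambda r: r["date"]) if eligible else None

features = [{"date": "2026-01-01", "clicks": 3},
            {"date": "2026-01-05", "clicks": 7},
            {"date": "2026-01-09", "clicks": 2}]  # future relative to the event

row = as_of_lookup(features, event_date="2026-01-06")
assert row["clicks"] == 7            # latest row at or before the event
assert row["date"] <= "2026-01-06"   # the 2026-01-09 row never leaks in
```

This is exactly the behavior tested in the point-in-time practice question: calc_date <= event_date, taking the most recent eligible row.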

Practice Questions


Question 1 — Feature Store

A data science team has created a feature table with primary_keys=["user_id"] and timestamp_keys=["event_date"]. They need to update the table with new daily features. Which approach is correct?

A) fs.create_table(name="...", df=new_features_df)
B) fs.write_table(name="...", df=new_features_df, mode="merge")
C) fs.write_table(name="...", df=new_features_df, mode="overwrite")
D) fs.update_table(name="...", df=new_features_df)

Answer: B — write_table() with mode="merge" performs an upsert based on the primary key, adding new rows and updating existing ones. Option A would fail because the table already exists. Option C would delete all existing data. Option D does not exist in the API.

Question 2 — Feature Lookup

When using fs.create_training_set() with feature lookups, what happens if a row in the labels DataFrame has no matching entry in the feature table?

A) The row is dropped from the training set
B) The feature columns are filled with null values
C) An error is raised
D) The row is duplicated with default values

Answer: B — The Feature Store performs a left join from the labels DataFrame to the feature tables. If no match is found, the feature columns are populated with null values. It is the data scientist's responsibility to handle these nulls (e.g., with imputation or filtering).

Question 3 — Spark ML Pipeline

A Pipeline consists of [StringIndexer, VectorAssembler, LogisticRegression]. What type is returned when you call pipeline.fit(train_df)?

A) Pipeline
B) PipelineModel
C) LogisticRegressionModel
D) Transformer

Answer: B — Calling fit() on a Pipeline (which is an Estimator) returns a PipelineModel (which is a Transformer). The PipelineModel contains the fitted versions of each stage. While PipelineModel is technically a Transformer, the most specific correct answer is PipelineModel.

Question 4 — Point-in-Time

A feature table stores daily aggregated user metrics with timestamp_keys=["calc_date"]. Training labels have an event_date column. When creating a training set, the Feature Store joins features where:

A) calc_date = event_date
B) calc_date <= event_date (most recent before or on event date)
C) calc_date >= event_date (first available after event date)
D) calc_date is ignored and the latest features are always used

Answer: B — Point-in-time lookups use an as-of join where the Feature Store selects the most recent feature row with calc_date <= event_date. This prevents data leakage by ensuring only features available at or before the event time are used for training.

Question 5 — Data Preparation

A team needs to encode a categorical column with 50,000 unique values for use in a Spark ML pipeline. Which approach is most appropriate?

A) StringIndexer followed by OneHotEncoder
B) FeatureHasher with a fixed number of output features
C) Convert to pandas and use scikit-learn LabelEncoder
D) Store each category as a separate binary column manually

Answer: B — With 50,000 unique values, OneHotEncoder would create an extremely high-dimensional sparse vector (option A). FeatureHasher maps high-cardinality features to a fixed-size vector using hashing, which is memory-efficient and scalable. Option C breaks Spark parallelism, and option D is impractical at this cardinality.