Feature Engineering in Snowflake
Learn how to build robust feature pipelines using SQL transforms, Snowpark DataFrames, and the Snowflake Feature Store, all running inside Snowflake's compute engine. This section covers the feature-engineering material you need for exam readiness.
SQL-Based Feature Transforms
Snowflake's SQL engine is powerful for feature engineering. Many common ML feature transformations can be expressed as SQL, which is often the simplest and most performant approach.
Numerical Feature Transforms
-- Log transform to handle skewed distributions
SELECT
    CUSTOMER_ID,
    LN(TOTAL_SPEND + 1) AS LOG_SPEND,
    SQRT(VISIT_COUNT) AS SQRT_VISITS,
    (TOTAL_SPEND - AVG(TOTAL_SPEND) OVER ()) /
        STDDEV(TOTAL_SPEND) OVER () AS ZSCORE_SPEND
FROM CUSTOMER_DATA;
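The same transforms can be prototyped in plain Python to sanity-check expected values before running them in SQL. A quick sketch, not Snowflake code (the function name is mine); note it uses the population standard deviation, while Snowflake's STDDEV is the sample version, so small inputs differ slightly:

```python
import math

def numeric_features(spend_values):
    """Mirror the SQL transforms: LN(spend + 1) and a z-score over the column."""
    mean = sum(spend_values) / len(spend_values)
    # Population standard deviation (divide by n); Snowflake STDDEV divides by n-1
    std = math.sqrt(sum((v - mean) ** 2 for v in spend_values) / len(spend_values))
    return [
        {"log_spend": math.log(v + 1), "zscore_spend": (v - mean) / std}
        for v in spend_values
    ]

rows = numeric_features([10.0, 20.0, 30.0])
```

The middle value lands exactly at the mean, so its z-score is 0.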
-- Binning continuous values
SELECT
    CUSTOMER_ID,
    CASE
        WHEN AGE < 25 THEN 'young'
        WHEN AGE BETWEEN 25 AND 45 THEN 'middle'
        ELSE 'senior'
    END AS AGE_GROUP,
    NTILE(10) OVER (ORDER BY TOTAL_SPEND) AS SPEND_DECILE
FROM CUSTOMER_DATA;
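NTILE's bucketing rule is worth internalizing for the exam: sort by the ORDER BY column, then split the rows into n groups as evenly as possible, with earlier buckets taking any extras. A plain-Python sketch of that rule (the function name is mine; tie ordering is left unspecified, as in SQL):

```python
def ntile(values, n):
    """Assign each value a bucket 1..n the way NTILE(n) does: sort, then split
    into n groups as evenly as possible, earlier buckets taking the extras."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total = len(values)
    base, extra = divmod(total, n)
    buckets = [0] * total
    pos = 0
    for b in range(1, n + 1):
        size = base + (1 if b <= extra else 0)
        for _ in range(size):
            buckets[order[pos]] = b  # bucket is assigned by sorted position
            pos += 1
    return buckets

# 100 spend values in descending order, so the largest value is first
deciles = ntile(list(range(100, 0, -1)), 10)
```

With 100 rows and 10 buckets, each decile holds exactly 10 rows; the smallest value falls in decile 1 and the largest in decile 10.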
Window Functions for Temporal Features
Window functions are essential for creating time-based ML features without self-joins.
-- Rolling aggregations for time-series features
SELECT
    CUSTOMER_ID,
    ORDER_DATE,
    ORDER_AMOUNT,
    AVG(ORDER_AMOUNT) OVER (
        PARTITION BY CUSTOMER_ID
        ORDER BY ORDER_DATE
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS AVG_ORDER_7D,
    COUNT(*) OVER (
        PARTITION BY CUSTOMER_ID
        ORDER BY ORDER_DATE
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS ORDER_COUNT_30D,
    LAG(ORDER_DATE) OVER (
        PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE
    ) AS PREV_ORDER_DATE,
    DATEDIFF('day',
        LAG(ORDER_DATE) OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE),
        ORDER_DATE
    ) AS DAYS_SINCE_LAST_ORDER
FROM ORDERS;
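A frequent exam distinction: ROWS frames count physical rows, while RANGE frames include every row whose ORDER BY value falls inside a logical interval, so ties and gaps change the result. A plain-Python sketch of the two framings over toy (day, amount) data (function names are mine, not Snowflake code):

```python
def rows_window_sum(rows, preceding):
    """ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW: fixed count of rows."""
    return [
        sum(v for _, v in rows[max(0, i - preceding): i + 1])
        for i in range(len(rows))
    ]

def range_window_sum(rows, preceding):
    """RANGE BETWEEN <preceding> PRECEDING AND CURRENT ROW: include every row
    whose key lies in [current_key - preceding, current_key]."""
    return [
        sum(v for k, v in rows if key - preceding <= k <= key)
        for key, _ in rows
    ]

# (day, amount) pairs sorted by day; day 2 is missing and day 3 is duplicated
data = [(1, 10), (3, 20), (3, 30), (4, 40)]
rows_sums = rows_window_sum(data, 1)    # always a 2-row window
range_sums = range_window_sum(data, 1)  # a 1-day logical window, ties included
```

Note how the two tied day-3 rows get different ROWS results (30 vs 50) but the same RANGE result (50), because RANGE treats peers in the ORDER BY value identically.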
ROWS BETWEEN and RANGE BETWEEN are heavily tested. Know the difference: ROWS counts physical rows, while RANGE considers logical value ranges relative to the current row's ORDER BY value. For time-series ML features, ROWS is usually preferred because it guarantees predictable window sizes.
Categorical Feature Encoding
-- One-hot encoding with PIVOT
-- (missing categories pivot to NULL; wrap the output in COALESCE for 0/1 flags)
SELECT * FROM (
    SELECT CUSTOMER_ID, CATEGORY, 1 AS FLAG
    FROM PURCHASE_HISTORY
)
PIVOT (MAX(FLAG) FOR CATEGORY IN ('electronics', 'clothing', 'food')) AS p;
-- Label encoding with DENSE_RANK
SELECT
    CUSTOMER_ID,
    CITY,
    DENSE_RANK() OVER (ORDER BY CITY) - 1 AS CITY_ENCODED
FROM CUSTOMERS;
-- Target encoding (mean encoding)
SELECT
    c.CUSTOMER_ID,
    c.CITY,
    city_stats.AVG_TARGET AS CITY_TARGET_ENCODED
FROM CUSTOMERS c
JOIN (
    SELECT CITY, AVG(TARGET) AS AVG_TARGET
    FROM CUSTOMERS
    GROUP BY CITY
) city_stats ON c.CITY = city_stats.CITY;
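The mean-encoding logic reduces to a fit/apply pair, and in practice the per-category means should be learned on the training split only, then applied to validation and test data to avoid target leakage. A plain-Python sketch (function names are mine):

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn the mean target per category from training data only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

def apply_target_encoding(categories, encoding, default):
    """Encode categories; unseen categories fall back to a default
    (commonly the global target mean)."""
    return [encoding.get(c, default) for c in categories]

enc = fit_target_encoding(["NYC", "NYC", "LA"], [1.0, 0.0, 1.0])
encoded = apply_target_encoding(["NYC", "LA", "SF"], enc, default=0.5)
```

Here "NYC" encodes to its training mean of 0.5, "LA" to 1.0, and the unseen "SF" falls back to the default.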
Snowpark Feature Pipelines
For more complex transformations, Snowpark provides a programmatic Python API that builds on top of Snowflake SQL.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, avg, count, lag, datediff, lit, log
from snowflake.snowpark.window import Window

# Define window specs
customer_window = (
    Window.partition_by("CUSTOMER_ID")
    .order_by("ORDER_DATE")
    .rows_between(-6, Window.CURRENT_ROW)
)

# Build feature pipeline
df_features = (
    session.table("ORDERS")
    .with_column("AVG_ORDER_7D", avg("ORDER_AMOUNT").over(customer_window))
    .with_column(
        "DAYS_SINCE_LAST",
        datediff(
            "day",
            lag("ORDER_DATE").over(
                Window.partition_by("CUSTOMER_ID").order_by("ORDER_DATE")
            ),
            col("ORDER_DATE"),
        ),
    )
    .with_column("LOG_AMOUNT", log(lit(10), col("ORDER_AMOUNT") + lit(1)))
)

# Write features to table
df_features.write.mode("overwrite").save_as_table("CUSTOMER_FEATURES")
Snowflake Feature Store
The Snowflake Feature Store provides a centralized repository for managing, sharing, and serving ML features. It ensures consistency between training and inference.
- Feature Entity: The primary key that features are associated with (e.g., customer_id, product_id)
- Feature View: A logical grouping of features computed from source data, with a defined schema and refresh schedule
- Point-in-time correctness: The Feature Store handles time-travel to prevent data leakage by serving features as of a specific timestamp
- Feature pipelines: Automated workflows that compute and materialize features on a schedule
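Point-in-time correctness is the key idea here: for each training example, use only the latest feature value whose timestamp is at or before the example's event timestamp. A plain-Python sketch of that rule (a hypothetical helper, not the Feature Store API):

```python
def point_in_time_lookup(spine, feature_history):
    """spine: list of (entity_id, event_ts). feature_history: entity_id ->
    list of (feature_ts, value) sorted by feature_ts. Returns the as-of
    feature value for each spine row, or None if nothing existed yet."""
    results = []
    for entity_id, event_ts in spine:
        value = None
        for feature_ts, v in feature_history.get(entity_id, []):
            if feature_ts <= event_ts:
                value = v  # keep the latest value not after the event
            else:
                break  # history is sorted; anything later would leak the future
        results.append(value)
    return results

history = {"c1": [(1, 100), (5, 200), (9, 300)]}
asof = point_in_time_lookup([("c1", 4), ("c1", 9), ("c1", 0)], history)
```

An event at timestamp 4 sees only the value from timestamp 1; an event at timestamp 0 sees nothing, rather than leaking a future value.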
from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity

# Initialize Feature Store
fs = FeatureStore(
    session=session,
    database="ML_DB",
    name="ML_FEATURES",
    default_warehouse="ML_WH",
)

# Define an entity
customer_entity = Entity(name="CUSTOMER", join_keys=["CUSTOMER_ID"])
fs.register_entity(customer_entity)

# Create a Feature View from a DataFrame or SQL query
customer_fv = FeatureView(
    name="CUSTOMER_SPEND_FEATURES",
    entities=[customer_entity],
    feature_df=df_features,  # Snowpark DataFrame with feature logic
    refresh_freq="1 day",    # Refresh daily
    description="Customer spending behavior features",
)

# Register and materialize
fs.register_feature_view(feature_view=customer_fv, version="v1")

# Retrieve features for training (point-in-time join)
training_data = fs.retrieve_feature_values(
    spine_df=df_labels,  # DataFrame with entity keys and timestamps
    features=[customer_fv],
    spine_timestamp_col="EVENT_TIMESTAMP",
)
The spine_timestamp_col parameter triggers a point-in-time join: each training row receives only the feature values that existed as of its event timestamp, preventing future-data leakage automatically.
Handling Missing Data
The exam tests your knowledge of missing data strategies within Snowflake.
- SQL COALESCE / IFNULL: Replace NULLs with default values, e.g. COALESCE(column, 0)
- Conditional imputation: Use CASE statements to impute based on group statistics
- Window-based imputation: Fill with rolling averages using window functions
- Snowpark ML Imputer: The snowflake.ml.modeling.impute.SimpleImputer class provides mean, median, and most-frequent imputation
from snowflake.ml.modeling.impute import SimpleImputer

imputer = SimpleImputer(
    input_cols=["INCOME", "AGE"],
    output_cols=["INCOME_IMPUTED", "AGE_IMPUTED"],
    strategy="median",
)
df_imputed = imputer.fit(df_train).transform(df_train)
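The median strategy reduces to: learn the median of each column's non-missing training values, then substitute it for NULLs wherever they appear. A minimal plain-Python sketch of that behavior (function names are mine):

```python
def fit_median(values):
    """Learn the median of the non-missing training values."""
    present = sorted(v for v in values if v is not None)
    n = len(present)
    mid = n // 2
    return present[mid] if n % 2 else (present[mid - 1] + present[mid]) / 2

def impute(values, fill):
    """Replace missing values with the learned statistic."""
    return [fill if v is None else v for v in values]

median = fit_median([50.0, None, 70.0, 60.0])  # learned on "training" data
filled = impute([None, 55.0, None], median)    # reused on new data
```

As with scaling, the statistic is learned once on training data and reused at inference, so train and serve-time imputation stay consistent.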
Feature Scaling
Snowpark ML provides preprocessing transformers that mirror scikit-learn's API.
from snowflake.ml.modeling.preprocessing import (
    StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
)

# Standard scaling (z-score normalization)
scaler = StandardScaler(
    input_cols=["TOTAL_SPEND", "VISIT_COUNT"],
    output_cols=["SPEND_SCALED", "VISITS_SCALED"],
)
df_scaled = scaler.fit(df_train).transform(df_train)

# One-hot encoding
encoder = OneHotEncoder(
    input_cols=["CATEGORY"],
    output_cols=["CATEGORY_ENCODED"],
)
df_encoded = encoder.fit(df_train).transform(df_train)
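The fit/transform contract these transformers follow (and which the exam probes) can be sketched in plain Python: statistics are learned once on training data and reused unchanged at inference, so train-time and serve-time features match. The class below is a toy stand-in, not the Snowpark ML implementation:

```python
import math

class SketchStandardScaler:
    """Toy fit/transform pattern: fit learns mean/std from training data;
    transform reuses them, keeping training and inference scaling identical."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        self.std = math.sqrt(
            sum((v - self.mean) ** 2 for v in values) / len(values)
        )
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

scaler = SketchStandardScaler().fit([10.0, 20.0, 30.0])
train_scaled = scaler.transform([10.0, 20.0, 30.0])
new_scaled = scaler.transform([20.0])  # inference reuses training mean/std
```

A value equal to the training mean scales to 0 at inference, exactly as it did in training, because nothing is re-fit on the new batch.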
Practice Questions
Question 1
You need to compute a rolling aggregation (such as an order count over a trailing window) for each customer as an ML feature. What is the most efficient approach in Snowflake?
A) Extract data to pandas and use rolling window calculations
B) Use a SQL window function with ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
C) Create a stored procedure that loops through each customer
D) Use a self-join with a date range filter
Answer: B — SQL window functions are the most efficient approach for rolling aggregations in Snowflake. They execute natively on the warehouse engine with optimized parallel processing. Extracting to pandas (A) moves data out of Snowflake. Looping in a stored procedure (C) is procedural and slow. Self-joins (D) work but are less efficient and harder to maintain than window functions.
Question 2
What is the purpose of point-in-time correctness in the Snowflake Feature Store?
A) To improve query performance by caching features
B) To prevent data leakage by serving features as of a specific timestamp
C) To automatically select the best features for a model
D) To compress feature data for faster storage
Answer: B — Point-in-time correctness ensures that when you retrieve features for training, you only get feature values that were available at the time of each training example. Without this, future data can leak into training features, producing overly optimistic model performance that does not hold in production.
Question 3
You apply z-score normalization to numeric features during training. How should you ensure the same scaling is applied consistently at inference time?
A) Write a custom SQL query with hardcoded mean and standard deviation values
B) Use the snowflake.ml.modeling.preprocessing.StandardScaler, fit on training data, and reuse the fitted scaler for inference
C) Normalize each batch independently during inference
D) Skip normalization since Snowflake handles it automatically
Answer: B — The Snowpark ML StandardScaler fits on training data (learning mean and std) and can be saved and reused for inference, ensuring consistent scaling. Hardcoded values (A) are brittle. Independent normalization per batch (C) produces inconsistent features. Snowflake does not automatically normalize (D).
Question 4
Which Snowflake Feature Store concept defines a logical grouping of features together with their computation logic, schema, and refresh schedule?
A) Feature Entity
B) Feature View
C) Feature Pipeline
D) Feature Table
Answer: B — A Feature View is the core abstraction in the Snowflake Feature Store that defines a logical grouping of features, their computation logic, schema, and refresh schedule. An Entity defines the primary key (e.g., customer_id). Feature Pipeline is the process of computing features. Feature Table is not a specific Feature Store concept.
Lilly Tech Systems