Intermediate

Feature Engineering in Snowflake

Learn how to build robust feature pipelines using SQL transforms, Snowpark DataFrames, and the Snowflake Feature Store, all executed inside Snowflake's compute engine. These are core exam topics.

SQL-Based Feature Transforms

Snowflake's SQL engine is powerful for feature engineering. Many common ML feature transformations can be expressed as SQL, which is often the simplest and most performant approach.

Numerical Feature Transforms

-- Log transform to handle skewed distributions
SELECT
    CUSTOMER_ID,
    LN(TOTAL_SPEND + 1) AS LOG_SPEND,
    SQRT(VISIT_COUNT) AS SQRT_VISITS,
    (TOTAL_SPEND - AVG(TOTAL_SPEND) OVER()) /
        STDDEV(TOTAL_SPEND) OVER() AS ZSCORE_SPEND
FROM CUSTOMER_DATA;

-- Binning continuous values
SELECT
    CUSTOMER_ID,
    CASE
        WHEN AGE < 25 THEN 'young'
        WHEN AGE BETWEEN 25 AND 45 THEN 'middle'
        ELSE 'senior'
    END AS AGE_GROUP,
    NTILE(10) OVER (ORDER BY TOTAL_SPEND) AS SPEND_DECILE
FROM CUSTOMER_DATA;
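The arithmetic behind these transforms can be checked locally. The sketch below mirrors the three SQL transforms in plain Python on made-up spend values (the data is an assumption for illustration, not from the source tables); it uses quartiles rather than deciles for brevity, but the bucketing logic is the same as NTILE(10).

```python
import math
from statistics import mean, stdev

# Toy TOTAL_SPEND column standing in for CUSTOMER_DATA (assumed values).
spend = [120.0, 80.0, 300.0, 40.0, 60.0, 150.0, 90.0, 200.0]

# LN(TOTAL_SPEND + 1): compresses the long right tail of spend.
log_spend = [math.log(s + 1) for s in spend]

# Z-score over the whole column, as the OVER() window computes it
# (Snowflake's STDDEV is the sample standard deviation, like statistics.stdev).
mu, sigma = mean(spend), stdev(spend)
zscore = [(s - mu) / sigma for s in spend]

# NTILE-style binning: rank the rows, then split ranks into k equal buckets.
k = 4
order = sorted(range(len(spend)), key=lambda i: spend[i])
tile = [0] * len(spend)
for rank, i in enumerate(order):
    tile[i] = rank * k // len(spend) + 1  # buckets 1..k
```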

Window Functions for Temporal Features

Window functions are essential for creating time-based ML features without self-joins.

-- Rolling aggregations for time-series features
SELECT
    CUSTOMER_ID,
    ORDER_DATE,
    ORDER_AMOUNT,
    AVG(ORDER_AMOUNT) OVER (
        PARTITION BY CUSTOMER_ID
        ORDER BY ORDER_DATE
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS AVG_ORDER_7D,
    COUNT(*) OVER (
        PARTITION BY CUSTOMER_ID
        ORDER BY ORDER_DATE
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS ORDER_COUNT_30D,
    LAG(ORDER_DATE) OVER (
        PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE
    ) AS PREV_ORDER_DATE,
    DATEDIFF('day',
        LAG(ORDER_DATE) OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE),
        ORDER_DATE
    ) AS DAYS_SINCE_LAST_ORDER
FROM ORDERS;
💡
Exam tip: Window functions with ROWS BETWEEN and RANGE BETWEEN are heavily tested. Know the difference: ROWS counts physical rows, RANGE considers logical value ranges. For time-series ML features, ROWS is usually preferred for predictable window sizes.
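The ROWS-versus-RANGE distinction is easiest to see with a tied ORDER BY value. This toy sketch (assumed data, plain Python) computes a cumulative sum both ways: with ROWS each physical row gets its own frame edge, while with RANGE peer rows that share the same ORDER BY value fall into the same frame.

```python
from datetime import date

# Orders with a tied ORDER_DATE to expose the difference (toy data).
orders = [(date(2024, 1, 1), 10), (date(2024, 1, 2), 20), (date(2024, 1, 2), 30)]

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: physical rows only.
rows_cum, total = [], 0
for _, amount in orders:
    total += amount
    rows_cum.append(total)  # [10, 30, 60]

# RANGE with the same frame: peer rows (equal ORDER BY value) are included,
# so both Jan-2 rows end their frame at the same logical boundary.
range_cum = [sum(a for d, a in orders if d <= cur) for cur, _ in orders]
# [10, 60, 60]
```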

Categorical Feature Encoding

-- One-hot encoding with PIVOT
SELECT * FROM (
    SELECT CUSTOMER_ID, CATEGORY, 1 AS FLAG
    FROM PURCHASE_HISTORY
)
PIVOT (MAX(FLAG) FOR CATEGORY IN ('electronics', 'clothing', 'food'))
AS p;
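The pivot's effect can be sketched locally; the toy rows below are assumptions, not the real PURCHASE_HISTORY table. Note that the SQL PIVOT leaves NULL where a customer never bought a category, so wrap the pivoted columns in COALESCE(..., 0) if the model needs strict 0/1 flags — the sketch fills 0 directly.

```python
# Toy PURCHASE_HISTORY rows: (customer_id, category) — assumed data.
rows = [(1, "electronics"), (1, "food"), (2, "clothing"), (3, "food")]
categories = ["electronics", "clothing", "food"]
present = set(rows)

# One flag column per category, 0 where the customer has no purchase.
onehot = {
    cid: [int((cid, cat) in present) for cat in categories]
    for cid in sorted({c for c, _ in rows})
}
# onehot[1] == [1, 0, 1]
```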

-- Label encoding with DENSE_RANK
SELECT
    CUSTOMER_ID,
    CITY,
    DENSE_RANK() OVER (ORDER BY CITY) - 1 AS CITY_ENCODED
FROM CUSTOMERS;
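DENSE_RANK() - 1 over the sorted distinct values amounts to this small mapping, sketched here on assumed city names:

```python
# Toy CITY column; DENSE_RANK() - 1 assigns consecutive integers
# over the sorted distinct values.
cities = ["Oslo", "Paris", "Oslo", "Berlin"]
mapping = {city: code for code, city in enumerate(sorted(set(cities)))}
encoded = [mapping[c] for c in cities]  # [1, 2, 1, 0]
```

For consistent inference, this mapping must be persisted and reused; a city unseen in training has no code, so decide up front how unknown categories are handled.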

-- Target encoding (mean encoding)
SELECT
    c.CUSTOMER_ID,
    c.CITY,
    city_stats.AVG_TARGET AS CITY_TARGET_ENCODED
FROM CUSTOMERS c
JOIN (
    SELECT CITY, AVG(TARGET) AS AVG_TARGET
    FROM CUSTOMERS
    GROUP BY CITY
) city_stats ON c.CITY = city_stats.CITY;
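The per-city mean that the GROUP BY subquery produces is just this aggregation, sketched on assumed (CITY, TARGET) pairs. One caution: compute target encodings on the training split (or out-of-fold) only, since encoding with the full dataset leaks label information into the features.

```python
from collections import defaultdict

# Toy (CITY, TARGET) pairs from a training split (assumed data).
train = [("Oslo", 1), ("Oslo", 0), ("Paris", 1), ("Paris", 1), ("Berlin", 0)]

sums, counts = defaultdict(float), defaultdict(int)
for city, target in train:
    sums[city] += target
    counts[city] += 1

# Per-city target mean, as the GROUP BY subquery computes it.
city_encoded = {city: sums[city] / counts[city] for city in sums}
# {"Oslo": 0.5, "Paris": 1.0, "Berlin": 0.0}
```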

Snowpark Feature Pipelines

For more complex transformations, Snowpark provides a programmatic Python API that builds on top of Snowflake SQL.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col, datediff, lag, lit, log
from snowflake.snowpark.window import Window

# session = Session.builder.configs(connection_parameters).create()

# Define window specs
customer_window = Window.partition_by("CUSTOMER_ID") \
    .order_by("ORDER_DATE") \
    .rows_between(-6, Window.CURRENT_ROW)

# Build feature pipeline
df_features = session.table("ORDERS") \
    .with_column("AVG_ORDER_7D",
        avg("ORDER_AMOUNT").over(customer_window)) \
    .with_column("DAYS_SINCE_LAST",
        datediff("day",
            lag("ORDER_DATE").over(
                Window.partition_by("CUSTOMER_ID").order_by("ORDER_DATE")),
            col("ORDER_DATE"))) \
    .with_column("LOG_AMOUNT",
        log(lit(10), col("ORDER_AMOUNT") + lit(1)))

# Write features to table
df_features.write.mode("overwrite").save_as_table("CUSTOMER_FEATURES")

Snowflake Feature Store

The Snowflake Feature Store provides a centralized repository for managing, sharing, and serving ML features. It ensures consistency between training and inference.

📚
Key Feature Store concepts:
  • Feature Entity: The primary key that features are associated with (e.g., customer_id, product_id)
  • Feature View: A logical grouping of features computed from source data, with a defined schema and refresh schedule
  • Point-in-time correctness: The Feature Store handles time-travel to prevent data leakage by serving features as of a specific timestamp
  • Feature pipelines: Automated workflows that compute and materialize features on a schedule
from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity

# Initialize Feature Store
fs = FeatureStore(session=session, database="ML_DB", name="ML_FEATURES", default_warehouse="ML_WH")

# Define an entity
customer_entity = Entity(name="CUSTOMER", join_keys=["CUSTOMER_ID"])
fs.register_entity(customer_entity)

# Create a Feature View from a DataFrame or SQL query
customer_fv = FeatureView(
    name="CUSTOMER_SPEND_FEATURES",
    entities=[customer_entity],
    feature_df=df_features,  # Snowpark DataFrame with feature logic
    refresh_freq="1 day",     # Refresh daily
    description="Customer spending behavior features"
)

# Register and materialize
fs.register_feature_view(
    feature_view=customer_fv,
    version="v1"
)

# Retrieve features for training (point-in-time join)
training_data = fs.retrieve_feature_values(
    spine_df=df_labels,  # DataFrame with entity keys and timestamps
    features=[customer_fv],
    spine_timestamp_col="EVENT_TIMESTAMP"
)
💡
Exam focus: Point-in-time correctness is a critical concept. Without it, features computed from future data can leak into training, producing models that perform well in backtesting but fail in production. The Feature Store's spine_timestamp_col parameter prevents this leakage automatically.
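The point-in-time join itself is an "as-of" lookup: for each training example, take the latest feature snapshot whose timestamp is at or before the example's event timestamp. This plain-Python sketch (assumed snapshot values, not the Feature Store API) shows the rule the Feature Store applies:

```python
from datetime import date

# Feature snapshots: (customer_id, feature_ts, avg_order_7d) — assumed values.
features = [
    (1, date(2024, 1, 1), 50.0),
    (1, date(2024, 2, 1), 80.0),
]
# Spine: training examples with entity key and event timestamp.
spine = [(1, date(2024, 1, 15)), (1, date(2024, 2, 15))]

def as_of(customer_id, ts):
    """Latest feature value with feature_ts <= ts — never a future snapshot."""
    candidates = [(fts, val) for cid, fts, val in features
                  if cid == customer_id and fts <= ts]
    return max(candidates)[1] if candidates else None

training = [(cid, ts, as_of(cid, ts)) for cid, ts in spine]
# The Jan-15 example gets 50.0 (the Jan-1 snapshot), not the future Feb-1 value.
```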

Handling Missing Data

The exam tests your knowledge of missing data strategies within Snowflake.

  • SQL COALESCE / IFNULL: Replace NULLs with default values — COALESCE(column, 0)
  • Conditional imputation: Use CASE statements to impute based on group statistics
  • Window-based imputation: Fill with rolling averages using window functions
  • Snowpark ML Imputer: The snowflake.ml.modeling.impute.SimpleImputer class provides mean, median, and most-frequent imputation
from snowflake.ml.modeling.impute import SimpleImputer

imputer = SimpleImputer(
    input_cols=["INCOME", "AGE"],
    output_cols=["INCOME_IMPUTED", "AGE_IMPUTED"],
    strategy="median"
)
df_imputed = imputer.fit(df_train).transform(df_train)
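The fit/transform split that SimpleImputer follows can be sketched in plain Python on assumed income values: the statistic is learned once from the observed training values, then reused to fill NULLs everywhere.

```python
from statistics import median

# Toy INCOME column with NULLs (None) — assumed data.
income = [50_000, None, 72_000, None, 61_000]

# Fit: learn the median from observed training values only.
fitted = median(v for v in income if v is not None)  # 61000

# Transform: reuse the fitted statistic to fill NULLs, at training and
# inference alike, so both sides see identical imputation.
imputed = [v if v is not None else fitted for v in income]
```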

Feature Scaling

Snowpark ML provides preprocessing transformers that mirror scikit-learn's API.

from snowflake.ml.modeling.preprocessing import (
    StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
)

# Standard scaling (z-score normalization)
scaler = StandardScaler(
    input_cols=["TOTAL_SPEND", "VISIT_COUNT"],
    output_cols=["SPEND_SCALED", "VISITS_SCALED"]
)
df_scaled = scaler.fit(df_train).transform(df_train)

# One-hot encoding
encoder = OneHotEncoder(
    input_cols=["CATEGORY"],
    output_cols=["CATEGORY_ENCODED"]
)
df_encoded = encoder.fit(df_train).transform(df_train)
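Why fitting on training data matters is easy to show locally. In this sketch (assumed values), the mean and standard deviation come from the training column only, and the same fitted parameters are applied at inference time — which is exactly what reusing a fitted scaler guarantees.

```python
from statistics import mean, stdev

train = [10.0, 20.0, 30.0, 40.0]  # training values of one feature (assumed)
inference = [25.0]                # value arriving at inference time

# Fit: parameters come from the training data only.
mu, sigma = mean(train), stdev(train)

# Transform: the same fitted mu/sigma are applied on both sides.
train_scaled = [(x - mu) / sigma for x in train]
inference_scaled = [(x - mu) / sigma for x in inference]  # [0.0], since 25 == mu
```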

Practice Questions

Question 1

A data scientist is building time-series features for a churn prediction model. They need to calculate the average order amount over the last 30 days for each customer. Which approach is most efficient in Snowflake?

A) Extract data to pandas and use rolling window calculations
B) Use a SQL window function with ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
C) Create a stored procedure that loops through each customer
D) Use a self-join with a date range filter

Answer: B — SQL window functions are the most efficient approach for rolling aggregations in Snowflake. They execute natively on the warehouse engine with optimized parallel processing. Extracting to pandas (A) moves data out of Snowflake. Looping in a stored procedure (C) is procedural and slow. Self-joins (D) work but are less efficient and harder to maintain than window functions.

Question 2

What is the primary purpose of the Snowflake Feature Store's point-in-time join capability?

A) To improve query performance by caching features
B) To prevent data leakage by serving features as of a specific timestamp
C) To automatically select the best features for a model
D) To compress feature data for faster storage

Answer: B — Point-in-time correctness ensures that when you retrieve features for training, you only get feature values that were available at the time of each training example. Without this, future data can leak into training features, producing overly optimistic model performance that does not hold in production.

Question 3

A team needs to apply StandardScaler normalization to numeric features before training a model in Snowflake. They want the scaling to be consistent between training and inference. Which approach is recommended?

A) Write a custom SQL query with hardcoded mean and standard deviation values
B) Use the snowflake.ml.modeling.preprocessing.StandardScaler, fit on training data, and reuse the fitted scaler for inference
C) Normalize each batch independently during inference
D) Skip normalization since Snowflake handles it automatically

Answer: B — The Snowpark ML StandardScaler fits on training data (learning mean and std) and can be saved and reused for inference, ensuring consistent scaling. Hardcoded values (A) are brittle. Independent normalization per batch (C) produces inconsistent features. Snowflake does not automatically normalize (D).

Question 4

Which Snowflake Feature Store concept represents a logical grouping of features computed from source data with a defined schema and refresh schedule?

A) Feature Entity
B) Feature View
C) Feature Pipeline
D) Feature Table

Answer: B — A Feature View is the core abstraction in the Snowflake Feature Store that defines a logical grouping of features, their computation logic, schema, and refresh schedule. An Entity defines the primary key (e.g., customer_id). Feature Pipeline is the process of computing features. Feature Table is not a specific Feature Store concept.