Snowpark for ML
Master the Snowpark Python API for machine learning workloads — DataFrame operations, user-defined functions, vectorized UDFs, and stored procedures that execute directly inside Snowflake's compute engine.
What Is Snowpark?
Snowpark is Snowflake's developer framework that lets you write data processing and ML code in Python (also Java and Scala) that runs directly on Snowflake's compute infrastructure. Instead of extracting data to an external environment, Snowpark pushes your code to the data — eliminating data movement and leveraging Snowflake's elastic compute.
Snowpark Session
Every Snowpark interaction begins with a Session object. The session establishes the connection to Snowflake and serves as the entry point for all DataFrame operations.
from snowflake.snowpark import Session
connection_params = {
"account": "myaccount",
"user": "myuser",
"password": "mypassword",
"warehouse": "ML_WH",
"database": "ML_DB",
"schema": "FEATURES"
}
session = Session.builder.configs(connection_params).create()
Key session concepts for the exam:
- Lazy evaluation: DataFrame operations build a query plan but do not execute until an action (collect, show, count, write) is called
- Server-side execution: Transformations run on Snowflake's compute, not your local machine
- Warehouse selection: The warehouse determines the compute resources available for your ML workload
- Session scope: Each session has a default database, schema, and warehouse that can be changed at any time
Snowpark DataFrames
Snowpark DataFrames look similar to pandas but operate very differently. They represent a lazy query plan executed on Snowflake's engine.
# Read from a table
df = session.table("CUSTOMER_FEATURES")
# Select and filter
df_filtered = df.select("CUSTOMER_ID", "TOTAL_SPEND", "VISIT_COUNT") \
.filter(df["TOTAL_SPEND"] > 100)
# Aggregations (avg and count come from snowflake.snowpark.functions)
from snowflake.snowpark.functions import avg, count
df_agg = df.group_by("SEGMENT").agg(
avg("TOTAL_SPEND").alias("AVG_SPEND"),
count("CUSTOMER_ID").alias("CUSTOMER_COUNT")
)
# Join DataFrames
df_joined = df_features.join(df_labels, on="CUSTOMER_ID", how="inner")
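For intuition, here is a rough pandas equivalent of the pipeline above, using small hypothetical in-memory data. Unlike the Snowpark version, every step here executes eagerly in local memory the moment it is written:

```python
import pandas as pd

# Illustrative in-memory data (hypothetical values)
df = pd.DataFrame({
    "CUSTOMER_ID": [1, 2, 3, 4],
    "SEGMENT": ["A", "A", "B", "B"],
    "TOTAL_SPEND": [50.0, 150.0, 200.0, 300.0],
    "VISIT_COUNT": [2, 5, 4, 10],
})

# select + filter (eager: runs immediately)
df_filtered = df[["CUSTOMER_ID", "TOTAL_SPEND", "VISIT_COUNT"]]
df_filtered = df_filtered[df_filtered["TOTAL_SPEND"] > 100]

# group by + aggregate
df_agg = (
    df.groupby("SEGMENT")
      .agg(AVG_SPEND=("TOTAL_SPEND", "mean"),
           CUSTOMER_COUNT=("CUSTOMER_ID", "count"))
      .reset_index()
)
print(df_filtered["CUSTOMER_ID"].tolist())  # [2, 3, 4]
print(df_agg["AVG_SPEND"].tolist())         # [100.0, 250.0]
```

In Snowpark, the same chain would only build a query plan; nothing would run until an action such as show() or collect() is called.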
Warning: .to_pandas() pulls all data to the client — this defeats the purpose of server-side processing and can cause out-of-memory errors on large datasets. The exam tests whether you know to keep operations in Snowpark DataFrames as long as possible.
DataFrame vs. Pandas Comparison
Snowpark DataFrame
- Lazy evaluation (builds query plan)
- Executes on Snowflake warehouse
- Scales to any data size
- Column names are case-insensitive
- Uses Snowflake SQL functions
Pandas DataFrame
- Eager evaluation (executes immediately)
- Executes on local machine memory
- Limited by available RAM
- Column names are case-sensitive
- Uses NumPy-based operations
User-Defined Functions (UDFs)
UDFs let you extend Snowflake with custom Python logic. For ML, they are commonly used for feature transformations, custom scoring functions, and model inference.
Scalar UDFs
A scalar UDF processes one row at a time. It takes scalar inputs and returns a scalar output.
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType
@udf(return_type=FloatType(), input_types=[FloatType(), FloatType()])
def calculate_ratio(numerator, denominator):
if denominator == 0:
return 0.0
return numerator / denominator
# Use in DataFrame
df = df.with_column("SPEND_RATIO", calculate_ratio(df["TOTAL_SPEND"], df["VISIT_COUNT"]))
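Because the handler body is ordinary Python, you can sanity-check the logic locally before registering it as a UDF (a quick sketch, no Snowflake connection involved):

```python
# Same logic as the calculate_ratio UDF handler above, tested locally.
def calculate_ratio(numerator: float, denominator: float) -> float:
    # Guard against division by zero, as in the UDF
    if denominator == 0:
        return 0.0
    return numerator / denominator

print(calculate_ratio(150.0, 5.0))  # 30.0
print(calculate_ratio(150.0, 0.0))  # 0.0
```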
Vectorized UDFs (pandas UDFs)
Vectorized UDFs process data in batches using pandas Series, which is significantly faster for ML workloads. The exam frequently tests the difference between scalar and vectorized UDFs.
from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import PandasSeriesType, FloatType
import pandas as pd
@pandas_udf(return_type=PandasSeriesType(FloatType()),
input_types=[PandasSeriesType(FloatType())])
def normalize_feature(series: pd.Series) -> pd.Series:
return (series - series.mean()) / series.std()
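Since the vectorized handler operates on a whole pandas Series per batch, its logic can also be verified locally (a sketch with hypothetical values):

```python
import pandas as pd

# Same z-score logic as the normalize_feature handler above.
def normalize_feature(series: pd.Series) -> pd.Series:
    return (series - series.mean()) / series.std()

s = pd.Series([10.0, 20.0, 30.0])
print(normalize_feature(s).tolist())  # [-1.0, 0.0, 1.0]
```

One caveat: inside a vectorized UDF, mean() and std() are computed per batch, not over the whole column. For true global normalization, compute the statistics once (for example with a DataFrame aggregation) and pass them into the UDF as constants.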
When to Use Each UDF Type
- Scalar UDF: Simple row-level transformations, string manipulation, conditional logic
- Vectorized UDF: Numerical computations, ML inference, feature engineering on batches, anything that benefits from pandas/NumPy
- Table Function (UDTF): One-to-many transformations, exploding rows, generating multiple output rows per input
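As a rough sketch of the table-function pattern, a Python UDTF handler is a class whose process method yields zero or more output rows per input row. The class itself is plain Python and can be exercised locally; the names here are hypothetical, and in Snowpark the class would be registered with the udtf() decorator and an output schema:

```python
# Hypothetical UDTF-style handler: splits a comma-separated tag
# string into one output row per tag (a one-to-many transform).
class ExplodeTags:
    def process(self, customer_id: int, tags: str):
        # Yield one (customer_id, tag) row per tag in the input string
        for tag in tags.split(","):
            yield (customer_id, tag.strip())

handler = ExplodeTags()
rows = list(handler.process(42, "vip, churn-risk"))
print(rows)  # [(42, 'vip'), (42, 'churn-risk')]
```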
Stored Procedures for ML
Stored procedures run arbitrary Python code on Snowflake compute. Unlike UDFs, they are not called per-row — they execute as a standalone unit of work. This makes them ideal for ML training jobs.
from snowflake.snowpark.functions import sproc
from snowflake.snowpark.types import StringType
@sproc(return_type=StringType(), input_types=[StringType(), StringType()],
       packages=["snowflake-snowpark-python", "scikit-learn", "joblib"])
def train_model(session, table_name, model_name):
# Read training data
df = session.table(table_name).to_pandas()
# Train model
from sklearn.ensemble import RandomForestClassifier
X = df.drop("LABEL", axis=1)
y = df["LABEL"]
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
# Save model to stage
import joblib
joblib.dump(model, "/tmp/model.pkl")
# Disable auto-compression so the staged file keeps its .pkl name
session.file.put("/tmp/model.pkl", f"@models/{model_name}/", auto_compress=False)
return f"Model {model_name} trained successfully"
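If the procedure is registered with a name (the sproc decorator also accepts name and is_permanent arguments), it can be invoked from SQL as well; the table and model names below are illustrative:

```sql
CALL train_model('CUSTOMER_FEATURES', 'churn_rf_v1');
```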
Package Management
Snowpark provides a curated set of third-party packages available in the Anaconda channel for Snowflake. You can specify package dependencies when creating UDFs and stored procedures.
# Specify packages for a UDF
@udf(return_type=FloatType(),
input_types=[FloatType()],
packages=["scikit-learn==1.3.0", "numpy"])
def predict_score(feature_value):
import numpy as np
# Model inference logic
return float(np.log1p(feature_value))
# Check available packages
session.sql("SELECT * FROM information_schema.packages WHERE language = 'python'").show()
Stages for ML Artifacts
Snowflake stages are used to store ML artifacts such as trained models, configuration files, and datasets. Understanding stages is important for the ML workflow.
- Internal stages: Store files directly in Snowflake (named stages, table stages, user stages)
- External stages: Reference files in S3, Azure Blob, or GCS
- PUT/GET: Upload and download files between local filesystem and stages
- @~ (user stage): Personal storage area for each user
- @% (table stage): Associated with a specific table
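For example, moving a model artifact with PUT and GET (stage and file names are illustrative; PUT and GET run from a client such as SnowSQL, not from the web UI worksheet):

```sql
-- Upload a local model file to a named internal stage;
-- AUTO_COMPRESS = FALSE keeps the original .pkl filename
PUT file:///tmp/model.pkl @models/churn_rf_v1/ AUTO_COMPRESS = FALSE;

-- List the staged files
LIST @models/churn_rf_v1/;

-- Download the artifact back to the local filesystem
GET @models/churn_rf_v1/model.pkl file:///tmp/;
```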
Practice Questions
Question 1
You need to normalize a numeric feature column across a table with roughly 500 million rows as part of a feature engineering pipeline. Which approach is most efficient?
A) Create a scalar UDF with the normalization logic
B) Use a vectorized UDF (pandas_udf) with the normalization logic
C) Export the data to pandas using .to_pandas() and normalize locally
D) Write a stored procedure that processes the data in chunks
Answer: B — Vectorized UDFs process data in Arrow-based batches using pandas Series, making them 10-100x faster than scalar UDFs for numerical operations. Option C would fail or be extremely slow on 500M rows due to memory constraints. Option D adds unnecessary complexity when a vectorized UDF handles this natively.
Question 2
You need to train a scikit-learn model on data stored in a Snowflake table, evaluate it, and save the trained model to a stage, all within Snowflake. Which mechanism should you use?
A) A scalar UDF that trains the model on each row
B) A vectorized UDF that trains the model on each batch
C) A stored procedure that reads the data, trains the model, and uploads it to a stage
D) A table function (UDTF) that outputs model parameters
Answer: C — Stored procedures are designed for complex multi-step workflows like training pipelines: read data, train model, evaluate, save artifacts. UDFs (scalar or vectorized) are per-row/per-batch transforms within a query and are not suited for full training workflows. A UDTF could output parameters but cannot manage the full lifecycle.
Question 3
Which statement correctly describes how Snowpark DataFrames execute computation?
A) Snowpark DataFrames execute operations immediately like pandas
B) Snowpark DataFrames push computation to the Snowflake warehouse and use lazy evaluation
C) Snowpark DataFrames require data to be loaded into client memory before transformation
D) Snowpark DataFrames cannot perform joins or aggregations
Answer: B — Snowpark DataFrames use lazy evaluation, building a query plan that executes on the Snowflake warehouse only when an action (collect, show, count, write) is triggered. This is the fundamental difference from pandas: data stays in Snowflake, and computation scales with the warehouse.
Question 4
You want to use the XGBoost library inside a Snowpark UDF for model inference. How do you make it available to the UDF?
A) Install XGBoost on the local machine and it will automatically be available in Snowflake
B) Specify XGBoost in the packages parameter of the UDF decorator
C) Upload the XGBoost wheel file to a stage and import it manually
D) XGBoost is not supported in Snowpark UDFs
Answer: B — Snowpark provides a curated Anaconda channel with pre-approved packages, including XGBoost. You specify dependencies using the packages parameter in the UDF or stored procedure decorator, and Snowflake resolves and caches these packages automatically. While uploading custom packages to a stage is possible for unlisted libraries, XGBoost is available through the standard channel.
Question 5
What is the primary advantage of Snowpark for ML workloads compared to extracting data to an external ML platform?
A) Snowpark supports more ML algorithms than external platforms
B) Snowpark eliminates data movement by pushing compute to the data
C) Snowpark provides better visualization tools
D) Snowpark automatically selects the best ML algorithm
Answer: B — The core value proposition of Snowpark is eliminating data movement. Instead of extracting large datasets to external compute, Snowpark runs your Python/ML code directly on Snowflake's compute infrastructure. This reduces latency, improves security (data never leaves Snowflake), and leverages Snowflake's elastic scaling.
Lilly Tech Systems