Intermediate

Snowpark for ML

Master the Snowpark Python API for machine learning workloads — DataFrame operations, user-defined functions, vectorized UDFs, and stored procedures that execute directly inside Snowflake's compute engine.

What Is Snowpark?

Snowpark is Snowflake's developer framework that lets you write data processing and ML code in Python (also Java and Scala) that runs directly on Snowflake's compute infrastructure. Instead of extracting data to an external environment, Snowpark pushes your code to the data — eliminating data movement and leveraging Snowflake's elastic compute.

💡
Exam focus: The exam tests your understanding of WHEN to use Snowpark vs. traditional SQL, how DataFrames differ from pandas DataFrames, and the trade-offs between UDFs, vectorized UDFs, and stored procedures for ML workloads.

Snowpark Session

Every Snowpark interaction begins with a Session object. The session establishes the connection to Snowflake and serves as the entry point for all DataFrame operations.

from snowflake.snowpark import Session

connection_params = {
    "account": "myaccount",
    "user": "myuser",
    "password": "mypassword",
    "warehouse": "ML_WH",
    "database": "ML_DB",
    "schema": "FEATURES"
}

session = Session.builder.configs(connection_params).create()

Key session concepts for the exam:

  • Lazy evaluation: DataFrame operations build a query plan but do not execute until an action (collect, show, count, write) is called (see the sketch after this list)
  • Server-side execution: Transformations run on Snowflake's compute, not your local machine
  • Warehouse selection: The warehouse determines the compute resources available for your ML workload
  • Session scope: Each session has a default database, schema, and warehouse that can be changed at any time
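
A minimal sketch of lazy evaluation, assuming the CUSTOMER_FEATURES table and TOTAL_SPEND column used later in this section:

# Transformations only build a query plan; nothing runs in Snowflake yet
df = session.table("CUSTOMER_FEATURES")
df_high_spend = df.filter(df["TOTAL_SPEND"] > 100)

# Actions trigger execution on the warehouse
row_count = df_high_spend.count()   # runs the query, returns an int
df_high_spend.show()                # runs the query, prints sample rows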

Snowpark DataFrames

Snowpark DataFrames look similar to pandas but operate very differently. They represent a lazy query plan executed on Snowflake's engine.

from snowflake.snowpark.functions import avg, count

# Read from a table
df = session.table("CUSTOMER_FEATURES")

# Select and filter
df_filtered = df.select("CUSTOMER_ID", "TOTAL_SPEND", "VISIT_COUNT") \
    .filter(df["TOTAL_SPEND"] > 100)

# Aggregations
df_agg = df.group_by("SEGMENT").agg(
    avg("TOTAL_SPEND").alias("AVG_SPEND"),
    count("CUSTOMER_ID").alias("CUSTOMER_COUNT")
)

# Join DataFrames (df_features and df_labels are separate Snowpark
# DataFrames that share the CUSTOMER_ID key)
df_joined = df_features.join(df_labels, on="CUSTOMER_ID", how="inner")
Common exam trap: Snowpark DataFrames are NOT pandas DataFrames. Calling .to_pandas() pulls all data to the client — this defeats the purpose of server-side processing and can cause out-of-memory errors on large datasets. The exam tests whether you know to keep operations in Snowpark DataFrames as long as possible.
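
A minimal sketch of the pattern the exam rewards, reusing df and the functions imported above: keep the heavy lifting in Snowpark and call .to_pandas() only on a small, already-aggregated result.

# Aggregate on the warehouse; only the small result set crosses the network
segment_summary = (
    df.group_by("SEGMENT")
      .agg(avg("TOTAL_SPEND").alias("AVG_SPEND"))
      .to_pandas()   # safe: one row per segment, not millions of rows
)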

DataFrame vs. Pandas Comparison

Snowpark DataFrame

  • Lazy evaluation (builds query plan)
  • Executes on Snowflake warehouse
  • Scales to any data size
  • Column names are case-insensitive
  • Uses Snowflake SQL functions

Pandas DataFrame

  • Eager evaluation (executes immediately)
  • Executes on local machine memory
  • Limited by available RAM
  • Column names are case-sensitive
  • Uses NumPy-based operations

User-Defined Functions (UDFs)

UDFs let you extend Snowflake with custom Python logic. For ML, they are commonly used for feature transformations, custom scoring functions, and model inference.

Scalar UDFs

A scalar UDF processes one row at a time. It takes scalar inputs and returns a scalar output.

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

@udf(return_type=FloatType(), input_types=[FloatType(), FloatType()])
def calculate_ratio(numerator, denominator):
    if denominator == 0:
        return 0.0
    return numerator / denominator

# Use in DataFrame
df = df.with_column("SPEND_RATIO", calculate_ratio(df["TOTAL_SPEND"], df["VISIT_COUNT"]))

Vectorized UDFs

Vectorized UDFs process data in batches using pandas Series, which is significantly faster for ML workloads. The exam frequently tests the difference between scalar and vectorized UDFs.

from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import PandasSeriesType, FloatType
import pandas as pd

@pandas_udf(return_type=PandasSeriesType(FloatType()),
            input_types=[PandasSeriesType(FloatType())])
def normalize_feature(series: pd.Series) -> pd.Series:
    # Each call receives one batch as a pandas Series, so mean/std here
    # are computed per batch rather than over the entire column
    return (series - series.mean()) / series.std()
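
Assuming the columns from the earlier examples, a vectorized UDF is applied to a DataFrame column the same way as a scalar UDF:

df = df.with_column("SPEND_NORMALIZED", normalize_feature(df["TOTAL_SPEND"]))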
💡
Exam tip: Use vectorized UDFs (pandas_udf) for ML feature transformations and batch scoring. They are 10-100x faster than scalar UDFs because they process data in Arrow-based batches rather than row-by-row. The exam expects you to know when vectorized UDFs are the right choice.

When to Use Each UDF Type

  • Scalar UDF: Simple row-level transformations, string manipulation, conditional logic
  • Vectorized UDF: Numerical computations, ML inference, feature engineering on batches, anything that benefits from pandas/NumPy
  • Table Function (UDTF): One-to-many transformations, exploding rows, generating multiple output rows per input (see the sketch after this list)
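
For illustration, a minimal UDTF sketch using Snowpark's udtf decorator with a handler class; the SplitWords handler and its input string are hypothetical examples, not exam material:

from snowflake.snowpark.functions import udtf, lit
from snowflake.snowpark.types import StructType, StructField, StringType

@udtf(output_schema=StructType([StructField("WORD", StringType())]),
      input_types=[StringType()])
class SplitWords:
    def process(self, text: str):
        # One input row yields one output row per word (one-to-many)
        for word in text.split():
            yield (word,)

# Call the registered table function; lit() wraps the literal argument
session.table_function(SplitWords(lit("exploding rows in snowpark"))).show()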

Stored Procedures for ML

Stored procedures run arbitrary Python code on Snowflake compute. Unlike UDFs, they are not called per-row — they execute as a standalone unit of work. This makes them ideal for ML training jobs.

from snowflake.snowpark.functions import sproc
from snowflake.snowpark.types import StringType

# Packages must be declared so Snowflake can resolve them from its Anaconda channel
@sproc(return_type=StringType(), input_types=[StringType(), StringType()],
       packages=["snowflake-snowpark-python", "scikit-learn", "joblib"])
def train_model(session, table_name, model_name):
    # Read training data
    df = session.table(table_name).to_pandas()

    # Train model
    from sklearn.ensemble import RandomForestClassifier
    X = df.drop("LABEL", axis=1)
    y = df["LABEL"]
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)

    # Save model to a named stage (the models stage is assumed to exist)
    import joblib
    joblib.dump(model, "/tmp/model.pkl")
    session.file.put("/tmp/model.pkl", f"@models/{model_name}/", auto_compress=False)

    return f"Model {model_name} trained successfully"
📚
Key distinction: UDFs are for per-row or per-batch transformations within a SQL query. Stored procedures are for complex workflows that include multiple steps (data prep, training, evaluation, saving). The exam tests this distinction frequently.

Package Management

Snowpark provides a curated set of third-party packages available in the Anaconda channel for Snowflake. You can specify package dependencies when creating UDFs and stored procedures.

# Specify packages for a UDF
@udf(return_type=FloatType(),
     input_types=[FloatType()],
     packages=["scikit-learn==1.3.0", "numpy"])
def predict_score(feature_value):
    import numpy as np
    # Model inference logic
    return float(np.log1p(feature_value))

# Check available packages
session.sql("SELECT * FROM information_schema.packages WHERE language = 'python'").show()

Stages for ML Artifacts

Snowflake stages are used to store ML artifacts such as trained models, configuration files, and datasets. Understanding stages is important for the ML workflow.

  • Internal stages: Store files directly in Snowflake (named stages, table stages, user stages)
  • External stages: Reference files in S3, Azure Blob, or GCS
  • PUT/GET: Upload and download files between local filesystem and stages (see the sketch after this list)
  • @~ (user stage): Personal storage area for each user
  • @% (table stage): Associated with a specific table
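
A minimal sketch of the PUT/GET flow, assuming a named internal stage called models and the local model file from the stored procedure example:

# Create a named internal stage for model artifacts (name is illustrative)
session.sql("CREATE STAGE IF NOT EXISTS models").collect()

# PUT: upload a local file to the stage (auto_compress=False keeps the raw .pkl)
session.file.put("/tmp/model.pkl", "@models/churn_model/", auto_compress=False)

# GET: download the artifact back to the local filesystem
session.file.get("@models/churn_model/model.pkl", "/tmp/downloaded/")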

Practice Questions

Question 1

Q1
A data scientist needs to apply a custom normalization function to a numeric column containing 500 million rows in a Snowflake table. The function uses NumPy for computation. Which approach provides the best performance?

A) Create a scalar UDF with the normalization logic
B) Use a vectorized UDF (pandas_udf) with the normalization logic
C) Export the data to pandas using .to_pandas() and normalize locally
D) Write a stored procedure that processes the data in chunks

Answer: B — Vectorized UDFs process data in Arrow-based batches using pandas Series, making them 10-100x faster than scalar UDFs for numerical operations. Option C would fail or be extremely slow on 500M rows due to memory constraints. Option D adds unnecessary complexity when a vectorized UDF handles this natively.

Question 2

Q2
A team wants to train a scikit-learn model on a dataset stored in Snowflake and save the trained model artifact back to a Snowflake stage. Which Snowpark construct should they use?

A) A scalar UDF that trains the model on each row
B) A vectorized UDF that trains the model on each batch
C) A stored procedure that reads the data, trains the model, and uploads it to a stage
D) A table function (UDTF) that outputs model parameters

Answer: C — Stored procedures are designed for complex multi-step workflows like training pipelines: read data, train model, evaluate, save artifacts. UDFs (scalar or vectorized) are per-row/per-batch transforms within a query and are not suited for full training workflows. A UDTF could output parameters but cannot manage the full lifecycle.

Question 3

Q3
Which statement about Snowpark DataFrames is TRUE?

A) Snowpark DataFrames execute operations immediately like pandas
B) Snowpark DataFrames push computation to the Snowflake warehouse and use lazy evaluation
C) Snowpark DataFrames require data to be loaded into client memory before transformation
D) Snowpark DataFrames cannot perform joins or aggregations

Answer: B — Snowpark DataFrames use lazy evaluation, building a query plan that executes on the Snowflake warehouse only when an action (collect, show, count, write) is triggered. This is the fundamental difference from pandas: data stays in Snowflake, and computation scales with the warehouse.

Question 4

Q4
A data engineer needs to use the XGBoost library inside a Snowpark UDF. How should they make the library available?

A) Install XGBoost on the local machine and it will automatically be available in Snowflake
B) Specify XGBoost in the packages parameter of the UDF decorator
C) Upload the XGBoost wheel file to a stage and import it manually
D) XGBoost is not supported in Snowpark UDFs

Answer: B — Snowpark provides a curated Anaconda channel with pre-approved packages including XGBoost. You specify dependencies using the packages parameter in the UDF or stored procedure decorator. Snowflake resolves and caches these packages automatically. While uploading custom packages to a stage is possible for unlisted libraries, XGBoost is available through the standard channel.

Question 5

Q5
What is the primary advantage of using Snowpark for ML workloads compared to extracting data to an external ML platform?

A) Snowpark supports more ML algorithms than external platforms
B) Snowpark eliminates data movement by pushing compute to the data
C) Snowpark provides better visualization tools
D) Snowpark automatically selects the best ML algorithm

Answer: B — The core value proposition of Snowpark is eliminating data movement. Instead of extracting large datasets to external compute, Snowpark runs your Python/ML code directly on Snowflake's compute infrastructure. This reduces latency, improves security (data never leaves Snowflake), and leverages Snowflake's elastic scaling.