Model Training
Train machine learning models directly inside Snowflake using the Snowpark ML library, built-in ML functions, and popular framework integrations — with hyperparameter tuning and cross-validation strategies for the exam.
The snowflake-ml-python Library
The snowflake-ml-python package (also called Snowpark ML) is Snowflake's official ML library. It provides scikit-learn-compatible APIs that train models on Snowflake's compute, keeping data inside the platform.
The exam distinguishes two training approaches: snowflake.ml.modeling (distributed training on Snowflake) and raw scikit-learn inside a stored procedure (single-node training). The former scales automatically; the latter is limited to the stored procedure's memory.
Snowpark ML Modeling API
Snowpark ML provides wrappers for popular ML algorithms that accept Snowpark DataFrames directly and train on Snowflake's distributed compute.
from snowflake.ml.modeling.ensemble import RandomForestClassifier
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.modeling.linear_model import LogisticRegression
# Define feature and label columns
FEATURE_COLS = ["SPEND_SCALED", "VISITS_SCALED", "DAYS_SINCE_LAST", "AGE"]
LABEL_COL = "CHURNED"
# Train a Random Forest using Snowpark ML
rf_model = RandomForestClassifier(
input_cols=FEATURE_COLS,
label_cols=[LABEL_COL],
output_cols=["PREDICTED_CHURN"],
n_estimators=100,
max_depth=10,
random_state=42
)
# fit() trains on Snowflake's compute
rf_model.fit(df_train)
# predict() returns a Snowpark DataFrame
df_predictions = rf_model.predict(df_test)
Supported Algorithms
Classification
- RandomForestClassifier
- XGBClassifier
- LightGBMClassifier
- LogisticRegression
- GradientBoostingClassifier
- SVC (Support Vector)
Regression
- RandomForestRegressor
- XGBRegressor
- LightGBMRegressor
- LinearRegression
- GradientBoostingRegressor
- SVR (Support Vector)
Preprocessing
- StandardScaler
- MinMaxScaler
- OneHotEncoder
- OrdinalEncoder
- SimpleImputer
- Pipeline (chain transforms)
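Snowpark ML transformers take input_cols/output_cols and operate on DataFrame columns, but the underlying math matches scikit-learn. As a minimal, library-free sketch of what StandardScaler computes (subtract the mean, divide by the population standard deviation — my stand-in function, not the Snowpark ML API):

```python
import statistics

# StandardScaler math: z = (x - mean) / stdev
# (population stdev, matching scikit-learn's default behavior)
def standard_scale(values):
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

spend = [10.0, 20.0, 30.0]
scaled = standard_scale(spend)
print(scaled)  # zero mean, unit variance
```

In Snowpark ML the same transformation is expressed as StandardScaler(input_cols=["SPEND"], output_cols=["SPEND_SCALED"]) and runs on the warehouse rather than in local Python.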
Built-in ML Functions
Snowflake also provides SQL-level ML functions that require no Python code. These are useful for quick analyses and are tested on the exam.
Forecasting
-- Create a time-series forecast model
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST sales_forecast(
INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'DAILY_SALES'),
TIMESTAMP_COLNAME => 'SALE_DATE',
TARGET_COLNAME => 'REVENUE',
SERIES_COLNAME => 'PRODUCT_ID'
);
-- Generate forecasts
CALL sales_forecast!FORECAST(
FORECASTING_PERIODS => 30,
CONFIG_OBJECT => {'prediction_interval': 0.95}
);
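The prediction_interval setting controls the width of the bounds returned with each forecast. As a conceptual illustration only (not Snowflake's internal forecasting algorithm), a 95% interval is a range expected to contain roughly 95% of observed values; the simulated revenue data below is made up for the sketch:

```python
import numpy as np

# Simulate historical daily revenue (illustrative data only)
rng = np.random.default_rng(42)
history = rng.normal(loc=100.0, scale=10.0, size=10_000)

# A 95% interval spans the 2.5th to 97.5th percentiles,
# so ~95% of values fall inside it
lower, upper = np.quantile(history, [0.025, 0.975])
inside = np.mean((history >= lower) & (history <= upper))
print(f"95% interval: [{lower:.1f}, {upper:.1f}], coverage: {inside:.2%}")
```

Raising prediction_interval (e.g. to 0.99) widens the bounds; lowering it narrows them.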
Anomaly Detection
-- Create an anomaly detection model
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION transaction_monitor(
INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'TRANSACTIONS'),
TIMESTAMP_COLNAME => 'TXN_DATE',
TARGET_COLNAME => 'AMOUNT',
LABEL_COLNAME => '' -- unsupervised
);
-- Detect anomalies
CALL transaction_monitor!DETECT_ANOMALIES(
INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'NEW_TRANSACTIONS'),
TIMESTAMP_COLNAME => 'TXN_DATE',
TARGET_COLNAME => 'AMOUNT',
CONFIG_OBJECT => {'prediction_interval': 0.99}
);
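Conceptually, anomaly detection flags points that fall far outside the typical range of the series. The sketch below is only an illustration of that idea using a robust median/MAD rule — it is not the model Snowflake's ANOMALY_DETECTION function trains, and the threshold and data are assumptions:

```python
import statistics

# Flag values far from the median, measured in robust "MAD units"
# (1.4826 scales MAD to approximate a standard deviation)
def flag_anomalies(values, threshold=3.5):
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) / (1.4826 * mad) > threshold]

amounts = [102, 98, 101, 99, 100, 97, 103, 100, 5000]  # one outlier
print(flag_anomalies(amounts))  # only the 5000 transaction is flagged
```

A median-based rule is used here because a single extreme value inflates the mean and standard deviation enough to mask itself under a plain z-score test.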
Classification (Cortex ML)
-- Create a classification model
CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION churn_classifier(
INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'TRAINING_DATA'),
TARGET_COLNAME => 'CHURNED'
);
-- Predict on new data
SELECT *, churn_classifier!PREDICT(
INPUT_DATA => OBJECT_CONSTRUCT(*)
) AS PREDICTION
FROM NEW_CUSTOMERS;
Training with External Frameworks
For models that require specific frameworks, you can train inside stored procedures with full access to scikit-learn, XGBoost, LightGBM, and PyTorch.
from snowflake.snowpark.functions import sproc
from snowflake.snowpark.types import StringType
@sproc(return_type=StringType(),
input_types=[StringType()],
packages=["snowflake-ml-python", "xgboost", "scikit-learn", "joblib"])
def train_xgboost(session, training_table):
import xgboost as xgb
from sklearn.model_selection import cross_val_score
import joblib
# Read data into pandas (runs inside Snowflake compute)
df = session.table(training_table).to_pandas()
X = df.drop("TARGET", axis=1)
y = df["TARGET"]
# Train with cross-validation
model = xgb.XGBClassifier(
n_estimators=200,
max_depth=6,
learning_rate=0.1,
eval_metric="logloss"
)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
model.fit(X, y)
# Save to stage
joblib.dump(model, "/tmp/xgb_model.pkl")
session.file.put("/tmp/xgb_model.pkl", "@ML_MODELS/xgboost/")
return f"Model trained. CV AUC: {scores.mean():.4f} (+/- {scores.std():.4f})"
Snowpark ML Pipelines
Just like scikit-learn's Pipeline, Snowpark ML supports chaining preprocessing and modeling steps.
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import StandardScaler, OneHotEncoder
from snowflake.ml.modeling.ensemble import RandomForestClassifier
# Build a pipeline
pipeline = Pipeline(steps=[
("scaler", StandardScaler(
input_cols=["SPEND", "VISITS"],
output_cols=["SPEND_SCALED", "VISITS_SCALED"]
)),
("encoder", OneHotEncoder(
input_cols=["CATEGORY"],
output_cols=["CATEGORY_ENCODED"]
)),
("classifier", RandomForestClassifier(
input_cols=["SPEND_SCALED", "VISITS_SCALED", "CATEGORY_ENCODED"],
label_cols=["CHURNED"],
output_cols=["PREDICTION"]
))
])
# Fit and predict
pipeline.fit(df_train)
df_results = pipeline.predict(df_test)
Model Evaluation
Snowpark ML provides evaluation metrics that work directly with Snowpark DataFrames.
from snowflake.ml.modeling.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, mean_squared_error, log_loss
)
# Classification metrics
accuracy = accuracy_score(df=df_predictions, y_true_col_names="CHURNED", y_pred_col_names="PREDICTION")
precision = precision_score(df=df_predictions, y_true_col_names="CHURNED", y_pred_col_names="PREDICTION")
recall = recall_score(df=df_predictions, y_true_col_names="CHURNED", y_pred_col_names="PREDICTION")
f1 = f1_score(df=df_predictions, y_true_col_names="CHURNED", y_pred_col_names="PREDICTION")
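These metrics reduce to simple confusion-matrix arithmetic. A minimal pure-Python sketch of the quantities the Snowpark ML metric functions compute over a DataFrame (the helper function and toy labels are my own, not part of the library):

```python
# Precision, recall, and F1 from confusion-matrix counts
def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)          # of flagged, how many were right
    recall = tp / (tp + fn)             # of actual positives, how many caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))  # all three come out ~0.667
```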
Hyperparameter Tuning
Snowpark ML supports grid search and random search for hyperparameter optimization.
from snowflake.ml.modeling.model_selection import GridSearchCV, RandomizedSearchCV
# Grid search
grid_search = GridSearchCV(
estimator=RandomForestClassifier(
input_cols=FEATURE_COLS,
label_cols=[LABEL_COL],
output_cols=["PREDICTION"]
),
param_grid={
"n_estimators": [50, 100, 200],
"max_depth": [5, 10, 15],
"min_samples_split": [2, 5, 10]
},
scoring="roc_auc",
cv=5,
input_cols=FEATURE_COLS,
label_cols=[LABEL_COL],
output_cols=["PREDICTION"]
)
grid_search.fit(df_train)
best_params = grid_search.to_sklearn().best_params_
best_score = grid_search.to_sklearn().best_score_
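Under the hood, grid search simply evaluates every parameter combination and keeps the best cross-validated score. A toy sketch of that loop, with a made-up scoring function standing in for actual model training (everything here is illustrative, not the GridSearchCV implementation):

```python
from itertools import product

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, 15]}

# Hypothetical stand-in for a cross-validated score; peaks at (100, 10)
def toy_score(n_estimators, max_depth):
    return -abs(n_estimators - 100) - abs(max_depth - 10)

# Exhaustively evaluate all 3 x 3 = 9 combinations, keep the best
best_params, best_score = None, float("-inf")
for n, d in product(param_grid["n_estimators"], param_grid["max_depth"]):
    score = toy_score(n, d)
    if score > best_score:
        best_params, best_score = {"n_estimators": n, "max_depth": d}, score

print(best_params)  # {'n_estimators': 100, 'max_depth': 10}
```

With cv=5, GridSearchCV multiplies this by five training runs per combination, which is why warehouse size directly affects tuning time.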
Practice Questions
Question 1
A data scientist needs to train a gradient-boosted model on a 100GB table stored in Snowflake without extracting the data from the platform. Which approach is most appropriate?
A) Export the data to an external cluster and train with XGBoost
B) Use snowflake.ml.modeling.xgboost.XGBClassifier with a Snowpark DataFrame
C) Use a stored procedure with scikit-learn's GradientBoostingClassifier
D) Use the SQL CLASSIFICATION built-in function
Answer: B — The Snowpark ML XGBClassifier accepts Snowpark DataFrames directly and leverages Snowflake's distributed compute for training. No data extraction is needed. A stored procedure (C) would need to call .to_pandas(), loading the full 100GB into memory. The built-in CLASSIFICATION (D) does not support XGBoost specifically. Exporting (A) defeats the purpose of in-platform training.
Question 2
A team needs a standard time-series forecast of revenue across multiple product series, with minimal setup and code. Which approach should they choose?
A) Train an ARIMA model using a stored procedure with statsmodels
B) Use the Snowflake built-in FORECAST function
C) Build a custom LSTM model in a Snowpark Container Service
D) Export data to an external AutoML platform
Answer: B — The built-in FORECAST function is a SQL-only solution that handles time-series forecasting with minimal effort. It automatically handles seasonality, trend, and multiple series. All other options require significantly more setup and code for a standard forecasting task.
Question 3
When GridSearchCV from Snowpark ML performs hyperparameter tuning, where does the computation execute?
A) On the client machine that initiated the session
B) On an external cloud compute service
C) On the Snowflake warehouse specified in the session
D) On a dedicated Snowflake optimization service
Answer: C — All Snowpark ML operations, including GridSearchCV, execute on the Snowflake warehouse specified in the session. The warehouse provides the compute resources for training each fold and parameter combination. Larger warehouses can process these faster. No data leaves Snowflake.
Question 4
A fraud-detection model achieves 99% accuracy on a dataset where only 1% of transactions are fraudulent. Which metrics should be used to evaluate it?
A) Accuracy (it is already 99%)
B) R-squared score
C) Recall and F1 score
D) Mean absolute error
Answer: C — With 1% fraud rate, a model that predicts "not fraud" for everything achieves 99% accuracy but catches zero fraud. Recall measures how many actual fraudulent transactions are detected (critical for fraud). F1 balances recall with precision. R-squared and MAE are regression metrics, not applicable here.
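The accuracy paradox in this answer is easy to verify numerically. A sketch with simulated labels (the data is made up to match the 1% fraud rate in the question):

```python
# A model that never flags fraud on a 1%-fraud dataset:
# high accuracy, zero recall
y_true = [1] * 10 + [0] * 990   # 10 fraudulent out of 1,000 transactions
y_pred = [0] * 1000             # model predicts "not fraud" every time

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)
print(accuracy, recall)  # 0.99 0.0
```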
Lilly Tech Systems