
Model Training

Train machine learning models directly inside Snowflake using the Snowpark ML library, built-in ML functions, and popular framework integrations — with hyperparameter tuning and cross-validation strategies for the exam.

The snowflake-ml-python Library

The snowflake-ml-python package (also called Snowpark ML) is Snowflake's official ML library. It provides scikit-learn-compatible APIs that train models on Snowflake's compute, keeping data inside the platform.

💡
Exam focus: The exam heavily tests the snowflake-ml-python library. Know the difference between snowflake.ml.modeling (distributed training on Snowflake) and using raw scikit-learn in a stored procedure (single-node training). The former scales automatically; the latter is limited to the stored procedure's memory.

Snowpark ML Modeling API

Snowpark ML provides wrappers for popular ML algorithms that accept Snowpark DataFrames directly and train on Snowflake's distributed compute.

from snowflake.ml.modeling.ensemble import RandomForestClassifier
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.modeling.linear_model import LogisticRegression

# Define feature and label columns
FEATURE_COLS = ["SPEND_SCALED", "VISITS_SCALED", "DAYS_SINCE_LAST", "AGE"]
LABEL_COL = "CHURNED"

# Train a Random Forest using Snowpark ML
rf_model = RandomForestClassifier(
    input_cols=FEATURE_COLS,
    label_cols=[LABEL_COL],
    output_cols=["PREDICTED_CHURN"],
    n_estimators=100,
    max_depth=10,
    random_state=42
)

# fit() trains on Snowflake's compute
rf_model.fit(df_train)

# predict() returns a Snowpark DataFrame
df_predictions = rf_model.predict(df_test)

Supported Algorithms

Classification

  • RandomForestClassifier
  • XGBClassifier
  • LightGBMClassifier
  • LogisticRegression
  • GradientBoostingClassifier
  • SVC (Support Vector Classification)

Regression

  • RandomForestRegressor
  • XGBRegressor
  • LightGBMRegressor
  • LinearRegression
  • GradientBoostingRegressor
  • SVR (Support Vector Regression)

Preprocessing

  • StandardScaler
  • MinMaxScaler
  • OneHotEncoder
  • OrdinalEncoder
  • SimpleImputer
  • Pipeline (chain transforms)
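The scalers in this list implement the standard formulas. As a plain-Python illustration of the math (not the Snowpark ML internals), StandardScaler computes z = (x − mean) / stddev per column, while MinMaxScaler maps each column onto [0, 1]:

```python
import statistics

# Illustrative values for one numeric column
values = [10.0, 20.0, 30.0, 40.0, 50.0]

# StandardScaler: z = (x - mean) / stddev (population stddev, matching scikit-learn)
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standard_scaled = [(v - mean) / std for v in values]

# MinMaxScaler: (x - min) / (max - min), mapping the column onto [0, 1]
lo, hi = min(values), max(values)
minmax_scaled = [(v - lo) / (hi - lo) for v in values]

print(standard_scaled[2])  # the mean maps to 0.0
print(minmax_scaled)       # [0.0, 0.25, 0.5, 0.75, 1.0]
```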

Built-in ML Functions

Snowflake also provides SQL-level ML functions that require no Python code. These are useful for quick analyses and are tested on the exam.

Forecasting

-- Create a time-series forecast model
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST sales_forecast(
    INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'DAILY_SALES'),
    TIMESTAMP_COLNAME => 'SALE_DATE',
    TARGET_COLNAME => 'REVENUE',
    SERIES_COLNAME => 'PRODUCT_ID'
);

-- Generate forecasts
CALL sales_forecast!FORECAST(
    FORECASTING_PERIODS => 30,
    CONFIG_OBJECT => {'prediction_interval': 0.95}
);
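The prediction_interval value sets the confidence level of the lower and upper bounds returned with each forecast. As a plain-Python illustration of the idea (not FORECAST's internal algorithm), a 95% interval under a normal-error assumption is roughly the point forecast ± 1.96 standard deviations of the model's residuals; the residual values below are hypothetical:

```python
import statistics

# Hypothetical residuals (actual - predicted) from a fitted forecast model
residuals = [-3.0, 1.5, 2.0, -1.0, 0.5, -2.5, 2.5]
point_forecast = 100.0

# Normal approximation: 95% interval = forecast +/- 1.96 * residual stddev
sigma = statistics.pstdev(residuals)
lower = point_forecast - 1.96 * sigma
upper = point_forecast + 1.96 * sigma
print(f"forecast={point_forecast}, interval=({lower:.2f}, {upper:.2f})")
```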

Anomaly Detection

-- Create an anomaly detection model
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION transaction_monitor(
    INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'TRANSACTIONS'),
    TIMESTAMP_COLNAME => 'TXN_DATE',
    TARGET_COLNAME => 'AMOUNT',
    LABEL_COLNAME => ''  -- unsupervised
);

-- Detect anomalies
CALL transaction_monitor!DETECT_ANOMALIES(
    INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'NEW_TRANSACTIONS'),
    TIMESTAMP_COLNAME => 'TXN_DATE',
    TARGET_COLNAME => 'AMOUNT',
    CONFIG_OBJECT => {'prediction_interval': 0.99}
);
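Conceptually, interval-based anomaly detection flags points that fall outside the model's prediction interval. A simplified plain-Python version of that decision rule, using hypothetical transaction amounts (illustrative only, not Snowflake's implementation):

```python
import statistics

# Hypothetical historical transaction amounts and new observations to score
history = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0]
new_points = [101.0, 150.0, 96.0]

# Build a 99%-style interval around the historical mean (z ~ 2.576 for normal errors)
mu = statistics.fmean(history)
sigma = statistics.pstdev(history)
lower, upper = mu - 2.576 * sigma, mu + 2.576 * sigma

# Flag anything outside the interval as anomalous
flags = [not (lower <= x <= upper) for x in new_points]
print(flags)  # only the 150.0 transaction is flagged
```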

Classification (Cortex ML)

-- Create a classification model
CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION churn_classifier(
    INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'TRAINING_DATA'),
    TARGET_COLNAME => 'CHURNED'
);

-- Predict on new data
SELECT *, churn_classifier!PREDICT(
    INPUT_DATA => OBJECT_CONSTRUCT(*)
) AS PREDICTION
FROM NEW_CUSTOMERS;

📚
Built-in vs. Snowpark ML: Built-in ML functions (FORECAST, ANOMALY_DETECTION, CLASSIFICATION) are SQL-only, require no Python, and are best for common use cases. Snowpark ML provides full Python flexibility for custom models and complex pipelines. The exam tests when to use each approach.

Training with External Frameworks

For models that require specific frameworks, you can train inside stored procedures with full access to scikit-learn, XGBoost, LightGBM, and PyTorch.

from snowflake.snowpark.functions import sproc
from snowflake.snowpark.types import StringType

# Registering with @sproc requires an active Snowpark session in scope
@sproc(return_type=StringType(),
       input_types=[StringType()],
       packages=["snowflake-ml-python", "xgboost", "scikit-learn", "joblib"])
def train_xgboost(session, training_table):
    import xgboost as xgb
    from sklearn.model_selection import cross_val_score
    import joblib

    # Read data into pandas (runs inside Snowflake compute)
    df = session.table(training_table).to_pandas()
    X = df.drop("TARGET", axis=1)
    y = df["TARGET"]

    # Train with cross-validation
    model = xgb.XGBClassifier(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.1,
        eval_metric="logloss"
    )
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    model.fit(X, y)

    # Save to a stage (auto_compress=False keeps the plain .pkl file name)
    joblib.dump(model, "/tmp/xgb_model.pkl")
    session.file.put("/tmp/xgb_model.pkl", "@ML_MODELS/xgboost/", auto_compress=False)

    return f"Model trained. CV AUC: {scores.mean():.4f} (+/- {scores.std():.4f})"
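The cv=5 above means 5-fold cross-validation: the rows are split into five folds, and each fold takes one turn as the validation set while the other four train the model. The index bookkeeping behind that split can be sketched in plain Python:

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, val
        start += size

# Each of the 10 rows appears in exactly one validation fold
folds = list(kfold_indices(10, 5))
print(len(folds))  # 5 splits
print(folds[0])    # first split: rows 0-1 validate, rows 2-9 train
```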

Snowpark ML Pipelines

Just like scikit-learn's Pipeline, Snowpark ML supports chaining preprocessing and modeling steps.

from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import StandardScaler, OneHotEncoder
from snowflake.ml.modeling.ensemble import RandomForestClassifier

# Build a pipeline
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler(
        input_cols=["SPEND", "VISITS"],
        output_cols=["SPEND_SCALED", "VISITS_SCALED"]
    )),
    ("encoder", OneHotEncoder(
        input_cols=["CATEGORY"],
        output_cols=["CATEGORY_ENCODED"]
    )),
    ("classifier", RandomForestClassifier(
        input_cols=["SPEND_SCALED", "VISITS_SCALED", "CATEGORY_ENCODED"],
        label_cols=["CHURNED"],
        output_cols=["PREDICTION"]
    ))
])

# Fit and predict
pipeline.fit(df_train)
df_results = pipeline.predict(df_test)

Model Evaluation

Snowpark ML provides evaluation metrics that work directly with Snowpark DataFrames.

from snowflake.ml.modeling.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, mean_squared_error, log_loss
)

# Classification metrics
accuracy = accuracy_score(df=df_predictions, y_true_col_names="CHURNED", y_pred_col_names="PREDICTION")
precision = precision_score(df=df_predictions, y_true_col_names="CHURNED", y_pred_col_names="PREDICTION")
recall = recall_score(df=df_predictions, y_true_col_names="CHURNED", y_pred_col_names="PREDICTION")
f1 = f1_score(df=df_predictions, y_true_col_names="CHURNED", y_pred_col_names="PREDICTION")

💡
Exam tip: Know when to use each metric. Accuracy is misleading for imbalanced datasets. Precision matters when false positives are costly (spam filtering). Recall matters when false negatives are costly (fraud detection, disease diagnosis). F1 balances both. AUC-ROC evaluates ranking quality regardless of threshold.
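To make the imbalance point concrete, here is a plain-Python computation of these metrics from hypothetical confusion-matrix counts on a dataset where only 1% of transactions are fraudulent:

```python
# Hypothetical counts: 1,000 transactions, 10 fraudulent.
# The model catches 2 frauds (TP), misses 8 (FN), and raises 1 false alarm (FP).
tp, fn, fp = 2, 8, 1
tn = 1000 - tp - fn - fp  # 989 correctly ignored legitimate transactions

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}")  # 0.991: looks great despite missing most fraud
print(f"recall={recall:.3f}")      # 0.200: only 1 in 5 frauds caught
print(f"f1={f1:.3f}")
```

Despite 99.1% accuracy, the recall of 0.2 exposes that the model misses 80% of the fraud, which is exactly the failure mode the exam tip warns about.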

Hyperparameter Tuning

Snowpark ML supports grid search and random search for hyperparameter optimization.

from snowflake.ml.modeling.model_selection import GridSearchCV, RandomizedSearchCV

# Grid search
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(
        input_cols=FEATURE_COLS,
        label_cols=[LABEL_COL],
        output_cols=["PREDICTION"]
    ),
    param_grid={
        "n_estimators": [50, 100, 200],
        "max_depth": [5, 10, 15],
        "min_samples_split": [2, 5, 10]
    },
    scoring="roc_auc",
    cv=5,
    input_cols=FEATURE_COLS,
    label_cols=[LABEL_COL],
    output_cols=["PREDICTION"]
)

grid_search.fit(df_train)
sklearn_result = grid_search.to_sklearn()
best_params = sklearn_result.best_params_
best_score = sklearn_result.best_score_
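Grid search cost grows multiplicatively: the grid above has 3 × 3 × 3 = 27 parameter combinations, and with cv=5 each combination is trained five times, for 135 model fits in total. The enumeration itself is just a Cartesian product, sketched here in plain Python:

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
}

# Cartesian product over the grid: one dict per candidate parameter set
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

cv = 5
total_fits = len(candidates) * cv
print(len(candidates))  # 27 parameter combinations
print(total_fits)       # 135 model fits across all folds
```

This is why RandomizedSearchCV, which samples a fixed number of candidates instead of enumerating all of them, is preferred for large grids.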

Practice Questions

Question 1

A data scientist wants to train a gradient-boosted classifier on a 100GB dataset stored in Snowflake. They need distributed training without extracting data. Which approach should they use?

A) Export the data to an external cluster and train with XGBoost
B) Use snowflake.ml.modeling.xgboost.XGBClassifier with a Snowpark DataFrame
C) Use a stored procedure with scikit-learn's GradientBoostingClassifier
D) Use the SQL CLASSIFICATION built-in function

Answer: B — The Snowpark ML XGBClassifier accepts Snowpark DataFrames directly and leverages Snowflake's distributed compute for training. No data extraction is needed. A stored procedure (C) would need to call .to_pandas(), loading the full 100GB into memory. The built-in CLASSIFICATION (D) does not support XGBoost specifically. Exporting (A) defeats the purpose of in-platform training.

Question 2

A team has a simple time-series forecasting need: predict daily sales for the next 30 days per product. They have no custom model requirements. Which approach requires the least effort?

A) Train an ARIMA model using a stored procedure with statsmodels
B) Use the Snowflake built-in FORECAST function
C) Build a custom LSTM model in a Snowpark Container Service
D) Export data to an external AutoML platform

Answer: B — The built-in FORECAST function is a SQL-only solution that handles time-series forecasting with minimal effort. It automatically handles seasonality, trend, and multiple series. All other options require significantly more setup and code for a standard forecasting task.

Question 3

When using Snowpark ML's GridSearchCV for hyperparameter tuning, where does the cross-validation computation execute?

A) On the client machine that initiated the session
B) On an external cloud compute service
C) On the Snowflake warehouse specified in the session
D) On a dedicated Snowflake optimization service

Answer: C — All Snowpark ML operations, including GridSearchCV, execute on the Snowflake warehouse specified in the session. The warehouse provides the compute resources for training each fold and parameter combination. Larger warehouses can process these faster. No data leaves Snowflake.

Question 4

A fraud detection model has 99% accuracy on a dataset where only 1% of transactions are fraudulent. Which metric should the team prioritize instead?

A) Accuracy (it is already 99%)
B) R-squared score
C) Recall and F1 score
D) Mean absolute error

Answer: C — With 1% fraud rate, a model that predicts "not fraud" for everything achieves 99% accuracy but catches zero fraud. Recall measures how many actual fraudulent transactions are detected (critical for fraud). F1 balances recall with precision. R-squared and MAE are regression metrics, not applicable here.