Advanced
Practice Exam
25 exam-style questions covering all domains of the Snowflake ML Specialization exam. Try to answer each question before reading the explanation. Target time: 45 minutes (about 1.8 minutes per question, matching the real exam pace).
Exam simulation: Cover the answer explanations as you go. Write down your answers first, then check. A passing score would be roughly 18/25 correct (72%).
Question 1 — Snowpark Fundamentals
A data scientist creates a Snowpark DataFrame by calling session.table("CUSTOMERS") and then chains multiple filter and select operations. When does the query execute on the Snowflake warehouse?
A) Immediately when session.table() is called
B) After each filter or select operation
C) Only when an action like .collect(), .show(), or .count() is called
D) When the Python script finishes executing
Answer: C — Snowpark uses lazy evaluation. DataFrame operations (filter, select, join) build a query plan but do not execute. The query runs on the warehouse only when an action (collect, show, count, write, to_pandas) is triggered.
Question 2 — Snowpark UDFs
A team needs to apply a custom ML scoring function to 200 million rows. The function uses NumPy for matrix operations. Which UDF type provides the best performance?
A) Scalar UDF
B) Vectorized UDF (pandas_udf)
C) JavaScript UDF
D) SQL UDF
Answer: B — Vectorized UDFs process data in Arrow-based batches using pandas Series, which is 10-100x faster than row-by-row scalar UDFs for numerical operations. NumPy operations on batches are highly optimized compared to row-by-row processing.
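A minimal sketch of the batch-at-a-time idea. Only the pure pandas/NumPy handler is shown so it can be read on its own; score_batch, the log1p scoring formula, and the registration comment are illustrative, not part of the question:

```python
import numpy as np
import pandas as pd

def score_batch(amounts: pd.Series) -> pd.Series:
    # A vectorized handler receives an entire Arrow batch as a pandas Series
    # and returns a Series of the same length. NumPy operates on the whole
    # batch at once instead of one row at a time.
    return pd.Series(np.log1p(amounts.to_numpy(dtype="float64")))

# With a live Snowpark session, this handler would be registered roughly as:
#   from snowflake.snowpark.functions import pandas_udf
#   score_udf = pandas_udf(score_batch, return_type=FloatType(),
#                          input_types=[FloatType()],
#                          packages=["numpy", "pandas"], session=session)
```

A scalar UDF with the same formula would invoke the Python function 200 million times; the vectorized version invokes it once per batch.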
Question 3 — Snowpark Stored Procedures
Which statement about Snowpark stored procedures for ML is TRUE?
A) Stored procedures can only use SQL, not Python
B) Stored procedures execute on the client machine
C) Stored procedures can import third-party Python packages from the Anaconda channel
D) Stored procedures cannot access Snowflake stages
Answer: C — Snowpark stored procedures support Python and can import packages from Snowflake's curated Anaconda channel using the packages parameter. They execute on Snowflake's compute (not the client) and have full access to stages for reading/writing files.
Question 4 — Feature Engineering
A data scientist needs to calculate a 7-day rolling average of transaction amounts per customer. Which SQL approach is most efficient?
A) A self-join with a date range filter
B) A window function with ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
C) A stored procedure that loops through each customer
D) A UDTF that computes the rolling average
Answer: B — SQL window functions execute natively on Snowflake's distributed engine and are optimized for this exact pattern. Self-joins are less efficient, stored procedure loops are procedural and slow, and a UDTF adds unnecessary complexity.
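Sketched as SQL, assuming a transactions table with customer_id, txn_date, and amount columns (names are illustrative):

```sql
SELECT
    customer_id,
    txn_date,
    AVG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY txn_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_7d
FROM transactions;
```

Note that ROWS BETWEEN 6 PRECEDING counts rows, not days, so this matches a 7-day window only when there is exactly one row per customer per day; for sparse dates, aggregate to daily grain first or join against a calendar table.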
Question 5 — Feature Store
What does the Snowflake Feature Store's point-in-time join prevent?
A) Duplicate features in the store
B) Unauthorized access to features
C) Data leakage from future data into training features
D) Features with missing values
Answer: C — Point-in-time joins ensure that when retrieving features for training, only feature values available at the time of each training example are used. This prevents future data from leaking into historical training examples, which would produce overly optimistic models.
Question 6 — Feature Engineering
Which Snowpark ML class should be used to replace missing values in a DataFrame with the median of each column?
A) snowflake.ml.modeling.preprocessing.StandardScaler
B) snowflake.ml.modeling.impute.SimpleImputer with strategy="median"
C) snowflake.ml.modeling.preprocessing.OneHotEncoder
D) snowflake.ml.modeling.preprocessing.MinMaxScaler
Answer: B — SimpleImputer with strategy="median" replaces missing values with the median of each column. StandardScaler normalizes features. OneHotEncoder converts categoricals. MinMaxScaler scales to a range. Only SimpleImputer handles missing data.
Question 7 — Model Training
What is the key advantage of using snowflake.ml.modeling.ensemble.RandomForestClassifier over running scikit-learn's RandomForestClassifier in a stored procedure?
A) Snowpark ML's version supports more hyperparameters
B) Snowpark ML's version trains on distributed Snowflake compute without requiring .to_pandas()
C) Snowpark ML's version produces more accurate models
D) Snowpark ML's version does not require a warehouse
Answer: B — The Snowpark ML modeling API accepts Snowpark DataFrames directly and leverages Snowflake's distributed compute for training. A stored procedure using raw scikit-learn would need .to_pandas() to load data into memory, limiting scalability to the procedure's available RAM.
Question 8 — Built-in ML Functions
A business analyst with no Python experience needs to forecast daily sales for the next 30 days. Which Snowflake feature requires the least technical effort?
A) Train an ARIMA model in a Snowpark stored procedure
B) Use the built-in SNOWFLAKE.ML.FORECAST function
C) Deploy a Prophet model using Snowpark Container Services
D) Use Snowpark ML's TimeSeriesForecaster class
Answer: B — The built-in FORECAST function is SQL-only and requires no Python knowledge. It is the simplest approach for standard time-series forecasting. All other options require Python programming knowledge.
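The general shape of the SQL (object and column names are illustrative; check the Snowflake ML Functions documentation for the exact syntax available on your account):

```sql
CREATE SNOWFLAKE.ML.FORECAST sales_model(
    INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'daily_sales'),
    TIMESTAMP_COLNAME => 'sale_date',
    TARGET_COLNAME => 'revenue'
);

CALL sales_model!FORECAST(FORECASTING_PERIODS => 30);
```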
Question 9 — Model Training
When using Snowpark ML's Pipeline, what is the correct order of operations?
A) Train model, then preprocess features
B) Define pipeline steps, call fit() on training data, call predict() on test data
C) Call predict() first to determine which preprocessing is needed
D) Preprocessing and modeling must be done in separate pipelines
Answer: B — Snowpark ML Pipelines follow the scikit-learn pattern: define steps (preprocessing + modeling), fit the entire pipeline on training data, then predict on test data. The pipeline ensures preprocessing is applied consistently.
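A sketch of the pattern, assuming snowflake-ml-python is installed, train_df/test_df are Snowpark DataFrames, and the column names are placeholders (this needs a live Snowflake session, so it is illustrative rather than runnable here):

```python
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.xgboost import XGBClassifier

NUM_COLS = ["AGE", "INCOME"]        # illustrative feature columns

pipe = Pipeline(steps=[
    ("scale", StandardScaler(input_cols=NUM_COLS, output_cols=NUM_COLS)),
    ("model", XGBClassifier(input_cols=NUM_COLS, label_cols=["LABEL"],
                            output_cols=["PREDICTION"])),
])

pipe.fit(train_df)              # 1. fit preprocessing and model on training data
preds = pipe.predict(test_df)   # 2. identical preprocessing applied automatically
```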
Question 10 — Model Training
A team wants to find the best hyperparameters for an XGBoost model using 5-fold cross-validation. Which Snowpark ML class should they use?
A) snowflake.ml.modeling.model_selection.GridSearchCV
B) snowflake.ml.modeling.preprocessing.StandardScaler
C) snowflake.ml.modeling.metrics.accuracy_score
D) snowflake.ml.registry.Registry
Answer: A — GridSearchCV performs exhaustive search over specified parameter combinations with cross-validation. StandardScaler is for feature scaling. accuracy_score is for evaluation. Registry is for model management.
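A sketch of the search setup, again with illustrative column names and a live session assumed:

```python
from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBClassifier

search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 6]},
    cv=5,                                  # 5-fold cross-validation
    input_cols=["AGE", "INCOME"],
    label_cols=["LABEL"],
    output_cols=["PREDICTION"],
)
search.fit(train_df)    # evaluates all 4 parameter combinations with 5 folds each
```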
Question 11 — Model Registry
Which method is used to register a trained model in the Snowflake Model Registry?
A) registry.save_model()
B) registry.log_model()
C) registry.register_model()
D) registry.upload_model()
Answer: B — The log_model() method registers a trained model in the Snowflake Model Registry along with its version, dependencies, and metadata. The other method names are not part of the Registry API.
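A sketch of registration, assuming session is an active Snowpark session and model is a fitted Snowpark ML model or pipeline (database, schema, and model names are illustrative):

```python
from snowflake.ml.registry import Registry

reg = Registry(session=session, database_name="ML_DB", schema_name="MODELS")

model_version = reg.log_model(
    model,                          # a fitted model or pipeline object
    model_name="CHURN_MODEL",
    version_name="V1",
    comment="XGBoost churn classifier",
)
```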
Question 12 — Model Deployment
A registered model needs to generate predictions on a Snowpark DataFrame. Which method should be called on the model version reference?
A) model_version.predict(df)
B) model_version.run(df, function_name="predict")
C) model_version.execute(df)
D) model_version.infer(df)
Answer: B — The run() method with a specified function_name is the correct API for running inference with a registered model. You can specify "predict", "predict_proba", or "transform" depending on the model type.
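A sketch of inference against a registered version, assuming reg is a snowflake.ml.registry.Registry instance and input_df a Snowpark DataFrame (names illustrative):

```python
mv = reg.get_model("CHURN_MODEL").version("V1")
predictions = mv.run(input_df, function_name="predict")
```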
Question 13 — Snowpark Container Services
When is Snowpark Container Services the preferred deployment option over UDF-based model serving? (Select the BEST answer)
A) When the model is a simple logistic regression
B) When GPU-accelerated inference is required for a large transformer model
C) When the model has no external package dependencies
D) When batch scoring runs once per month
Answer: B — SPCS is the right choice when you need GPU compute, custom Docker containers, or complex serving infrastructure. Large transformer models typically require GPU acceleration and custom serving frameworks that only SPCS can provide within Snowflake. Simple models work fine with UDFs or the Model Registry.
Question 14 — Batch Inference
Which Snowflake feature should be used to automatically score a table with an ML model every day at 6 AM?
A) Snowflake Stream
B) Snowflake Task with a CRON schedule
C) Snowflake Alert
D) Snowflake Dynamic Table
Answer: B — Tasks with CRON schedules are designed for automated, recurring jobs. The task can execute SQL or call a stored procedure that performs batch scoring. Streams detect data changes. Alerts trigger on conditions. Dynamic Tables maintain materialized views.
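A sketch of the DDL, assuming a warehouse ML_WH and a stored procedure score_customers() that performs the batch scoring (both names illustrative):

```sql
CREATE OR REPLACE TASK daily_scoring
    WAREHOUSE = ML_WH
    SCHEDULE = 'USING CRON 0 6 * * * America/New_York'
AS
    CALL score_customers();

ALTER TASK daily_scoring RESUME;  -- tasks are created suspended
```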
Question 15 — Snowpark Fundamentals
What happens when you call .to_pandas() on a Snowpark DataFrame containing 500GB of data?
A) Snowflake compresses the data efficiently and it works fine
B) The data is pulled to the client machine's memory, likely causing an out-of-memory error
C) Snowflake automatically samples the data to fit in memory
D) The operation is blocked by Snowflake's safety limits
Answer: B — Calling .to_pandas() transfers all data from Snowflake to the client machine's local memory as a pandas DataFrame. For 500GB, this will almost certainly exceed available RAM and crash. The best practice is to keep data in Snowpark DataFrames and use Snowpark ML operations.
Question 16 — Feature Engineering
Which SQL function is used to create label-encoded values from categorical columns in Snowflake?
A) ROW_NUMBER()
B) DENSE_RANK()
C) NTILE()
D) PERCENT_RANK()
Answer: B — DENSE_RANK() assigns consecutive integer ranks to distinct values, making it ideal for label encoding (e.g., city names to 1, 2, 3...; subtract 1 for zero-based codes). ROW_NUMBER assigns unique numbers to each row. NTILE divides into buckets. PERCENT_RANK gives percentile ranks.
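The pattern as SQL (table and column names are illustrative):

```sql
SELECT
    city,
    DENSE_RANK() OVER (ORDER BY city) - 1 AS city_code  -- zero-based labels
FROM customers;
```

Because DENSE_RANK() leaves no gaps, every distinct city gets the next consecutive code, and identical cities always get the same code.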
Question 17 — Model Training
Which Snowflake built-in ML function would you use to identify unusual transactions in a financial dataset without labeled data?
A) SNOWFLAKE.ML.FORECAST
B) SNOWFLAKE.ML.ANOMALY_DETECTION
C) SNOWFLAKE.ML.CLASSIFICATION
D) SNOWFLAKE.ML.REGRESSION
Answer: B — ANOMALY_DETECTION is designed for identifying outliers and can operate in unsupervised mode (without labels). FORECAST is for time-series prediction. CLASSIFICATION requires labeled training data. There is no built-in REGRESSION function in this form.
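A rough sketch of the unsupervised usage (object and column names are illustrative; confirm the exact argument names against the Snowflake ML Functions documentation):

```sql
CREATE SNOWFLAKE.ML.ANOMALY_DETECTION txn_detector(
    INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'daily_txn_totals'),
    TIMESTAMP_COLNAME => 'txn_date',
    TARGET_COLNAME => 'total_amount',
    LABEL_COLNAME => ''   -- empty string = unsupervised mode, no labels needed
);

CALL txn_detector!DETECT_ANOMALIES(
    INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'new_txn_totals'),
    TIMESTAMP_COLNAME => 'txn_date',
    TARGET_COLNAME => 'total_amount'
);
```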
Question 18 — Model Deployment
In Snowpark Container Services, what is a "compute pool"?
A) A SQL query execution plan
B) A set of Snowflake-managed compute nodes that run containerized applications
C) A virtual warehouse for SQL queries
D) A staging area for model artifacts
Answer: B — A compute pool is a collection of Snowflake-managed compute nodes (CPU or GPU) that host and run containerized services. You specify the instance family and auto-scaling parameters. It is distinct from virtual warehouses, which run SQL queries.
Question 19 — Feature Store
In the Snowflake Feature Store, what is an "Entity"?
A) A trained ML model
B) A primary key that features are associated with (e.g., customer_id)
C) A database schema
D) A compute resource for feature computation
Answer: B — An Entity in the Feature Store defines the primary key (join key) that features are associated with. For example, a "CUSTOMER" entity with join key "CUSTOMER_ID" links customer-level features. Entities enable correct joins when retrieving features for training or inference.
Question 20 — Snowpark Fundamentals
How do you specify that a Snowpark UDF needs the scikit-learn package to execute on Snowflake?
A) Install scikit-learn on your local machine
B) Add scikit-learn to the packages parameter in the UDF decorator
C) Upload the scikit-learn source code to a stage
D) scikit-learn is always available by default
Answer: B — The packages parameter in the @udf or @sproc decorator specifies which Anaconda channel packages should be available in the execution environment. Snowflake resolves and caches these automatically. Local installations do not affect the Snowflake runtime.
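A sketch of the decorator usage (requires a live session; the function name and toy log1p body are illustrative):

```python
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

@udf(packages=["numpy"],            # resolved from Snowflake's Anaconda channel
     return_type=FloatType(),
     input_types=[FloatType()],
     session=session)
def log_score(x: float) -> float:
    import numpy as np              # imported inside the body; runs server-side
    return float(np.log1p(x))
```

The local Python environment is irrelevant here: Snowflake provisions the listed packages in the UDF's server-side sandbox.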
Question 21 — Model Training
A model achieves high accuracy on training data but poor accuracy on validation data. What is this problem called, and what is the recommended fix?
A) Underfitting; add more features and use a more complex model
B) Overfitting; apply regularization, reduce model complexity, or add more training data
C) Data leakage; remove future features from training
D) Class imbalance; apply SMOTE oversampling
Answer: B — High training accuracy with poor validation accuracy is the classic sign of overfitting. The model memorizes training patterns instead of learning generalizable patterns. Fixes include regularization (L1/L2), reducing model complexity (fewer trees, shallower depth), dropout, or adding more training data.
Question 22 — Model Deployment
A data scientist wants to load a trained model file from a Snowflake stage inside a vectorized UDF. Which UDF parameter enables this?
A) packages
B) imports
C) dependencies
D) artifacts
Answer: B — The imports parameter in the UDF decorator specifies stage files to make available in the UDF's execution environment. These files are downloaded to the UDF's working directory and can be loaded using standard file I/O. The packages parameter is for Anaconda packages, not stage files.
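A sketch of the pattern (stage path, file name, and the pickled model are illustrative; requires a live session):

```python
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import FloatType

@udf(imports=["@ml_stage/model.pkl"],       # stage file shipped with the UDF
     packages=["scikit-learn", "joblib"],   # Anaconda channel packages
     return_type=FloatType(),
     input_types=[FloatType()],
     session=session)
def score(x: float) -> float:
    import os, sys, joblib
    # Snowflake exposes the directory where staged imports are downloaded:
    import_dir = sys._xoptions["snowflake_import_directory"]
    model = joblib.load(os.path.join(import_dir, "model.pkl"))
    return float(model.predict([[x]])[0])
```

In practice the model should be loaded once and cached rather than per call, and a vectorized UDF would score whole batches instead of single rows.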
Question 23 — Feature Engineering
Which approach ensures consistent feature scaling between training and inference in Snowflake?
A) Compute scaling parameters independently for each batch during inference
B) Use Snowpark ML's StandardScaler, fit on training data, and reuse the fitted scaler
C) Hardcode mean and standard deviation values in the SQL query
D) Skip scaling since tree-based models do not require it
Answer: B — Fitting the StandardScaler on training data captures the mean and standard deviation. Reusing this fitted scaler for inference ensures consistent transformations. Independent scaling per batch (A) produces inconsistent features. Hardcoded values (C) are fragile. While tree-based models (D) are less sensitive, the question asks about consistent scaling in general.
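The principle, shown with plain NumPy (Snowpark ML's StandardScaler stores the same statistics internally when you call fit; the helper names here are illustrative):

```python
import numpy as np

def fit_standard_scaler(train: np.ndarray):
    # Capture the mean and standard deviation from TRAINING data only.
    return train.mean(axis=0), train.std(axis=0)

def apply_scaler(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # Reuse the fitted statistics at inference time -- never refit per batch.
    return (x - mean) / std

train = np.array([[1.0], [3.0]])
mean, std = fit_standard_scaler(train)              # mean=[2.0], std=[1.0]
scaled = apply_scaler(np.array([[5.0]]), mean, std)  # uses training stats: [[3.0]]
```

Refitting on each inference batch would map the same raw value to different scaled values from batch to batch, silently shifting the model's inputs.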
Question 24 — Model Monitoring
Which Snowflake feature is best suited for sending an automated notification when a model's prediction distribution drifts beyond a threshold?
A) Snowflake Stream
B) Snowflake Task
C) Snowflake Alert
D) Snowflake Dynamic Table
Answer: C — Snowflake Alerts evaluate a condition on a schedule and trigger an action (e.g., send email, call stored procedure) when the condition is met. This is ideal for monitoring drift thresholds. Tasks run on a schedule but do not have built-in conditional logic. Streams track data changes. Dynamic Tables maintain views.
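A sketch of the DDL, assuming a drift-metrics table, a PSI threshold of 0.2, and a configured email notification integration (all names and thresholds are illustrative):

```sql
CREATE OR REPLACE ALERT drift_alert
    WAREHOUSE = MONITOR_WH
    SCHEDULE = '60 MINUTE'
    IF (EXISTS (
        SELECT 1
        FROM prediction_drift_metrics
        WHERE psi > 0.2
          AND computed_at > SNOWFLAKE.ALERT.LAST_SUCCESSFUL_SCHEDULED_TIME()
    ))
    THEN CALL SYSTEM$SEND_EMAIL(
        'email_int',
        'ml-team@example.com',
        'Model drift detected',
        'Prediction PSI exceeded 0.2 in the last hour.');

ALTER ALERT drift_alert RESUME;  -- alerts, like tasks, start suspended
```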
Question 25 — End-to-End ML
A company wants to build a complete ML pipeline in Snowflake: daily feature computation, weekly model retraining, and real-time inference for a web application. Which combination of Snowflake features covers all three requirements?
A) Tasks for feature computation, stored procedure for training, UDFs for inference
B) Streams for feature computation, Tasks for training, Dynamic Tables for inference
C) Tasks for feature computation, Tasks for training, Snowpark Container Services for inference
D) External tools for all three, with data stored in Snowflake
Answer: C — Tasks with CRON schedules handle daily feature computation and weekly retraining. SPCS provides real-time HTTP endpoints for web application inference with auto-scaling. While UDFs (option A) work for SQL-embedded inference, a web application needs an HTTP endpoint, which SPCS provides. Option D defeats the purpose of Snowflake's ML capabilities.
Score Interpretation
20-25 Correct
Ready for the exam. You have strong command of all domains. Focus your remaining study time on any questions you got wrong.
15-19 Correct
Almost ready. Review the domains where you missed questions. Re-read the relevant lesson and try again in a few days.
Below 15
More study needed. Go back through the course lessons, focusing on hands-on practice with Snowpark ML and the model registry. Retake this exam when you feel more confident.
Lilly Tech Systems