
Practice Exam

25 exam-style questions covering all domains of the Snowflake ML Specialization exam. Try to answer each question before reading the explanation. Target time: 45 minutes (matching the real exam pace of ~1.8 min/question).

💡 Exam simulation: Cover the answer explanations as you go. Write down your answers first, then check. A passing score would be roughly 18/25 correct (72%).

Question 1 — Snowpark Fundamentals

A data scientist creates a Snowpark DataFrame by calling session.table("CUSTOMERS") and then chains multiple filter and select operations. When does the query execute on the Snowflake warehouse?

A) Immediately when session.table() is called
B) After each filter or select operation
C) Only when an action like .collect(), .show(), or .count() is called
D) When the Python script finishes executing

Answer: C — Snowpark uses lazy evaluation. DataFrame operations (filter, select, join) build a query plan but do not execute. The query runs on the warehouse only when an action (collect, show, count, write, to_pandas) is triggered.
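The lazy-evaluation pattern can be illustrated with a toy analogue (this is not the Snowpark API itself — just a minimal sketch of how transformations record a plan that only runs at an action):

```python
# Toy analogue of Snowpark's lazy evaluation (illustrative, not the real API):
# transformations only extend a query plan; work happens at an action.
class LazyFrame:
    def __init__(self, rows, plan=None):
        self.rows = rows
        self.plan = plan or []          # pending operations, not yet run

    def filter(self, pred):
        # Transformation: returns a new frame with an extended plan.
        return LazyFrame(self.rows, self.plan + [("filter", pred)])

    def select(self, *cols):
        return LazyFrame(self.rows, self.plan + [("select", cols)])

    def collect(self):
        # Action: only now is the accumulated plan executed.
        out = self.rows
        for op, arg in self.plan:
            if op == "filter":
                out = [r for r in out if arg(r)]
            elif op == "select":
                out = [{c: r[c] for c in arg} for r in out]
        return out

df = LazyFrame([{"id": 1, "spend": 50}, {"id": 2, "spend": 500}])
pending = df.filter(lambda r: r["spend"] > 100).select("id")
# Nothing has executed yet; pending.plan merely holds two recorded steps.
result = pending.collect()   # the "query" runs here
```

In real Snowpark, the recorded plan is compiled into SQL and pushed down to the warehouse when the action fires.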

Question 2 — Snowpark UDFs

A team needs to apply a custom ML scoring function to 200 million rows. The function uses NumPy for matrix operations. Which UDF type provides the best performance?

A) Scalar UDF
B) Vectorized UDF (pandas_udf)
C) JavaScript UDF
D) SQL UDF

Answer: B — Vectorized UDFs process data in Arrow-based batches using pandas Series, which is 10-100x faster than row-by-row scalar UDFs for numerical operations. NumPy operations on batches are highly optimized compared to row-by-row processing.
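The handler shape is the key difference: a vectorized UDF receives a whole batch as a pandas Series rather than one value at a time. A minimal sketch (registration details such as the `@udf(..., packages=[...])` decorator arguments are assumed, not shown):

```python
# Sketch of a vectorized UDF handler body. In Snowflake you would register
# this with the pandas_udf/vectorized decorator; here we just call it locally.
import numpy as np
import pandas as pd

def score_batch(amounts: pd.Series) -> pd.Series:
    # The whole batch arrives as one pandas Series, so NumPy operates on a
    # contiguous array instead of being invoked once per row.
    return pd.Series(np.log1p(amounts.to_numpy()))

batch = pd.Series([0.0, 1.0, np.e - 1])
scored = score_batch(batch)   # one call scores the entire batch
```

A scalar UDF would pay Python-call overhead 200 million times; the vectorized version pays it once per Arrow batch.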

Question 3 — Snowpark Stored Procedures

Which statement about Snowpark stored procedures for ML is TRUE?

A) Stored procedures can only use SQL, not Python
B) Stored procedures execute on the client machine
C) Stored procedures can import third-party Python packages from the Anaconda channel
D) Stored procedures cannot access Snowflake stages

Answer: C — Snowpark stored procedures support Python and can import packages from Snowflake's curated Anaconda channel using the packages parameter. They execute on Snowflake's compute (not the client) and have full access to stages for reading/writing files.

Question 4 — Feature Engineering

A data scientist needs to calculate a 7-day rolling average of transaction amounts per customer. Which SQL approach is most efficient?

A) A self-join with a date range filter
B) A window function with ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
C) A stored procedure that loops through each customer
D) A UDTF that computes the rolling average

Answer: B — SQL window functions execute natively on Snowflake's distributed engine and are optimized for this exact pattern. Self-joins are less efficient, stored procedure loops are procedural and slow, and a UDTF adds unnecessary complexity.
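The same frame clause can be tried locally — here via Python's built-in sqlite3, which supports the identical `ROWS BETWEEN ... PRECEDING AND CURRENT ROW` syntax used in Snowflake SQL (the table and data are made up for illustration):

```python
import sqlite3

# Local illustration of the rolling-average window pattern using SQLite;
# the same frame clause runs unchanged in Snowflake SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (customer TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO txns VALUES (?, ?, ?)",
    [("a", d, float(d)) for d in range(1, 11)],   # amounts 1.0 .. 10.0
)
rows = conn.execute(
    """
    SELECT day,
           AVG(amount) OVER (
               PARTITION BY customer ORDER BY day
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_avg_7d
    FROM txns ORDER BY day
    """
).fetchall()
# Day 10 averages days 4..10: (4+5+6+7+8+9+10)/7 = 7.0
```

Note that `ROWS BETWEEN 6 PRECEDING` gives a 7-row window; if transactions can skip days, a `RANGE`-based frame over dates is the stricter "7 calendar days" variant.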

Question 5 — Feature Store

What does the Snowflake Feature Store's point-in-time join prevent?

A) Duplicate features in the store
B) Unauthorized access to features
C) Data leakage from future data into training features
D) Features with missing values

Answer: C — Point-in-time joins ensure that when retrieving features for training, only feature values available at the time of each training example are used. This prevents future data from leaking into historical training examples, which would produce overly optimistic models.
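The logic behind a point-in-time join can be sketched in a few lines (toy data and function names are hypothetical; the Feature Store does this for you at scale):

```python
from datetime import date

# Point-in-time lookup sketch: for each training example, use only the
# latest feature value whose timestamp is at or before the example's.
features = [  # (customer_id, as_of, avg_spend) — hypothetical feature history
    ("c1", date(2024, 1, 1), 100.0),
    ("c1", date(2024, 2, 1), 150.0),
    ("c1", date(2024, 3, 1), 900.0),
]

def point_in_time_lookup(customer_id, event_ts):
    eligible = [f for f in features
                if f[0] == customer_id and f[1] <= event_ts]
    if not eligible:
        return None
    return max(eligible, key=lambda f: f[1])[2]   # newest eligible value

# A label observed on Feb 15 must not see the March feature value.
value = point_in_time_lookup("c1", date(2024, 2, 15))
```

A naive join on customer_id alone would attach the March value of 900.0 to the February example — exactly the future-data leakage the point-in-time join prevents.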

Question 6 — Feature Engineering

Which Snowpark ML class should be used to replace missing values in a DataFrame with the median of each column?

A) snowflake.ml.modeling.preprocessing.StandardScaler
B) snowflake.ml.modeling.impute.SimpleImputer with strategy="median"
C) snowflake.ml.modeling.preprocessing.OneHotEncoder
D) snowflake.ml.modeling.preprocessing.MinMaxScaler

Answer: B — SimpleImputer with strategy="median" replaces missing values with the median of each column. StandardScaler normalizes features. OneHotEncoder converts categoricals. MinMaxScaler scales to a range. Only SimpleImputer handles missing data.
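What `strategy="median"` computes is easy to replicate by hand — a stdlib sketch of the same idea (in Snowpark ML you would fit the imputer on a DataFrame instead):

```python
from statistics import median

# Hand-rolled median imputation, illustrating what
# SimpleImputer(strategy="median") computes for one column.
def impute_median(column):
    observed = [v for v in column if v is not None]
    fill = median(observed)                         # learned from observed data
    return [fill if v is None else v for v in column]

imputed = impute_median([1.0, None, 3.0, None, 5.0])  # median of {1, 3, 5} is 3.0
```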

Question 7 — Model Training

What is the key advantage of using snowflake.ml.modeling.ensemble.RandomForestClassifier over running scikit-learn's RandomForestClassifier in a stored procedure?

A) Snowpark ML's version supports more hyperparameters
B) Snowpark ML's version trains on distributed Snowflake compute without requiring .to_pandas()
C) Snowpark ML's version produces more accurate models
D) Snowpark ML's version does not require a warehouse

Answer: B — The Snowpark ML modeling API accepts Snowpark DataFrames directly and leverages Snowflake's distributed compute for training. A stored procedure using raw scikit-learn would need .to_pandas() to load data into memory, limiting scalability to the procedure's available RAM.

Question 8 — Built-in ML Functions

A business analyst with no Python experience needs to forecast daily sales for the next 30 days. Which Snowflake feature requires the least technical effort?

A) Train an ARIMA model in a Snowpark stored procedure
B) Use the built-in SNOWFLAKE.ML.FORECAST function
C) Deploy a Prophet model using Snowpark Container Services
D) Use Snowpark ML's TimeSeriesForecaster class

Answer: B — The built-in FORECAST function is SQL-only and requires no Python knowledge. It is the simplest approach for standard time-series forecasting. All other options require Python programming knowledge.

Question 9 — Model Training

When using Snowpark ML's Pipeline, what is the correct order of operations?

A) Train model, then preprocess features
B) Define pipeline steps, call fit() on training data, call predict() on test data
C) Call predict() first to determine which preprocessing is needed
D) Preprocessing and modeling must be done in separate pipelines

Answer: B — Snowpark ML Pipelines follow the scikit-learn pattern: define steps (preprocessing + modeling), fit the entire pipeline on training data, then predict on test data. The pipeline ensures preprocessing is applied consistently.
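The fit-then-predict order can be made concrete with a toy pipeline (a minimal analogue of the scikit-learn/Snowpark ML pattern; the classes here are invented for illustration):

```python
# Toy pipeline following the define -> fit -> predict order.
class MeanCenter:
    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        return self
    def transform(self, xs):
        return [x - self.mean for x in xs]

class ThresholdModel:
    def fit(self, xs, ys):
        return self                       # trivially "trained" for the sketch
    def predict(self, xs):
        return [1 if x > 0 else 0 for x in xs]

class Pipeline:
    def __init__(self, steps, model):
        self.steps, self.model = steps, model
    def fit(self, xs, ys):
        for step in self.steps:           # 1) fit preprocessing on training data
            xs = step.fit(xs).transform(xs)
        self.model.fit(xs, ys)            # 2) fit the model on transformed data
        return self
    def predict(self, xs):
        for step in self.steps:           # 3) reuse the *fitted* steps at predict time
            xs = step.transform(xs)
        return self.model.predict(xs)

pipe = Pipeline([MeanCenter()], ThresholdModel())
pipe.fit([1.0, 2.0, 3.0], [0, 1, 1])      # learned mean = 2.0
preds = pipe.predict([0.0, 5.0])          # centered to -2.0 and 3.0
```

The point of the pattern is step 3: prediction reuses the parameters learned in step 1 rather than refitting, so preprocessing is applied consistently.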

Question 10 — Model Training

A team wants to find the best hyperparameters for an XGBoost model using 5-fold cross-validation. Which Snowpark ML class should they use?

A) snowflake.ml.modeling.model_selection.GridSearchCV
B) snowflake.ml.modeling.preprocessing.StandardScaler
C) snowflake.ml.modeling.metrics.accuracy_score
D) snowflake.ml.registry.Registry

Answer: A — GridSearchCV performs exhaustive search over specified parameter combinations with cross-validation. StandardScaler is for feature scaling. accuracy_score is for evaluation. Registry is for model management.
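The exhaustive-search part of GridSearchCV is just nested iteration over a parameter grid — a stdlib sketch with a stand-in scoring function (a real search would cross-validate an actual model at each point):

```python
from itertools import product

# Toy exhaustive grid search: evaluate every parameter combination and keep
# the best, which is what GridSearchCV automates (plus k-fold CV) at scale.
def evaluate(n_estimators, max_depth):
    # Stand-in score that happens to peak at (100, 6); a real search would
    # return mean cross-validated accuracy of a fitted model here.
    return -abs(n_estimators - 100) - abs(max_depth - 6)

grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 6, 9]}
best_params, best_score = None, float("-inf")
for combo in product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    score = evaluate(**params)
    if score > best_score:
        best_params, best_score = params, score
```

With 5-fold CV, each of the 9 grid points above would cost 5 model fits — 45 in total — which is why GridSearchCV on distributed Snowflake compute matters for larger grids.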

Question 11 — Model Registry

Which method is used to register a trained model in the Snowflake Model Registry?

A) registry.save_model()
B) registry.log_model()
C) registry.register_model()
D) registry.upload_model()

Answer: B — The log_model() method registers a trained model in the Snowflake Model Registry along with its version, dependencies, and metadata. The other method names are not part of the Registry API.

Question 12 — Model Deployment

A registered model needs to generate predictions on a Snowpark DataFrame. Which method should be called on the model version reference?

A) model_version.predict(df)
B) model_version.run(df, function_name="predict")
C) model_version.execute(df)
D) model_version.infer(df)

Answer: B — The run() method with a specified function_name is the correct API for running inference with a registered model. You can specify "predict", "predict_proba", or "transform" depending on the model type.

Question 13 — Snowpark Container Services

When is Snowpark Container Services the preferred deployment option over UDF-based model serving? (Select the BEST answer)

A) When the model is a simple logistic regression
B) When GPU-accelerated inference is required for a large transformer model
C) When the model has no external package dependencies
D) When batch scoring runs once per month

Answer: B — SPCS is the right choice when you need GPU compute, custom Docker containers, or complex serving infrastructure. Large transformer models typically require GPU acceleration and custom serving frameworks that only SPCS can provide within Snowflake. Simple models work fine with UDFs or the Model Registry.

Question 14 — Batch Inference

Which Snowflake feature should be used to automatically score a table with an ML model every day at 6 AM?

A) Snowflake Stream
B) Snowflake Task with a CRON schedule
C) Snowflake Alert
D) Snowflake Dynamic Table

Answer: B — Tasks with CRON schedules are designed for automated, recurring jobs. The task can execute SQL or call a stored procedure that performs batch scoring. Streams detect data changes. Alerts trigger on conditions. Dynamic Tables maintain materialized views.

Question 15 — Snowpark Fundamentals

What happens when you call .to_pandas() on a Snowpark DataFrame containing 500GB of data?

A) Snowflake compresses the data efficiently and it works fine
B) The data is pulled to the client machine's memory, likely causing an out-of-memory error
C) Snowflake automatically samples the data to fit in memory
D) The operation is blocked by Snowflake's safety limits

Answer: B — Calling .to_pandas() transfers all data from Snowflake to the client machine's local memory as a pandas DataFrame. For 500GB, this will almost certainly exceed available RAM and crash. The best practice is to keep data in Snowpark DataFrames and use Snowpark ML operations.

Question 16 — Feature Engineering

Which SQL function is used to create label-encoded values from categorical columns in Snowflake?

A) ROW_NUMBER()
B) DENSE_RANK()
C) NTILE()
D) PERCENT_RANK()

Answer: B — DENSE_RANK() assigns consecutive integer ranks to distinct values, making it ideal for label encoding (e.g., city names to 1, 2, 3, ...; subtract 1 if zero-based codes are needed). ROW_NUMBER assigns a unique number to each row. NTILE divides rows into buckets. PERCENT_RANK gives percentile ranks.
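The pattern can be tried locally — here via Python's built-in sqlite3, which supports the same DENSE_RANK() window function as Snowflake (table and data invented for illustration):

```python
import sqlite3

# DENSE_RANK label encoding, illustrated locally with SQLite; the same SQL
# runs in Snowflake. DENSE_RANK starts at 1, so subtract 1 for 0-based codes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [("Austin",), ("Boston",), ("Austin",), ("Chicago",)])
rows = conn.execute(
    "SELECT city, DENSE_RANK() OVER (ORDER BY city) - 1 AS city_code "
    "FROM customers ORDER BY city"
).fetchall()
# Duplicate cities share a code, and no codes are skipped — unlike RANK().
```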

Question 17 — Model Training

Which Snowflake built-in ML function would you use to identify unusual transactions in a financial dataset without labeled data?

A) SNOWFLAKE.ML.FORECAST
B) SNOWFLAKE.ML.ANOMALY_DETECTION
C) SNOWFLAKE.ML.CLASSIFICATION
D) SNOWFLAKE.ML.REGRESSION

Answer: B — ANOMALY_DETECTION is designed for identifying outliers and can operate in unsupervised mode (without labels). FORECAST is for time-series prediction. CLASSIFICATION requires labeled training data. There is no built-in REGRESSION function in this form.
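SNOWFLAKE.ML.ANOMALY_DETECTION is a managed function, but the underlying idea — flagging outliers with no labels — can be sketched with a simple z-score rule (a deliberately crude analogue, not how the managed function works internally):

```python
from statistics import mean, stdev

# Conceptual analogue of unsupervised anomaly detection: flag values whose
# z-score exceeds a threshold. No labels are required — only the data itself.
def flag_anomalies(values, threshold=3.0):
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

amounts = [100.0] * 30 + [105.0, 95.0, 5000.0]
anomalies = flag_anomalies(amounts)   # only the extreme transaction is flagged
```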

Question 18 — Model Deployment

In Snowpark Container Services, what is a "compute pool"?

A) A SQL query execution plan
B) A set of Snowflake-managed compute nodes that run containerized applications
C) A virtual warehouse for SQL queries
D) A staging area for model artifacts

Answer: B — A compute pool is a collection of Snowflake-managed compute nodes (CPU or GPU) that host and run containerized services. You specify the instance family and auto-scaling parameters. It is distinct from virtual warehouses, which run SQL queries.

Question 19 — Feature Store

In the Snowflake Feature Store, what is an "Entity"?

A) A trained ML model
B) A primary key that features are associated with (e.g., customer_id)
C) A database schema
D) A compute resource for feature computation

Answer: B — An Entity in the Feature Store defines the primary key (join key) that features are associated with. For example, a "CUSTOMER" entity with join key "CUSTOMER_ID" links customer-level features. Entities enable correct joins when retrieving features for training or inference.

Question 20 — Snowpark Fundamentals

How do you specify that a Snowpark UDF needs the scikit-learn package to execute on Snowflake?

A) Install scikit-learn on your local machine
B) Add scikit-learn to the packages parameter in the UDF decorator
C) Upload the scikit-learn source code to a stage
D) scikit-learn is always available by default

Answer: B — The packages parameter in the @udf or @sproc decorator specifies which Anaconda channel packages should be available in the execution environment. Snowflake resolves and caches these automatically. Local installations do not affect the Snowflake runtime.

Question 21 — Model Training

A model achieves high accuracy on training data but poor accuracy on validation data. What is this problem called, and what is the recommended fix?

A) Underfitting; add more features and use a more complex model
B) Overfitting; apply regularization, reduce model complexity, or add more training data
C) Data leakage; remove future features from training
D) Class imbalance; apply SMOTE oversampling

Answer: B — High training accuracy with poor validation accuracy is the classic sign of overfitting. The model memorizes training patterns instead of learning generalizable patterns. Fixes include regularization (L1/L2), reducing model complexity (fewer trees, shallower depth), dropout, or adding more training data.

Question 22 — Model Deployment

A data scientist wants to load a trained model file from a Snowflake stage inside a vectorized UDF. Which UDF parameter enables this?

A) packages
B) imports
C) dependencies
D) artifacts

Answer: B — The imports parameter in the UDF decorator specifies stage files to make available in the UDF's execution environment. These files are downloaded to the UDF's working directory and can be loaded using standard file I/O. The packages parameter is for Anaconda packages, not stage files.

Question 23 — Feature Engineering

Which approach ensures consistent feature scaling between training and inference in Snowflake?

A) Compute scaling parameters independently for each batch during inference
B) Use Snowpark ML's StandardScaler, fit on training data, and reuse the fitted scaler
C) Hardcode mean and standard deviation values in the SQL query
D) Skip scaling since tree-based models do not require it

Answer: B — Fitting the StandardScaler on training data captures the mean and standard deviation. Reusing this fitted scaler for inference ensures consistent transformations. Independent scaling per batch (A) produces inconsistent features. Hardcoded values (C) are fragile. While tree-based models (D) are less sensitive, the question asks about consistent scaling in general.
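The fit-once, reuse-everywhere discipline is easy to see in a toy StandardScaler analogue (stdlib only; the real class lives in snowflake.ml.modeling.preprocessing):

```python
from statistics import mean, pstdev

# Toy StandardScaler: parameters come from the TRAINING data and are applied
# unchanged at inference. Refitting on an inference batch would shift features.
class Scaler:
    def fit(self, xs):
        self.mean, self.std = mean(xs), pstdev(xs)
        return self
    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]

scaler = Scaler().fit([10.0, 20.0, 30.0])       # learned mean=20.0
train_scaled = scaler.transform([10.0, 20.0, 30.0])
# At inference, reuse the fitted parameters — do NOT refit on the new batch.
infer_scaled = scaler.transform([20.0])
```

In practice the fitted scaler (or the whole pipeline containing it) is what you persist in the Model Registry, so training-time and serving-time transformations cannot drift apart.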

Question 24 — Model Monitoring

Which Snowflake feature is best suited for sending an automated notification when a model's prediction distribution drifts beyond a threshold?

A) Snowflake Stream
B) Snowflake Task
C) Snowflake Alert
D) Snowflake Dynamic Table

Answer: C — Snowflake Alerts evaluate a condition on a schedule and trigger an action (e.g., send email, call stored procedure) when the condition is met. This is ideal for monitoring drift thresholds. Tasks run on a schedule but do not have built-in conditional logic. Streams track data changes. Dynamic Tables maintain views.

Question 25 — End-to-End ML

A company wants to build a complete ML pipeline in Snowflake: daily feature computation, weekly model retraining, and real-time inference for a web application. Which combination of Snowflake features covers all three requirements?

A) Tasks for feature computation, stored procedure for training, UDFs for inference
B) Streams for feature computation, Tasks for training, Dynamic Tables for inference
C) Tasks for feature computation, Tasks for training, Snowpark Container Services for inference
D) External tools for all three, with data stored in Snowflake

Answer: C — Tasks with CRON schedules handle daily feature computation and weekly retraining. SPCS provides real-time HTTP endpoints for web application inference with auto-scaling. While UDFs (option A) work for SQL-embedded inference, a web application needs an HTTP endpoint, which SPCS provides. Option D defeats the purpose of Snowflake's ML capabilities.

Score Interpretation

20-25 Correct

Ready for the exam. You have strong command of all domains. Focus your remaining study time on any questions you got wrong.

15-19 Correct

Almost ready. Review the domains where you missed questions. Re-read the relevant lesson and try again in a few days.

Below 15

More study needed. Go back through the course lessons, focusing on hands-on practice with Snowpark ML and the model registry. Retake this exam when you feel more confident.