Advanced

Practice Exam

25 exam-style questions covering all domains of the Databricks Machine Learning Professional exam. Try to answer each question before reading the explanation. Target time: 50 minutes (matching the real exam pace of 2 min/question).

Exam simulation: Cover the answer explanations as you go. Write down your answers first, then check. A passing score would be roughly 18/25 correct (72%).

Question 1 — MLflow Tracking

A data scientist runs the following code. What is logged to MLflow?

import mlflow
from sklearn.ensemble import RandomForestClassifier

mlflow.autolog()
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    mlflow.log_metric("custom_score", 0.95)

A) Only the custom_score metric
B) Only the autolog parameters and metrics
C) Both autolog data (parameters, metrics, model) AND the custom_score metric, all in a single run
D) Two separate runs: one for autolog and one for custom_score

Answer: C — When mlflow.start_run() is active, autolog logs to that active run. The custom metric is also logged to the same run. Everything is captured in a single run context.

Question 2 — Feature Store

A feature table has primary_keys=["user_id"] and timestamp_keys=["date"]. Training labels have a label_date column. A FeatureLookup uses lookup_key="user_id" and timestamp_lookup_key="label_date". How does the Feature Store handle the timestamp join?

A) It performs an exact join on date = label_date
B) It performs an as-of join, selecting the most recent feature row where date <= label_date
C) It ignores the timestamp and uses the latest features for each user_id
D) It raises an error because timestamp column names do not match

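Answer: B — For a time series feature table, the Feature Store performs a point-in-time (as-of) join on the timestamp key, provided the FeatureLookup sets timestamp_lookup_key="label_date" to say which label column carries the lookup time. For each label row it selects the most recent feature row at or before the label timestamp, which prevents leakage of future data. The column names do not need to match.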
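
The as-of rule described in this answer can be sketched in plain Python. This is only an illustration of the selection logic, not Feature Store code: the feature rows, the purchases_30d column, and the as_of_lookup helper are all hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical feature rows for one user_id, sorted by the timestamp key ("date").
feature_rows = [
    (date(2024, 1, 1), {"purchases_30d": 2}),
    (date(2024, 1, 10), {"purchases_30d": 5}),
    (date(2024, 1, 20), {"purchases_30d": 9}),
]

def as_of_lookup(rows, label_date):
    """Return the latest feature values with date <= label_date, or None."""
    timestamps = [ts for ts, _ in rows]
    # bisect_right counts the rows whose timestamp is <= label_date.
    idx = bisect_right(timestamps, label_date)
    return rows[idx - 1][1] if idx > 0 else None

# A label dated 2024-01-15 sees the 2024-01-10 features, never the 2024-01-20 ones.
print(as_of_lookup(feature_rows, date(2024, 1, 15)))
```

A label dated before the first feature row gets no match (None here), which is the behavior that keeps future values out of the training set.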

Question 3 — Hyperopt

A team wants to maximize F1 score using Hyperopt. Their objective function returns {"loss": f1_score, "status": STATUS_OK}. What is wrong?

A) Nothing is wrong; Hyperopt will maximize the F1 score
B) Hyperopt minimizes loss, so they should return {"loss": -f1_score, "status": STATUS_OK}
C) They should use STATUS_FAIL instead of STATUS_OK
D) Hyperopt does not support F1 score as a metric

Answer: B — Hyperopt's fmin() always minimizes the loss value. To maximize F1 score, the objective must return the negative of the F1 score. Returning the positive F1 score would cause Hyperopt to find the parameters that produce the LOWEST F1 score.
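
A toy illustration of why the sign matters. The minimize function below is a hypothetical stand-in for a minimizer such as fmin(), and the F1 numbers are invented for the example; only the shape of the returned dictionary mirrors the question.

```python
# Hypothetical F1 scores for three candidate max_depth values (illustrative only).
f1_by_depth = {2: 0.61, 5: 0.78, 10: 0.70}

def minimize(objective, candidates):
    """Toy stand-in for a minimizer: return the candidate with the lowest loss."""
    return min(candidates, key=lambda c: objective(c)["loss"])

def wrong_objective(depth):
    # Returning F1 directly asks the minimizer for the lowest F1.
    return {"loss": f1_by_depth[depth], "status": "ok"}

def right_objective(depth):
    # Negating F1 turns "minimize loss" into "maximize F1".
    return {"loss": -f1_by_depth[depth], "status": "ok"}

worst = minimize(wrong_objective, f1_by_depth)  # picks depth 2 (F1 0.61)
best = minimize(right_objective, f1_by_depth)   # picks depth 5 (F1 0.78)
print(worst, best)
```

Remember to flip the sign back when reporting the best score, since the stored loss is the negative of the F1 value.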

Question 4 — Model Registry

A team uses the newer alias-based approach for model management. They want to load the current production model. Which URI is correct?

A) models:/fraud-detector/Production
B) models:/fraud-detector@production
C) models:/fraud-detector/latest
D) models:/fraud-detector@champion

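Answer: D — Alias URIs take the form models:/<name>@<alias>, and by convention the alias for the production model is "champion". Option A uses the legacy stage-based URI. Option B has valid alias syntax, but aliases are free-form names, so it resolves only if an alias called "production" has been set; it is not the conventional name. Option C, where it is supported, resolves to the most recently registered version rather than the version designated for production. The "champion" alias must have been assigned beforehand with set_registered_model_alias().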

Question 5 — Model Serving

A serving endpoint receives JSON payloads. Which format correctly sends two records for scoring?

A) {"inputs": [[1, 2, 3], [4, 5, 6]]}
B) {"dataframe_records": [{"a": 1, "b": 2}, {"a": 4, "b": 5}]}
C) {"data": [{"a": 1, "b": 2}, {"a": 4, "b": 5}]}
D) {"rows": [{"a": 1, "b": 2}, {"a": 4, "b": 5}]}

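Answer: B — For models that take tabular input, Databricks Model Serving accepts dataframe_records (a list of dictionaries, one per row) and dataframe_split (column names plus an array of row values). dataframe_records is the more readable of the two for a handful of rows. The inputs key in option A is meant for tensor input, so it only suits models that expect tensors; the data and rows keys in options C and D are not recognized request formats.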
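
As a small sketch, the two tabular payload shapes can be built from the same records with the standard library. The column names a and b come from the question; no endpoint is called here.

```python
import json

records = [{"a": 1, "b": 2}, {"a": 4, "b": 5}]

# Record-oriented payload: one dictionary per row.
records_payload = {"dataframe_records": records}

# Split-oriented payload: column names once, then one list of values per row.
columns = list(records[0])
split_payload = {
    "dataframe_split": {
        "columns": columns,
        "data": [[row[c] for c in columns] for row in records],
    }
}

print(json.dumps(records_payload))
print(json.dumps(split_payload))
```

The split form avoids repeating column names in every row, which can matter for large requests.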

Question 6 — AutoML

After running automl.classify(), a data scientist wants to modify the best model's training code. What should they do?

A) Decompile the model artifact and modify the code
B) Open the generated source notebook for the best trial and edit it directly
C) Use the AutoML API to modify hyperparameters and re-run
D) Export the model to ONNX format and modify there

Answer: B — Databricks AutoML is a glass-box solution that generates editable Python notebooks for each trial. The data scientist can open the best trial's notebook, modify the code (e.g., change features, add preprocessing, adjust parameters), and re-run it. This is one of AutoML's key differentiators.

Question 7 — Distributed Training

A team needs to train a PyTorch model across 4 GPUs on a Databricks cluster. Which approach is correct?

A) Use SparkTrials(parallelism=4) with Hyperopt
B) Use TorchDistributor(num_processes=4, use_gpu=True)
C) Convert the model to Spark ML and use pipeline.fit()
D) Train on each GPU separately and average the weights manually

Answer: B — TorchDistributor (pyspark.ml.torch.distributor) runs distributed PyTorch training on a Spark cluster and is supported on Databricks. It handles process spawning, inter-process communication, and GPU allocation. SparkTrials (A) runs separate model instances in parallel for hyperparameter tuning rather than one distributed training job. Spark ML (C) does not support PyTorch models. Option D would mean hand-rolling the synchronization that distributed training already provides.

Question 8 — DLT Expectations

A DLT pipeline processes financial data. Regulatory requirements mandate that if any row has a negative transaction amount, the entire pipeline must stop and alert the team. Which expectation is appropriate?

A) @dlt.expect("positive_amount", "amount > 0")
B) @dlt.expect_or_drop("positive_amount", "amount > 0")
C) @dlt.expect_or_fail("positive_amount", "amount > 0")
D) No expectation needed; handle in downstream processing

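Answer: C — @dlt.expect_or_fail() stops the pipeline when any row violates the constraint. This is appropriate for regulatory or compliance scenarios where invalid data must not be processed. Option A records the violation but keeps the row and lets the pipeline continue. Option B silently drops the offending rows, which could hide compliance issues. Note that the constraint amount > 0 also rejects zero amounts; amount >= 0 would match the stated requirement exactly.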
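
The difference between the three behaviors can be mimicked in a few lines of plain Python. This is a toy model of the semantics only: apply_expectation, its policy names, and ExpectationFailed are hypothetical and are not part of the dlt module.

```python
class ExpectationFailed(Exception):
    """Raised by the toy 'fail' policy; not a DLT exception class."""

def apply_expectation(rows, predicate, policy):
    """Mimic the three policies: 'warn' keeps rows, 'drop' filters, 'fail' aborts."""
    violations = [r for r in rows if not predicate(r)]
    if policy == "fail" and violations:
        raise ExpectationFailed(f"{len(violations)} row(s) violated the constraint")
    kept = rows if policy == "warn" else [r for r in rows if predicate(r)]
    return kept, len(violations)

rows = [{"amount": 120.0}, {"amount": -5.0}, {"amount": 40.0}]

def is_positive(row):
    return row["amount"] > 0

# 'warn' keeps all three rows and reports one violation.
warn_rows, warn_violations = apply_expectation(rows, is_positive, "warn")
# 'drop' removes the negative row and still reports it.
drop_rows, drop_violations = apply_expectation(rows, is_positive, "drop")
# 'fail' raises as soon as any violation exists.
try:
    apply_expectation(rows, is_positive, "fail")
    failed = False
except ExpectationFailed:
    failed = True
```

Only the fail policy prevents downstream tables from being written with data derived from the bad batch, which is what the regulatory requirement in the question asks for.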

Question 9 — Spark ML Pipeline

A Spark ML Pipeline has stages: [StringIndexer, VectorAssembler, RandomForestClassifier]. After calling pipeline.fit(train_df), a data engineer calls pipeline_model.transform(test_df). What happens?

A) Only the RandomForestClassifier predictions are generated
B) Each stage transforms the data sequentially: indexing, assembling, then predicting
C) An error occurs because the pipeline was already fitted
D) The pipeline re-trains on the test data

Answer: B — A PipelineModel applies each fitted stage's transform() method sequentially. The StringIndexer maps strings to indices, VectorAssembler combines features, and the fitted RandomForestClassificationModel generates predictions. No re-training occurs during transform().

Question 10 — Workflow Task Values

In a Databricks Workflow with tasks "train" and "evaluate", the train task sets dbutils.jobs.taskValues.set(key="accuracy", value=0.92). How does the evaluate task read this value?

A) dbutils.jobs.taskValues.get(key="accuracy")
B) dbutils.jobs.taskValues.get(taskKey="train", key="accuracy")
C) dbutils.widgets.get("accuracy")
D) spark.conf.get("accuracy")

Answer: B — Reading a task value requires specifying both the taskKey (the name of the task that set the value) and the key. Option A is missing the taskKey parameter. Options C and D use unrelated mechanisms.

Question 11 — Feature Store Serving

A model was trained using Feature Store with fs.create_training_set() and logged with fs.log_model(). At serving time, the endpoint receives only a customer_id. How are the features obtained?

A) The endpoint fails because all features must be provided in the request
B) The endpoint automatically looks up features from the online Feature Store using the customer_id
C) The endpoint uses cached features from the training data
D) The endpoint queries the offline Delta table directly

Answer: B — When a model is logged with fs.log_model(), the Feature Store metadata (feature lookups) is packaged with the model. At serving time, the endpoint uses the primary key (customer_id) to look up features from the published online store. This is why publishing features to an online store is required for real-time serving.

Question 12 — Batch Inference

A team needs to score 50 million records nightly using a model from the registry. Which approach minimizes cost and maximizes throughput?

A) Use a Model Serving endpoint and send records in batches via REST API
B) Use mlflow.pyfunc.spark_udf() applied to a Spark DataFrame in a scheduled Workflow
C) Load the model on a single node and iterate through records with pandas
D) Use fs.score_batch() with a Feature Store model that automatically joins features

Answer: B — For large-scale batch inference, spark_udf() distributes the scoring across the cluster and is the most cost-effective approach. Option D is also valid if the model uses Feature Store, but the question does not mention Feature Store. Option A is designed for real-time, not batch. Option C cannot handle 50M records efficiently.

Question 13 — MLflow Model Flavors

A data scientist trained a model using XGBoost and logged it with mlflow.xgboost.log_model(). They now want to load it as a generic Python function for serving. Which code is correct?

A) mlflow.xgboost.load_model(model_uri)
B) mlflow.pyfunc.load_model(model_uri)
C) mlflow.sklearn.load_model(model_uri)
D) mlflow.load_model(model_uri, flavor="pyfunc")

Answer: B — mlflow.pyfunc.load_model() loads any MLflow model as a generic Python function with a standard predict() interface. This is the recommended approach for serving because it provides a consistent API regardless of the original framework. Option A loads the native XGBoost model instead. Option C targets the scikit-learn flavor, which this model was not logged with. Option D does not correspond to an MLflow function.

Question 14 — Hyperopt Search Space

A team needs to tune a learning rate that should be sampled between 0.0001 and 0.1 on a logarithmic scale. Which Hyperopt expression is correct?

A) hp.uniform("lr", 0.0001, 0.1)
B) hp.loguniform("lr", np.log(0.0001), np.log(0.1))
C) hp.choice("lr", [0.0001, 0.001, 0.01, 0.1])
D) hp.quniform("lr", 0.0001, 0.1, 0.001)

Answer: B — hp.loguniform(low, high) samples from a log-uniform distribution where the returned value is exp(uniform(low, high)). The bounds must be provided in log-space: np.log(0.0001) to np.log(0.1). This is ideal for learning rates that span multiple orders of magnitude. Option A uses linear uniform, which would heavily bias toward larger values.
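
The formula exp(uniform(low, high)) can be checked with the standard library alone. The sketch below does not use Hyperopt; sample_loguniform is a hypothetical helper that takes the bounds in their natural units and applies the log itself.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw exp(u) with u uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
log_samples = [sample_loguniform(1e-4, 1e-1, rng) for _ in range(10_000)]
lin_samples = [rng.uniform(1e-4, 1e-1) for _ in range(10_000)]

# Fraction of draws below 0.001, i.e. in the lowest of the three decades.
frac_log = sum(s < 1e-3 for s in log_samples) / len(log_samples)
frac_lin = sum(s < 1e-3 for s in lin_samples) / len(lin_samples)
print(f"log-uniform: {frac_log:.2f}, uniform: {frac_lin:.2f}")
```

With log-uniform sampling each decade (0.0001 to 0.001, 0.001 to 0.01, 0.01 to 0.1) receives about a third of the draws, while linear sampling places roughly 1% of the draws in the lowest decade.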

Question 15 — Model Monitoring

A serving endpoint has inference logging enabled. What data is automatically captured in the inference table?

A) Only the prediction output
B) Input features, predictions, timestamps, and request metadata
C) Only input features and timestamps
D) Model weights and gradient updates

Answer: B — Inference tables automatically capture the complete request-response cycle: input features, model predictions, timestamps, request IDs, model version, and latency. This data enables drift detection, debugging, and compliance monitoring. Model internals like weights (D) are not captured.

Question 16 — CI/CD

A team uses Databricks Asset Bundles for deployment. Which command deploys resources to the production target?

A) databricks jobs create --target production
B) databricks bundle deploy --target production
C) databricks repos update --branch main
D) databricks workspace import --target production

Answer: B — Databricks Asset Bundles use databricks bundle deploy --target <target> to deploy resources defined in YAML configuration files. The target (e.g., "production", "staging") is defined in the bundle configuration. This is the recommended infrastructure-as-code approach for Databricks.

Question 17 — Feature Table Management

A team needs to add a new feature column to an existing Feature Store table without losing historical data. What is the correct approach?

A) Drop and recreate the feature table with the new column
B) Use fs.write_table() with the new column included; the table schema evolves automatically
C) Create a new feature table with the additional column and deprecate the old one
D) Use SQL ALTER TABLE to add the column, then write new data

Answer: B — The Feature Store supports schema evolution. When you write a DataFrame with a new column using fs.write_table(), the feature table schema is automatically updated to include the new column. Existing rows will have null values for the new column. This preserves all historical data.

Question 18 — MLflow Experiment Organization

A team of 5 data scientists works on the same churn prediction project. They want all their experiments to be visible to the team but organized by individual. What is the recommended approach?

A) Each scientist creates their own experiment with their name in the path
B) Use a single shared experiment and organize runs using tags (e.g., mlflow.set_tag("scientist", "Alice"))
C) Create separate MLflow tracking servers for each scientist
D) Use Databricks workspace folders with individual permissions

Answer: B — A single shared experiment with tags is the recommended approach for team collaboration. Tags enable filtering and searching runs by scientist, approach, or any other dimension. This keeps all results in one place for easy comparison while maintaining organizational clarity.

Question 19 — Spark UDF Inference

When using mlflow.pyfunc.spark_udf(spark, model_uri), how is the model distributed across the cluster?

A) The model is trained on each worker independently
B) The model is serialized and broadcast to all workers; each worker loads it once for local inference
C) Each worker fetches the model from MLflow independently
D) The model runs only on the driver and results are sent to workers

Answer: B — spark_udf() packages the logged model and makes it available to every worker. Each worker loads the model once, caches it locally, and applies it to the rows in its partitions. This is efficient because the model is transferred once and reused rather than reloaded per row. No training happens at inference time (A), and scoring does not run on the driver (D).

Question 20 — Traffic Splitting

A serving endpoint has two models: champion (v3) with 80% traffic and challenger (v5) with 20% traffic. After validating that v5 performs better, what is the most efficient way to promote v5 to 100% traffic?

A) Delete the endpoint and create a new one with only v5
B) Update the endpoint's traffic config to route 100% to v5 and remove v3
C) Transition v5 to Production stage in the Model Registry
D) Create a new endpoint with v5 and update the DNS

Answer: B — Update the traffic configuration on the existing endpoint to route 100% traffic to v5. This is a configuration change that does not require endpoint recreation. You can then remove v3 from the served entities. This is zero-downtime and the recommended approach.
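
The before and after configurations might look roughly like the dictionaries below. Treat the field names as my understanding of the serving endpoint config shape rather than a verified schema, and the churn_model name and entity names as hypothetical; the point is that promotion is a change to the routes, not a new endpoint.

```python
def traffic_total(config):
    """Sum the traffic percentages across all routes."""
    return sum(r["traffic_percentage"] for r in config["traffic_config"]["routes"])

# Assumed config shape; entity names and the model name are hypothetical.
current_config = {
    "served_entities": [
        {"name": "champion", "entity_name": "churn_model", "entity_version": "3"},
        {"name": "challenger", "entity_name": "churn_model", "entity_version": "5"},
    ],
    "traffic_config": {"routes": [
        {"served_model_name": "champion", "traffic_percentage": 80},
        {"served_model_name": "challenger", "traffic_percentage": 20},
    ]},
}

# After promotion: v5 is the only served entity and receives all traffic.
promoted_config = {
    "served_entities": [
        {"name": "challenger", "entity_name": "churn_model", "entity_version": "5"},
    ],
    "traffic_config": {"routes": [
        {"served_model_name": "challenger", "traffic_percentage": 100},
    ]},
}

# Routes should always add up to 100 percent.
print(traffic_total(current_config), traffic_total(promoted_config))
```

Checking that the percentages sum to 100 before submitting an update is a cheap guard against a rejected or misrouted configuration.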

Question 21 — Online Feature Store

A model served via a Databricks endpoint needs features from the Feature Store at prediction time. The features must be available with sub-100ms latency. What must be configured?

A) Index the offline Delta table for faster queries
B) Publish the feature table to an online store (e.g., DynamoDB or Cosmos DB) and log the model with fs.log_model()
C) Cache the feature table in Spark memory on the serving cluster
D) Pre-compute all possible feature combinations and store in the model artifact

Answer: B — For sub-100ms latency, features must be published to an online store optimized for key-value lookups. The model must be logged with fs.log_model() so the serving endpoint knows which features to look up and from which online store. The offline Delta table is not designed for real-time lookups.

Question 22 — Nested Runs

During hyperparameter tuning, a team wants each trial logged as a child run under a parent run. What is the correct pattern?

A) Call mlflow.start_run() for each trial without nesting
B) Use mlflow.start_run(nested=True) inside the parent run context
C) Use mlflow.start_run(parent_id=parent_run_id)
D) Create separate experiments for the parent and children

Answer: B — mlflow.start_run(nested=True) creates a child run under the currently active parent run. This is the standard pattern for hyperparameter tuning: one parent run for the overall search, and nested child runs for each trial. This keeps the experiment organized and easy to navigate.

Question 23 — Lakehouse Monitoring

A Lakehouse Monitor is configured on an inference table with drift detection. Which type of drift does it detect by default?

A) Only concept drift (change in relationship between features and target)
B) Only data drift (change in input feature distributions)
C) Both data drift and prediction drift, using statistical tests against a baseline
D) Only prediction drift (change in model output distribution)

Answer: C — Lakehouse Monitoring detects both data drift (changes in input feature distributions) and prediction drift (changes in model output distribution) by comparing current data against a baseline window using statistical tests. Concept drift requires ground truth labels and is tracked separately when labels are available.
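
To make "comparing current data against a baseline" concrete, here is one common drift statistic, the population stability index (PSI), in plain Python. This is an illustrative choice and not a reproduction of what Lakehouse Monitoring computes; the psi helper, the bin edges, and the sample values are all assumptions made for the example.

```python
import math

def psi(baseline, current, edges):
    """Population stability index between two samples over shared bin edges."""
    def fractions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                last = i == len(counts) - 1
                # Bins are half-open, except the last one, which includes its upper edge.
                if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                    counts[i] += 1
                    break
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = fractions(baseline), fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
baseline_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
shifted_scores = [0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9, 0.95, 0.95, 1.0]

print(psi(baseline_scores, baseline_scores, edges))  # identical samples give 0.0
print(psi(baseline_scores, shifted_scores, edges))   # a shifted sample gives a positive value
```

The same comparison applies to an input feature (data drift) or to the model's output column (prediction drift); neither needs ground truth labels.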

Question 24 — Workflow Scheduling

An ML workflow needs to: (1) refresh features via DLT, (2) retrain the model, (3) evaluate and conditionally deploy. The feature refresh must complete before training starts, but evaluation depends on training. Which Workflow configuration is correct?

A) Three independent parallel tasks
B) A linear chain: DLT task → training task → evaluation task, with task dependencies
C) DLT and training in parallel, then evaluation
D) A single notebook that runs all three steps sequentially

Answer: B — Databricks Workflows supports task dependencies. Configure the training task to depend on the DLT task, and the evaluation task to depend on the training task. This ensures correct execution order with proper error handling. A single notebook (D) would work but lacks the modularity, retry capabilities, and monitoring of separate tasks.
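
The linear chain can be expressed as tasks with dependencies and checked with a topological sort. The task keys are hypothetical, and the depends_on layout is modeled on how job definitions list upstream tasks; nothing here calls the Jobs API.

```python
from graphlib import TopologicalSorter

# Hypothetical task keys; each task lists the tasks it must wait for.
tasks = [
    {"task_key": "refresh_features", "depends_on": []},
    {"task_key": "train", "depends_on": [{"task_key": "refresh_features"}]},
    {"task_key": "evaluate", "depends_on": [{"task_key": "train"}]},
]

# Map each task to the set of its upstream tasks, then derive a valid run order.
graph = {t["task_key"]: {d["task_key"] for d in t["depends_on"]} for t in tasks}
run_order = list(TopologicalSorter(graph).static_order())
print(run_order)  # ['refresh_features', 'train', 'evaluate']
```

If evaluate depended only on refresh_features, train and evaluate could run in parallel, which is the mistake option C describes in a different arrangement.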

Question 25 — Model Signature

A data scientist logs a model without specifying an input signature. What happens when the model is deployed to a serving endpoint and receives an unexpected column?

A) The endpoint rejects the request with a schema validation error
B) The endpoint passes all columns to the model, which may fail or produce incorrect results
C) The endpoint automatically removes unexpected columns
D) The serving endpoint cannot be created without a model signature

Answer: B — Without a model signature, the serving endpoint does not validate input data against an expected schema. Unexpected columns are passed directly to the model. This can cause errors or silent failures. Best practice is to always log models with an explicit signature using mlflow.models.infer_signature() for input validation at serving time.

Score Your Results

18-25 correct (72-100%): You are likely ready for the real exam. Review any missed questions and schedule your exam.

14-17 correct (56-68%): You are close but need more review. Focus on your weakest domains and retake in a few days.

Below 14 correct (<56%): Go back to the domain lessons and study the areas where you missed the most questions. Do not schedule the real exam until you score 18+ consistently.
