Intermediate

Model Deployment

Master Databricks Model Serving, batch inference, real-time endpoints, and Model Registry lifecycle management — covering approximately 30% of the exam (Model Lifecycle Management domain).

MLflow Model Registry

The Model Registry is a centralized model store that provides model versioning, stage transitions, and approval workflows. It is the bridge between experimentation and production.

Model Registry Concepts

  • Registered Model — A named model in the registry (e.g., "churn-predictor")
  • Model Version — A specific version of a registered model, linked to an MLflow run
  • Stages — Lifecycle stages: None, Staging, Production, Archived
  • Aliases — Named references to specific versions (e.g., "champion", "challenger") — the newer approach replacing stages
  • Tags — Key-value metadata on models and versions for organization

Registering a Model

import mlflow

# Method 1: Register during logging
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="churn-predictor"
    )

# Method 2: Register an existing run's model
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="churn-predictor"
)
print(f"Version: {result.version}")

Stage Transitions

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Transition a model version to Production
client.transition_model_version_stage(
    name="churn-predictor",
    version=3,
    stage="Production",
    archive_existing_versions=True  # archives current Production version
)
💡
Exam tip: Databricks is transitioning from stages (None/Staging/Production/Archived) to aliases (arbitrary named references like "champion"/"challenger"). Know both approaches. With aliases, you use client.set_registered_model_alias("churn-predictor", "champion", version=3).
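
To make the alias idea concrete, here is a toy in-memory sketch. ToyRegistry, set_alias, and resolve are hypothetical names invented for illustration; this is not MLflow code and it only mimics the behaviour that an alias points at exactly one version and can be moved.

```python
class ToyRegistry:
    """In-memory stand-in for a registered model's alias table (not MLflow)."""

    def __init__(self, versions):
        self.versions = set(versions)
        self.aliases = {}  # alias name -> version number

    def set_alias(self, alias, version):
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        # Reassigning an existing alias simply moves the pointer.
        self.aliases[alias] = version

    def resolve(self, alias):
        return self.aliases[alias]


registry = ToyRegistry(versions=[3, 4])
registry.set_alias("champion", 3)
registry.set_alias("challenger", 4)

before = registry.resolve("champion")
registry.set_alias("champion", 4)  # promote the challenger
after = registry.resolve("champion")
print(before, after)
```

Consumers that load by alias pick up the new version without any change to their own code, which is the main practical difference from pinning a version number.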

Loading Models from Registry

# Load by stage (legacy)
model = mlflow.pyfunc.load_model("models:/churn-predictor/Production")

# Load by alias (recommended)
model = mlflow.pyfunc.load_model("models:/churn-predictor@champion")

# Load by version number
model = mlflow.pyfunc.load_model("models:/churn-predictor/3")
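
The three URI forms above differ only in their suffix. The following is a minimal sketch of a helper that builds them; registry_uri is a hypothetical function introduced here, not part of MLflow.

```python
from typing import Optional, Union


def registry_uri(name: str,
                 version: Optional[Union[int, str]] = None,
                 stage: Optional[str] = None,
                 alias: Optional[str] = None) -> str:
    """Build a models:/ URI from exactly one of version, stage, or alias."""
    chosen = [x for x in (version, stage, alias) if x is not None]
    if len(chosen) != 1:
        raise ValueError("pass exactly one of version, stage, or alias")
    if alias is not None:
        return f"models:/{name}@{alias}"  # alias form
    # Version and stage forms share the same slash-separated layout.
    suffix = version if version is not None else stage
    return f"models:/{name}/{suffix}"


print(registry_uri("churn-predictor", alias="champion"))
print(registry_uri("churn-predictor", version=3))
print(registry_uri("churn-predictor", stage="Production"))
```

The resulting string can be passed to mlflow.pyfunc.load_model() exactly as in the snippets above.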

Databricks Model Serving

Model Serving provides managed, real-time REST API endpoints for MLflow models. It handles auto-scaling, load balancing, and infrastructure management.

Creating a Serving Endpoint

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput
)

w = WorkspaceClient()

# Create endpoint with a single served model
w.serving_endpoints.create(
    name="churn-predictor-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="churn-predictor",
                entity_version="3",
                workload_size="Small",
                scale_to_zero_enabled=True
            )
        ]
    )
)

Querying the Endpoint

import requests

# `token` is a Databricks access token supplied by the caller
# (for example, read from an environment variable or a secret scope).

# Score a single record
url = "https://<workspace-url>/serving-endpoints/churn-predictor-endpoint/invocations"
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

data = {
    "dataframe_records": [
        {"total_purchases": 15, "avg_session_duration": 320, "days_since_last_visit": 7}
    ]
}

response = requests.post(url, headers=headers, json=data)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body
predictions = response.json()
Exam concept: For DataFrame-style input, Model Serving endpoints accept two formats: dataframe_records (a list of row dictionaries) and dataframe_split (a columns list plus a data array of rows). Know the difference. Tensor-style input uses the instances or inputs keys instead. Also note that scale_to_zero_enabled=True lets the endpoint scale down to zero replicas when idle, which reduces cost but adds cold-start latency on the next request.
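
The difference between dataframe_records and dataframe_split is easiest to see side by side. The sketch below converts one into the other using only the standard library; to_dataframe_split is a hypothetical helper name, and the second record is made-up illustrative data.

```python
import json


def to_dataframe_split(records):
    """Convert a list of row dicts into the columns + data layout."""
    if not records:
        return {"dataframe_split": {"columns": [], "data": []}}
    columns = list(records[0].keys())
    # Each row becomes a list of values ordered like `columns`.
    data = [[row[col] for col in columns] for row in records]
    return {"dataframe_split": {"columns": columns, "data": data}}


records = [
    {"total_purchases": 15, "avg_session_duration": 320, "days_since_last_visit": 7},
    {"total_purchases": 2, "avg_session_duration": 45, "days_since_last_visit": 60},
]

records_payload = {"dataframe_records": records}
split_payload = to_dataframe_split(records)

print(json.dumps(records_payload, indent=2))
print(json.dumps(split_payload, indent=2))
```

The split layout states the column names once instead of repeating them in every row, so it tends to produce smaller request bodies when many rows are sent together.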

Traffic Splitting (A/B Testing)

You can route traffic between multiple model versions for A/B testing:

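from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    Route,
    ServedEntityInput,
    TrafficConfig
)

config = EndpointCoreConfigInput(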
    served_entities=[
        ServedEntityInput(
            name="champion",
            entity_name="churn-predictor",
            entity_version="3",
            workload_size="Small",
            scale_to_zero_enabled=False
        ),
        ServedEntityInput(
            name="challenger",
            entity_name="churn-predictor",
            entity_version="4",
            workload_size="Small",
            scale_to_zero_enabled=False
        )
    ],
    traffic_config=TrafficConfig(
        routes=[
            Route(served_model_name="champion", traffic_percentage=90),
            Route(served_model_name="challenger", traffic_percentage=10)
        ]
    )
)
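
Before applying a split like this, it can help to sanity-check the percentages and see what a 90/10 split looks like over many requests. The sketch below is a local simulation only, assuming the percentages are expected to sum to 100; validate_routes and simulate are hypothetical helpers, and this is not how Model Serving routes requests internally.

```python
import random
from collections import Counter


def validate_routes(routes):
    """Check that route percentages are non-negative and sum to 100."""
    total = sum(routes.values())
    if any(p < 0 for p in routes.values()) or total != 100:
        raise ValueError(
            f"traffic percentages must be non-negative and sum to 100, got {total}"
        )


def simulate(routes, n_requests, seed=0):
    """Assign each simulated request to a served entity by weight."""
    validate_routes(routes)
    rng = random.Random(seed)  # seeded so the run is repeatable
    names = list(routes)
    weights = [routes[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n_requests))


counts = simulate({"champion": 90, "challenger": 10}, n_requests=10_000)
print(counts)
```

With only 10% of traffic, the challenger accumulates evidence slowly, so the comparison needs to run long enough to collect a meaningful number of challenger predictions.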

Batch Inference

For scoring large datasets, batch inference is usually more cost-effective than real-time serving.

Using MLflow pyfunc with Spark

import mlflow

# Load model as a Spark UDF
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/churn-predictor@champion"
)

# Apply to a Spark DataFrame
predictions_df = input_df.withColumn(
    "prediction",
    predict_udf("total_purchases", "avg_session_duration", "days_since_last_visit")
)

# Write predictions to Delta table
predictions_df.write.format("delta").mode("overwrite").saveAsTable("predictions.churn_scores")
💡
Exam tip: mlflow.pyfunc.spark_udf() converts any MLflow model into a Spark UDF for distributed batch inference. This is the recommended approach for scoring large datasets. The model is serialized and broadcast to each Spark worker. Know that it requires the spark session and a model URI.

Feature Store Batch Scoring

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Score with automatic feature lookup
predictions = fs.score_batch(
    model_uri="models:/churn-predictor@champion",
    df=new_customers_df  # only needs primary key columns
)
# Feature Store automatically joins the required features, using the
# feature lookups recorded when the model was logged with fs.log_model()
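
Conceptually, batch scoring with feature lookup is a join on the primary key followed by a prediction. The toy below illustrates that idea with plain dictionaries; feature_table, toy_model, and score_with_lookup are hypothetical stand-ins with made-up data, not the Feature Store implementation.

```python
# Hypothetical feature table keyed by customer_id.
feature_table = {
    101: {"total_purchases": 15, "avg_session_duration": 320, "days_since_last_visit": 7},
    102: {"total_purchases": 1, "avg_session_duration": 40, "days_since_last_visit": 90},
}


def toy_model(features):
    """Stand-in for a trained model: flags customers inactive for over 30 days."""
    return int(features["days_since_last_visit"] > 30)


def score_with_lookup(keys, table, model):
    """Join features by primary key, then score each joined row."""
    scored = []
    for key in keys:
        features = table[key]  # the lookup the caller does not have to write
        scored.append({"customer_id": key, **features, "prediction": model(features)})
    return scored


# The caller supplies only primary keys, as with new_customers_df above.
rows = score_with_lookup([101, 102], feature_table, toy_model)
for row in rows:
    print(row)
```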

Model Monitoring

After deployment, monitor model performance to detect drift and degradation:

  • Databricks Lakehouse Monitoring — Automated profiling and drift detection for Delta tables
  • Inference tables — Automatically log all prediction requests and responses from serving endpoints
  • Data drift — Statistical comparison of input feature distributions between training and serving data
  • Model quality — Compare prediction accuracy against ground truth labels (when available)
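
As an illustration of the data drift bullet, the sketch below computes the Population Stability Index (PSI) between a training sample and a serving sample in plain Python. PSI is one common drift statistic chosen here as an example; the psi function, its binning scheme, and the synthetic data are assumptions for this sketch and are not taken from Lakehouse Monitoring.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the training range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic data: a uniform training feature and a shifted serving copy.
training = [float(x % 50) for x in range(1000)]
serving_same = list(training)
serving_shifted = [x + 20.0 for x in training]

print(psi(training, serving_same))
print(psi(training, serving_shifted))
```

Identical distributions give a PSI of zero, and the value grows as the serving distribution moves away from the training one. Any alerting threshold is a judgment call that should be tuned per feature.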

Practice Questions


Question 1 — Model Registry

A team has version 3 in Production and wants to promote version 5 to Production while automatically archiving version 3. Which code is correct?

A) client.transition_model_version_stage("model", version=5, stage="Production")
B) client.transition_model_version_stage("model", version=5, stage="Production", archive_existing_versions=True)
C) client.promote_model_version("model", version=5)
D) client.set_registered_model_alias("model", "Production", version=5)

Answer: B — The archive_existing_versions=True parameter automatically moves the current Production version (v3) to Archived when the new version (v5) is promoted. Without this parameter, both versions would be in Production simultaneously, which is what Option A does. Option C does not exist. Option D sets an alias named "Production", which is a different mechanism from stages and does not archive version 3.

Question 2 — Model Serving

A serving endpoint has scale_to_zero_enabled=True. A request arrives after 30 minutes of inactivity. What happens?

A) The request fails with a timeout error
B) The endpoint scales up from zero, causing a cold-start delay, then serves the request
C) The request is queued indefinitely until the endpoint is manually restarted
D) The endpoint is deleted after inactivity and must be recreated

Answer: B — When scale_to_zero_enabled=True, the endpoint reduces to zero replicas during inactivity to save costs. When a new request arrives, it automatically scales up, which introduces cold-start latency (typically 30-120 seconds). The request is served after the endpoint is ready. This is ideal for intermittent workloads where cost matters more than latency.

Question 3 — Batch Inference

A data engineer needs to score 100 million rows using a registered MLflow model. Which approach is most efficient on Databricks?

A) Load the model with mlflow.pyfunc.load_model() and iterate through rows on the driver
B) Send all rows to the Model Serving endpoint using the REST API
C) Use mlflow.pyfunc.spark_udf() and apply it to a Spark DataFrame
D) Export the data to CSV and score it locally

Answer: C — mlflow.pyfunc.spark_udf() distributes the scoring across the entire Spark cluster. The model is broadcast to each worker, and predictions are computed in parallel. Option A runs on a single node and cannot handle 100M rows efficiently. Option B is for real-time, not batch. Option D is impractical at this scale.

Question 4 — Traffic Splitting

A team wants to test a new model version by sending 10% of traffic to it while 90% goes to the current champion. How should they configure the serving endpoint?

A) Create two separate endpoints and use a load balancer
B) Configure traffic routes on a single endpoint with two served entities and 90/10 split
C) Deploy the new version to Staging and the old to Production in the Model Registry
D) Use feature flags in the application code to route requests

Answer: B — Databricks Model Serving natively supports traffic splitting with multiple served entities on a single endpoint. Configure two served models with traffic_percentage values of 90 and 10. This is the built-in A/B testing mechanism. No external load balancer or application-level routing is needed.

Question 5 — Model Loading

Which model URI format uses the newer alias approach (instead of stages) to load the production model?

A) models:/churn-predictor/Production
B) models:/churn-predictor@champion
C) models:/churn-predictor/latest
D) runs:/<run_id>/model

Answer: B — The @alias syntax (e.g., models:/model-name@champion) is the newer approach that uses model aliases instead of stage-based references. Option A uses the legacy stage-based URI. Option C is a version-style URI: in MLflow, models:/<name>/latest resolves to the most recent version rather than to an alias. Option D loads from a specific run, not the registry.