Model Deployment
Master Databricks Model Serving, batch inference, real-time endpoints, and Model Registry lifecycle management — covering approximately 30% of the exam (Model Lifecycle Management domain).
MLflow Model Registry
The Model Registry is a centralized model store that provides model versioning, stage transitions, and approval workflows. It is the bridge between experimentation and production.
Model Registry Concepts
- Registered Model — A named model in the registry (e.g., "churn-predictor")
- Model Version — A specific version of a registered model, linked to an MLflow run
- Stages — Lifecycle stages: None, Staging, Production, Archived
- Aliases — Named references to specific versions (e.g., "champion", "challenger"); the newer approach replacing stages
- Tags — Key-value metadata on models and versions for organization
Registering a Model
import mlflow
# Method 1: Register during logging
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="churn-predictor"
    )
# Method 2: Register an existing run's model
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="churn-predictor"
)
print(f"Version: {result.version}")
Stage Transitions
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Transition a model version to Production
client.transition_model_version_stage(
    name="churn-predictor",
    version=3,
    stage="Production",
    archive_existing_versions=True  # archives the current Production version
)
With aliases, the newer approach, the corresponding step is to point an alias at the version: client.set_registered_model_alias("churn-predictor", "champion", version=3).
Loading Models from Registry
# Load by stage (legacy)
model = mlflow.pyfunc.load_model("models:/churn-predictor/Production")
# Load by alias (recommended)
model = mlflow.pyfunc.load_model("models:/churn-predictor@champion")
# Load by version number
model = mlflow.pyfunc.load_model("models:/churn-predictor/3")
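The three URI forms above differ only in what follows the model name. As a rough illustration, the sketch below classifies a URI string by its shape. The helper `classify_model_uri` is hypothetical and written for this guide; it is not MLflow's own parser and it does not contact the registry.

```python
def classify_model_uri(uri: str) -> dict:
    """Classify a model URI by its shape (illustrative helper, not part of MLflow)."""
    if uri.startswith("runs:/"):
        # runs:/<run_id>/<artifact_path> points at a run artifact, not the registry
        run_id, _, path = uri[len("runs:/"):].partition("/")
        return {"kind": "run", "run_id": run_id, "artifact_path": path}
    if not uri.startswith("models:/"):
        raise ValueError(f"unsupported URI scheme: {uri!r}")
    rest = uri[len("models:/"):]
    if "@" in rest:
        # models:/<name>@<alias>
        name, alias = rest.rsplit("@", 1)
        return {"kind": "alias", "name": name, "alias": alias}
    name, _, suffix = rest.partition("/")
    if suffix.isdigit():
        # models:/<name>/<version>
        return {"kind": "version", "name": name, "version": int(suffix)}
    if suffix.lower() == "latest":
        # models:/<name>/latest
        return {"kind": "latest", "name": name}
    # Anything else is treated as a legacy stage name such as Production
    return {"kind": "stage", "name": name, "stage": suffix}


by_alias = classify_model_uri("models:/churn-predictor@champion")
by_version = classify_model_uri("models:/churn-predictor/3")
by_stage = classify_model_uri("models:/churn-predictor/Production")
```

Only the alias form stays stable when a new version is promoted: callers keep the same URI and the alias is moved to the new version.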
Databricks Model Serving
Model Serving provides managed, real-time REST API endpoints for MLflow models. It handles auto-scaling, load balancing, and infrastructure management.
Creating a Serving Endpoint
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput
)
w = WorkspaceClient()
# Create endpoint with a single served model
w.serving_endpoints.create(
    name="churn-predictor-endpoint",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="churn-predictor",
                entity_version="3",
                workload_size="Small",
                scale_to_zero_enabled=True
            )
        ]
    )
)
Querying the Endpoint
import requests
# Score a single record
# token must hold a valid Databricks access token; replace <workspace-url> with your workspace host
url = "https://<workspace-url>/serving-endpoints/churn-predictor-endpoint/invocations"
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
data = {
    "dataframe_records": [
        {"total_purchases": 15, "avg_session_duration": 320, "days_since_last_visit": 7}
    ]
}
# json=data serializes the payload, so no explicit json.dumps call is needed
response = requests.post(url, headers=headers, json=data)
predictions = response.json()
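The request above uses the dataframe_records layout. The sketch below shows how the same rows could be expressed in the dataframe_split layout, which sends column names once and the values as arrays. The helper `records_to_split` and the second sample row are made up for illustration, and the code only builds payloads without calling any endpoint.

```python
from typing import Any


def records_to_split(records: list[dict[str, Any]]) -> dict[str, list]:
    """Convert a list of row dictionaries into a columns + data structure."""
    if not records:
        return {"columns": [], "data": []}
    # Column order is taken from the first row; every row is assumed to share its keys
    columns = list(records[0].keys())
    data = [[row[col] for col in columns] for row in records]
    return {"columns": columns, "data": data}


records = [
    {"total_purchases": 15, "avg_session_duration": 320, "days_since_last_visit": 7},
    # Second row is illustrative data, not taken from the original example
    {"total_purchases": 2, "avg_session_duration": 45, "days_since_last_visit": 60},
]

payload_records = {"dataframe_records": records}
payload_split = {"dataframe_split": records_to_split(records)}
```

With many rows, dataframe_split avoids repeating the column names in every record, which keeps the request body smaller.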
The endpoint accepts input in two formats: dataframe_records (a list of row dictionaries) and dataframe_split (separate columns and data arrays). Know the difference. Also note that scale_to_zero_enabled=True means the endpoint can scale down to zero replicas when idle, reducing cost but adding cold-start latency.
Traffic Splitting (A/B Testing)
You can route traffic between multiple model versions for A/B testing:
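from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    Route,
    ServedEntityInput,
    TrafficConfig,
)
# Two served entities on one endpoint: 90% of requests go to version 3, 10% to version 4
config = EndpointCoreConfigInput(
    served_entities=[
        ServedEntityInput(
            name="champion",
            entity_name="churn-predictor",
            entity_version="3",
            workload_size="Small",
            scale_to_zero_enabled=False
        ),
        ServedEntityInput(
            name="challenger",
            entity_name="churn-predictor",
            entity_version="4",
            workload_size="Small",
            scale_to_zero_enabled=False
        )
    ],
    traffic_config=TrafficConfig(
        routes=[
            # served_model_name must match the name of a served entity above
            Route(served_model_name="champion", traffic_percentage=90),
            Route(served_model_name="challenger", traffic_percentage=10)
        ]
    )
)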
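To build intuition for what a 90/10 split means in practice, the following sketch simulates weighted routing locally. It is not how Databricks routes requests internally; `validate_routes` and `pick_served_entity` are hypothetical helpers, and the rule that percentages must sum to 100 is an assumption stated in the code.

```python
import random
from collections import Counter


def validate_routes(routes: dict[str, int]) -> None:
    """Check that the percentages form a complete split (assumed to require a total of 100)."""
    total = sum(routes.values())
    if total != 100:
        raise ValueError(f"traffic percentages must sum to 100, got {total}")


def pick_served_entity(routes: dict[str, int], rng: random.Random) -> str:
    """Pick one served entity name at random, weighted by its traffic percentage."""
    names = list(routes)
    weights = [routes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


routes = {"champion": 90, "challenger": 10}
validate_routes(routes)

rng = random.Random(0)  # fixed seed so the simulation is repeatable
counts = Counter(pick_served_entity(routes, rng) for _ in range(10_000))
champion_share = counts["champion"] / sum(counts.values())
```

Over many requests the observed share approaches the configured percentage, but any small window of requests can deviate from it, which matters when comparing the challenger on low traffic.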
Batch Inference
For scoring large datasets, batch inference is more cost-effective than real-time serving:
Using MLflow pyfunc with Spark
import mlflow
# Load model as a Spark UDF
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/churn-predictor@champion"
)
# Apply to a Spark DataFrame
predictions_df = input_df.withColumn(
    "prediction",
    predict_udf("total_purchases", "avg_session_duration", "days_since_last_visit")
)
# Write predictions to Delta table
predictions_df.write.format("delta").mode("overwrite").saveAsTable("predictions.churn_scores")
mlflow.pyfunc.spark_udf() converts any MLflow model into a Spark UDF for distributed batch inference. This is the recommended approach for scoring large datasets. The model is serialized and broadcast to each Spark worker. Know that it requires the spark session and a model URI.
Feature Store Batch Scoring
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
# Score with automatic feature lookup
predictions = fs.score_batch(
    model_uri="models:/churn-predictor@champion",
    df=new_customers_df  # only needs primary key columns
)
# Feature Store automatically joins the required features
Model Monitoring
After deployment, monitor model performance to detect drift and degradation:
- Databricks Lakehouse Monitoring — Automated profiling and drift detection for Delta tables
- Inference tables — Automatically log all prediction requests and responses from serving endpoints
- Data drift — Statistical comparison of input feature distributions between training and serving data
- Model quality — Compare prediction accuracy against ground truth labels (when available)
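As one concrete way to compare a feature's training and serving distributions, the sketch below computes the Population Stability Index (PSI) in plain Python. PSI is chosen here only for illustration; it is not necessarily the statistic Lakehouse Monitoring reports, and the function name and sample data are assumptions made for this example.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a newer sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when the baseline is constant

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) // width)
            # Values outside the baseline range are clamped into the edge buckets
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty buckets
        return [max(c / len(values), 1e-6) for c in counts]

    base = bucket_fractions(expected)
    new = bucket_fractions(actual)
    return sum((n - b) * math.log(n / b) for b, n in zip(base, new))


# Synthetic data: a uniform baseline and a copy shifted upward by 3.0
training = [i / 100 for i in range(1000)]
serving_same = list(training)
serving_shifted = [v + 3.0 for v in training]

psi_same = population_stability_index(training, serving_same)
psi_shifted = population_stability_index(training, serving_shifted)
```

An unchanged distribution yields a PSI of zero, and the value grows as serving data moves away from the training baseline. In practice the serving sample would come from logged requests such as an inference table.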
Practice Questions
Question 1 — Model Registry
A) client.transition_model_version_stage("model", version=5, stage="Production")
B) client.transition_model_version_stage("model", version=5, stage="Production", archive_existing_versions=True)
C) client.promote_model_version("model", version=5)
D) client.set_registered_model_alias("model", "Production", version=5)
Answer: B — The archive_existing_versions=True parameter automatically moves the current Production version (v3) to Archived when the new version (v5) is promoted. Without this parameter, both versions would be in Production simultaneously. Option C does not exist. Option D uses aliases, which is valid but uses a different paradigm than stages.
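The difference between options A and B can be seen in a small in-memory simulation of the behaviour described in the answer. This is a sketch only: `transition_stage` is a hypothetical function that mimics the described semantics on a plain dictionary and does not call MLflow, and version 4 in Staging is added purely for illustration.

```python
def transition_stage(
    versions: dict[int, str],
    version: int,
    stage: str,
    archive_existing_versions: bool = False,
) -> dict[int, str]:
    """Return a copy of the version-to-stage mapping after one transition."""
    updated = dict(versions)
    if archive_existing_versions:
        # Move every other version currently in the target stage to Archived
        for other, current in updated.items():
            if current == stage and other != version:
                updated[other] = "Archived"
    updated[version] = stage
    return updated


stages = {3: "Production", 4: "Staging", 5: "None"}

without_archive = transition_stage(stages, 5, "Production")
with_archive = transition_stage(stages, 5, "Production", archive_existing_versions=True)
```

Without the flag, versions 3 and 5 both end up in Production; with it, version 3 is archived and only version 5 remains in Production.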
Question 2 — Model Serving
A Model Serving endpoint is configured with scale_to_zero_enabled=True. A request arrives after 30 minutes of inactivity. What happens?
A) The request fails with a timeout error
B) The endpoint scales up from zero, causing a cold-start delay, then serves the request
C) The request is queued indefinitely until the endpoint is manually restarted
D) The endpoint is deleted after inactivity and must be recreated
Answer: B — When scale_to_zero_enabled=True, the endpoint reduces to zero replicas during inactivity to save costs. When a new request arrives, it automatically scales up, which introduces cold-start latency (typically 30-120 seconds). The request is served after the endpoint is ready. This is ideal for intermittent workloads where cost matters more than latency.
Question 3 — Batch Inference
A) Load the model with mlflow.pyfunc.load_model() and iterate through rows on the driver
B) Send all rows to the Model Serving endpoint using the REST API
C) Use mlflow.pyfunc.spark_udf() and apply it to a Spark DataFrame
D) Export the data to CSV and score it locally
Answer: C — mlflow.pyfunc.spark_udf() distributes the scoring across the entire Spark cluster. The model is broadcast to each worker, and predictions are computed in parallel. Option A runs on a single node and cannot handle 100M rows efficiently. Option B is for real-time, not batch. Option D is impractical at this scale.
Question 4 — Traffic Splitting
A) Create two separate endpoints and use a load balancer
B) Configure traffic routes on a single endpoint with two served entities and 90/10 split
C) Deploy the new version to Staging and the old to Production in the Model Registry
D) Use feature flags in the application code to route requests
Answer: B — Databricks Model Serving natively supports traffic splitting with multiple served entities on a single endpoint. Configure two served models with traffic_percentage values of 90 and 10. This is the built-in A/B testing mechanism. No external load balancer or application-level routing is needed.
Question 5 — Model Loading
A) models:/churn-predictor/Production
B) models:/churn-predictor@champion
C) models:/churn-predictor/latest
D) runs:/<run_id>/model
Answer: B — The @alias syntax (e.g., models:/model-name@champion) is the newer approach that uses model aliases instead of stage-based references. Option A uses the legacy stage-based URI. Option C points at the latest registered version rather than a named alias. Option D loads from a specific run, not the registry.