Model Deployment with MLflow

Model deployment accounts for roughly 20% of the certification exam. This lesson walks through serving models locally as REST APIs, batch scoring with pyfunc, containerizing with Docker, and deploying to cloud platforms, and closes with practice questions.

Local Model Serving

MLflow can serve any logged model as a REST API endpoint with a single CLI command. This is the most common deployment method tested on the exam.

# Serve a model from a run as a REST API
# mlflow models serve --model-uri runs:/<run_id>/model --port 5001

# Serve a model from the Model Registry
# mlflow models serve --model-uri models:/fraud-detector/Production --port 5001

# Serve with specific environment manager
# mlflow models serve --model-uri models:/fraud-detector/1 --env-manager conda

# The server exposes two endpoints:
# POST /invocations - Make predictions
# GET  /ping        - Health check

# Making predictions with curl:
# curl -X POST http://localhost:5001/invocations \
#   -H "Content-Type: application/json" \
#   -d '{"dataframe_split": {"columns": ["f1", "f2"], "data": [[1.0, 2.0], [3.0, 4.0]]}}'

# Alternative input formats:
input_formats = {
    "dataframe_split": {
        "description": "Pandas DataFrame in split orientation (recommended)",
        "content_type": "application/json",
        "example": {
            "dataframe_split": {
                "columns": ["feature1", "feature2"],
                "data": [[1.0, 2.0], [3.0, 4.0]]
            }
        }
    },
    "dataframe_records": {
        "description": "Pandas DataFrame in records orientation",
        "content_type": "application/json",
        "example": {
            "dataframe_records": [
                {"feature1": 1.0, "feature2": 2.0},
                {"feature1": 3.0, "feature2": 4.0}
            ]
        }
    },
    "instances": {
        "description": "TensorFlow Serving compatible format",
        "content_type": "application/json",
        "example": {
            "instances": [[1.0, 2.0], [3.0, 4.0]]
        }
    },
    "csv": {
        "description": "CSV format input",
        "content_type": "text/csv",
        "example": "feature1,feature2\n1.0,2.0\n3.0,4.0"
    }
}

# EXAM TIP: Know the input format names and Content-Type headers
# dataframe_split is the recommended JSON format
# The /invocations endpoint handles predictions
# The /ping endpoint is for health checks
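Rather than writing the JSON body by hand as in the curl example above, the dataframe_split payload can be built directly from a pandas DataFrame. A minimal sketch (the column names f1/f2 are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"f1": [1.0, 3.0], "f2": [2.0, 4.0]})

# Pandas' "split" orientation also emits an "index" key, which the
# serving endpoint does not expect, so keep only columns and data
split = df.to_dict(orient="split")
payload = {"dataframe_split": {"columns": split["columns"],
                               "data": split["data"]}}
```

The resulting dict can be passed straight to `requests.post(..., json=payload)` against the /invocations endpoint.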

Batch Scoring with PyFunc

For batch inference (scoring large datasets), load the model as a pyfunc and call .predict() directly without a server.

import mlflow
import pandas as pd

# Load model as pyfunc for batch scoring
model = mlflow.pyfunc.load_model("models:/fraud-detector/Production")

# Score a pandas DataFrame
input_df = pd.DataFrame({
    "amount": [100.0, 5000.0, 25.0],
    "merchant_category": ["grocery", "electronics", "gas"],
    "hour_of_day": [14, 2, 8]
})
predictions = model.predict(input_df)

# Score a large dataset in chunks
import numpy as np

large_dataset = pd.read_csv("transactions.csv")
chunk_size = 10000
all_predictions = []

for i in range(0, len(large_dataset), chunk_size):
    chunk = large_dataset.iloc[i:i+chunk_size]
    preds = model.predict(chunk)
    all_predictions.extend(preds)

# Score with Spark (for distributed batch scoring)
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.parquet("transactions.parquet")

# Create a Spark UDF from the model and apply it to the feature columns
from pyspark.sql.functions import struct

predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/fraud-detector/Production")
result_df = spark_df.withColumn("prediction", predict_udf(struct(*spark_df.columns)))

# EXAM TIP: Batch scoring does NOT require a running server
# Use mlflow.pyfunc.load_model() + .predict() for batch
# Use mlflow.pyfunc.spark_udf() for distributed Spark scoring

Docker Deployment

MLflow can build Docker images from logged models, making it easy to deploy models in containerized environments.

# Build a Docker image from a model
# mlflow models build-docker --model-uri runs:/<run_id>/model --name my-model-image

# Build from the Model Registry
# mlflow models build-docker --model-uri models:/fraud-detector/Production --name fraud-model

# Run the Docker container
# docker run -p 5001:8080 my-model-image

# The container exposes:
# - Port 8080 (internal) mapped to your chosen port
# - /invocations endpoint for predictions
# - /ping endpoint for health checks

# Build with custom environment manager
# mlflow models build-docker \
#   --model-uri models:/fraud-detector/1 \
#   --name fraud-model \
#   --env-manager conda

# Docker image contents:
docker_image_contents = {
    "model_artifacts": "The serialized model and its dependencies",
    "mlflow_serving": "MLflow's model serving infrastructure",
    "environment": "Python environment with all required packages",
    "entrypoint": "Starts the model server on port 8080"
}

# EXAM TIP: mlflow models build-docker creates a self-contained image
# The container serves on port 8080 by default
# Same /invocations and /ping endpoints as local serving
# No need to install MLflow or dependencies on the host
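Once the container is up, the same two endpoints can be smoke-tested from Python. A sketch, assuming the `-p 5001:8080` mapping shown above (the helper names are illustrative, not part of MLflow):

```python
import requests

BASE_URL = "http://localhost:5001"  # host port mapped to the container's 8080

def is_healthy(base_url=BASE_URL):
    """GET /ping returns HTTP 200 once the model server is up."""
    try:
        return requests.get(f"{base_url}/ping", timeout=5).status_code == 200
    except requests.RequestException:
        return False

def predict(columns, rows, base_url=BASE_URL):
    """POST a dataframe_split payload to /invocations and return predictions."""
    payload = {"dataframe_split": {"columns": columns, "data": rows}}
    resp = requests.post(f"{base_url}/invocations", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["predictions"]
```

Because the container uses the same endpoints as local serving, this check works unchanged for `mlflow models serve`.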

Cloud Deployment

MLflow supports deployment to major cloud platforms through built-in plugins and the deployments API.

# AWS SageMaker deployment
# (mlflow.sagemaker.deploy is the MLflow 1.x API; MLflow 2.x moves this
#  functionality to the deployments client shown further below)
import mlflow.sagemaker

mlflow.sagemaker.deploy(
    app_name="fraud-detector-endpoint",
    model_uri="models:/fraud-detector/Production",
    region_name="us-east-1",
    mode="create",             # "create", "replace", or "add"
    instance_type="ml.m5.large",
    instance_count=1
)

# Azure ML deployment
# Requires azureml-mlflow package
# mlflow.set_tracking_uri(azureml_tracking_uri)
# Models registered in Azure ML can be deployed to:
# - Azure Container Instances (ACI) for testing
# - Azure Kubernetes Service (AKS) for production

# Databricks Model Serving
# Models registered in Databricks can be served with:
# - Databricks Model Serving (managed endpoint)
# - Serverless Real-Time Inference
# Configured through the Databricks UI or REST API

# Generic deployments API (plugin-based)
from mlflow.deployments import get_deploy_client

client = get_deploy_client("sagemaker")  # or "azureml", etc.
client.create_deployment(
    name="my-deployment",
    model_uri="models:/fraud-detector/Production",
    config={"instance_type": "ml.m5.large"}
)

# List deployments
deployments = client.list_deployments()

# Get deployment status
status = client.get_deployment("my-deployment")

# Update deployment
client.update_deployment(
    name="my-deployment",
    model_uri="models:/fraud-detector/2"  # New version
)

# Delete deployment
client.delete_deployment("my-deployment")

# EXAM TIP: Know the deployments API pattern:
# get_deploy_client() -> create/list/get/update/delete
# Know SageMaker deploy modes: create, replace, add
# Databricks Model Serving is the managed option on Databricks

REST API Input/Output Formats

Understanding the exact JSON input and output formats is critical for the exam. The serving endpoint expects specific data structures.

# Input format: dataframe_split (RECOMMENDED)
# Most commonly tested format on the exam
request_split = {
    "dataframe_split": {
        "columns": ["age", "income", "credit_score"],
        "data": [
            [25, 50000, 720],
            [45, 120000, 680],
            [33, 75000, 750]
        ]
    }
}

# Input format: dataframe_records
request_records = {
    "dataframe_records": [
        {"age": 25, "income": 50000, "credit_score": 720},
        {"age": 45, "income": 120000, "credit_score": 680}
    ]
}

# Input format: instances (TF Serving compatible)
request_instances = {
    "instances": [
        [25, 50000, 720],
        [45, 120000, 680]
    ]
}

# Output format: JSON array of predictions
response = {
    "predictions": [0, 1, 0]
}

# Making a request with Python requests library
import requests
import json

url = "http://localhost:5001/invocations"
headers = {"Content-Type": "application/json"}
response = requests.post(url, json=request_split, headers=headers)
predictions = response.json()

# EXAM TIP: dataframe_split is the preferred input format
# Content-Type must match: application/json for JSON, text/csv for CSV
# Output is always a JSON object with "predictions" key
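The text/csv variant mentioned in the tip can be built from the same DataFrame; a sketch (the POST line is commented out because it needs a running server):

```python
import pandas as pd

df = pd.DataFrame({"feature1": [1.0, 3.0], "feature2": [2.0, 4.0]})

# CSV payload: header row followed by one line per record
csv_payload = df.to_csv(index=False)
headers = {"Content-Type": "text/csv"}

# import requests
# requests.post("http://localhost:5001/invocations",
#               data=csv_payload, headers=headers)
```

Note that CSV input is sent as a raw string body (`data=`), not as JSON (`json=`), and the Content-Type header must be set explicitly.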

Practice Questions

Test your understanding of MLflow model deployment with these exam-style questions.

Question 1

Which CLI command serves an MLflow model as a local REST API?

A) mlflow serve --model runs:/abc/model

B) mlflow models serve --model-uri runs:/abc/model

C) mlflow deploy --model-uri runs:/abc/model

D) mlflow run --serve runs:/abc/model

Show Answer

B) mlflow models serve --model-uri runs:/abc/model — The correct command is mlflow models serve with the --model-uri flag. It starts a local Flask server with /invocations and /ping endpoints.

Question 2

What is the recommended JSON input format for the MLflow model serving endpoint?

A) {"inputs": [...]}

B) {"data": [...]}

C) {"dataframe_split": {"columns": [...], "data": [...]}}

D) {"features": [...]}

Show Answer

C) — The dataframe_split format is the recommended input format. It includes column names and data arrays separately, mapping to a pandas DataFrame in split orientation.

Question 3

How do you perform batch scoring without starting a serving endpoint?

A) mlflow.batch.score(model_uri, data)

B) mlflow.pyfunc.load_model(uri).predict(data)

C) mlflow.models.predict(model_uri, data)

D) mlflow.score_batch(model_uri, data)

Show Answer

B) — Load the model with mlflow.pyfunc.load_model() and call .predict() directly. This runs inference in-process without any server. For distributed batch scoring, use mlflow.pyfunc.spark_udf().

Question 4

What default port does an MLflow Docker container expose for serving?

A) 5000

B) 5001

C) 8080

D) 8888

Show Answer

C) 8080 — MLflow Docker containers expose port 8080 by default. When running the container, you map this to your desired external port (e.g., docker run -p 5001:8080 my-model).

Question 5

Which two endpoints does the MLflow model serving server expose?

A) /predict and /health

B) /invocations and /ping

C) /score and /status

D) /inference and /ready

Show Answer

B) /invocations and /ping. The /invocations endpoint accepts POST requests with input data and returns predictions; /ping is a GET endpoint for health checks. These are the same for both local serving and Docker containers.

Question 6

How do you create a Spark UDF from an MLflow model for distributed batch scoring?

A) mlflow.spark.create_udf(model_uri)

B) mlflow.pyfunc.spark_udf(spark, model_uri)

C) mlflow.deployments.spark_udf(model_uri)

D) spark.udf.register("model", model_uri)

Show Answer

B) mlflow.pyfunc.spark_udf(spark, model_uri) — This creates a Spark UDF that wraps the MLflow model. You pass the SparkSession and model URI. The UDF can then be used in Spark DataFrame operations for distributed inference.

Key Takeaways

💡
  • mlflow models serve starts a local REST API with /invocations and /ping endpoints
  • Use dataframe_split as the recommended JSON input format for serving
  • For batch scoring, load with mlflow.pyfunc.load_model() and call .predict() directly
  • mlflow models build-docker creates self-contained Docker images serving on port 8080
  • Use mlflow.pyfunc.spark_udf() for distributed batch scoring with Spark
  • Cloud deployment uses the deployments API: get_deploy_client() with create/update/delete methods