Advanced

MLflow Deployment

Deploy MLflow models to production — as REST APIs, Docker containers, on Kubernetes, or in the cloud.

Serving Models Locally

Shell — MLflow models serve
# Serve a model from a run (--env-manager local replaces the deprecated --no-conda)
mlflow models serve -m "runs:/abc123/model" -p 5001 --env-manager local

# Serve a model from the registry
mlflow models serve -m "models:/churn-predictor/Production" -p 5001

# Serve with specific workers
mlflow models serve -m "models:/churn-predictor/1" -p 5001 --workers 4

REST API Endpoint

MLflow serves models with a standard REST API:

Shell — Making predictions via REST API
# JSON input format
curl -X POST http://localhost:5001/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "dataframe_split": {
      "columns": ["age", "income", "tenure", "num_products"],
      "data": [[35, 75000, 24, 3], [28, 45000, 6, 1]]
    }
  }'

# CSV input format
curl -X POST http://localhost:5001/invocations \
  -H "Content-Type: text/csv" \
  -d 'age,income,tenure,num_products
35,75000,24,3
28,45000,6,1'

# Health check
curl http://localhost:5001/health
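
The same requests can be made from Python. A minimal sketch using requests, assuming the server started above is running on localhost:5001 (MLflow 2.x wraps the result in a "predictions" key; 1.x returned a bare list):

Python — Calling the invocations endpoint
import requests

payload = {
    "dataframe_split": {
        "columns": ["age", "income", "tenure", "num_products"],
        "data": [[35, 75000, 24, 3], [28, 45000, 6, 1]],
    }
}

# POST to the scoring endpoint; requests sets Content-Type: application/json
resp = requests.post("http://localhost:5001/invocations", json=payload, timeout=10)
resp.raise_for_status()
print(resp.json()["predictions"])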

Docker Deployment

Shell — Build and run Docker container
# Build a Docker image from a logged model
mlflow models build-docker \
  -m "models:/churn-predictor/Production" \
  -n "churn-predictor" \
  --enable-mlserver  # Use MLServer for better performance

# Run the container
docker run -p 5001:8080 churn-predictor

# With environment variables
docker run -p 5001:8080 \
  -e MLFLOW_TRACKING_URI=http://tracking-server:5000 \
  churn-predictor
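
Recent MLflow versions (2.x) also expose image building as a Python API, mlflow.models.build_docker; a minimal sketch mirroring the CLI call above:

Python — Building the image programmatically
import mlflow.models

# Equivalent of the build-docker CLI invocation above
mlflow.models.build_docker(
    model_uri="models:/churn-predictor/Production",
    name="churn-predictor",
    enable_mlserver=True,
)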

Kubernetes Deployment

YAML — Kubernetes deployment for MLflow model
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-predictor
  labels:
    app: churn-predictor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: churn-predictor
  template:
    metadata:
      labels:
        app: churn-predictor
    spec:
      containers:
      - name: model
        image: churn-predictor:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-predictor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: churn-predictor
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Cloud Deployment

AWS SageMaker

Python — Deploy to SageMaker
import mlflow.sagemaker

# Deploy model to SageMaker (MLflow 1.x API; see the 2.x client below)
mlflow.sagemaker.deploy(
    app_name="churn-predictor",
    model_uri="models:/churn-predictor/Production",
    region_name="us-east-1",
    mode="create",
    instance_type="ml.m5.large",
    instance_count=2,
)
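
mlflow.sagemaker.deploy is the MLflow 1.x API; MLflow 2.x replaced it with the deployments client. A minimal sketch of the 2.x equivalent (config keys follow the SageMaker deployment plugin):

Python — Deploy with the MLflow deployments client (MLflow 2.x)
from mlflow.deployments import get_deploy_client

# The target URI encodes the region: sagemaker:/<region>
client = get_deploy_client("sagemaker:/us-east-1")
client.create_deployment(
    name="churn-predictor",
    model_uri="models:/churn-predictor/Production",
    config={"instance_type": "ml.m5.large", "instance_count": 2},
)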

Azure ML

Python — Deploy to Azure ML
import mlflow.azureml

# Build an Azure ML container image from the model
# (build_image returns the image first, then the registered model)
azure_image, azure_model = mlflow.azureml.build_image(
    model_uri="models:/churn-predictor/Production",
    workspace=workspace,
    model_name="churn-predictor",
    synchronous=True,
)

# Deploy the image to Azure Container Instances (or AKS)
from azureml.core.webservice import AciWebservice, Webservice

aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Webservice.deploy_from_image(
    workspace=workspace,
    name="churn-service",
    image=azure_image,
    deployment_config=aci_config,
)
service.wait_for_deployment(show_output=True)
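
Note: the mlflow.azureml module is an MLflow 1.x API and was removed in MLflow 2.x, where Azure ML deployment goes through the azureml-mlflow plugin and the mlflow.deployments client instead.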

Batch Inference

Python — Batch predictions with MLflow
import mlflow
import pandas as pd

# Load production model
model = mlflow.pyfunc.load_model("models:/churn-predictor/Production")

# Load batch data
batch_data = pd.read_parquet("s3://data/daily_customers.parquet")

# Generate predictions
predictions = model.predict(batch_data)

# Save results
results = batch_data.assign(
    churn_prediction=predictions,
    prediction_date=pd.Timestamp.now(),
    model_version="Production",
)
results.to_parquet("s3://data/predictions/daily_churn.parquet")
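
For batches too large to score in a single pandas DataFrame, MLflow can wrap the model as a Spark UDF. A minimal sketch, assuming an active SparkSession named spark and the same column layout as above:

Python — Distributed batch scoring with a Spark UDF
import mlflow.pyfunc
from pyspark.sql.functions import struct

# Wrap the registered model as a Spark UDF
churn_udf = mlflow.pyfunc.spark_udf(
    spark, "models:/churn-predictor/Production", result_type="double"
)

# Score the batch in parallel across the cluster
df = spark.read.parquet("s3://data/daily_customers.parquet")
scored = df.withColumn(
    "churn_prediction",
    churn_udf(struct("age", "income", "tenure", "num_products")),
)
scored.write.mode("overwrite").parquet("s3://data/predictions/daily_churn.parquet")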

Monitoring Deployed Models

After deployment: monitor prediction latency, throughput, error rates, and prediction distributions. Set up alerts for anomalies, and log all predictions for later analysis and drift detection. See the MLOps Monitoring lesson for detailed guidance.
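
As a starting point, here is a minimal sketch of prediction logging; the file path and record schema are illustrative, not an MLflow API:

Python — Logging predictions for drift analysis
import json
import time

import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/churn-predictor/Production")

def predict_and_log(batch, log_path="prediction_log.jsonl"):
    """Score a pandas DataFrame and append one JSON record per prediction."""
    start = time.time()
    predictions = model.predict(batch)
    latency_ms = (time.time() - start) * 1000
    with open(log_path, "a") as f:
        for row, pred in zip(batch.to_dict("records"), predictions):
            f.write(json.dumps({
                "timestamp": time.time(),
                "inputs": row,
                "prediction": float(pred),
                "batch_latency_ms": latency_ms,
            }) + "\n")
    return predictions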