
Model Deployment

Deploy trained models to production using Watson Machine Learning — online endpoints, batch scoring, REST APIs, model versioning, and monitoring.

Watson Machine Learning (WML)

Watson Machine Learning is IBM's service for deploying, managing, and monitoring AI models in production. It supports models built with scikit-learn, TensorFlow, Keras, PyTorch, and other frameworks.

Deployment Workflow

  1. Save the model — Serialize the trained model (joblib for scikit-learn, SavedModel for TensorFlow)
  2. Store in WML — Upload the model to the Watson ML repository with metadata
  3. Create a deployment — Deploy as an online endpoint or batch job
  4. Test the endpoint — Send test data to verify predictions
  5. Integrate — Connect the endpoint to your application via REST API
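
# Deploy a model using the IBM Watson ML Python client (ibm-watson-machine-learning package)
from ibm_watson_machine_learning import APIClient

# Connect to WML (replace the placeholder API key with your own)
wml_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": "your-api-key"
}
client = APIClient(wml_credentials)

# Models and deployments live in a deployment space; set it as the default first
client.set.default_space("your-space-id")

# Store the model (trained_model is the fitted scikit-learn estimator from training)
model_details = client.repository.store_model(
    model=trained_model,
    meta_props={
        client.repository.ModelMetaNames.NAME: "churn-predictor",
        client.repository.ModelMetaNames.TYPE: "scikit-learn_1.1",
        client.repository.ModelMetaNames.SOFTWARE_SPEC_UID:
            client.software_specifications.get_id_by_name("runtime-23.1-py3.10")
    }
)
# Extract the stored model's ID so the deployment can reference it
model_id = client.repository.get_model_id(model_details)

# Deploy as online endpoint
deployment = client.deployments.create(
    artifact_uid=model_id,
    meta_props={
        client.deployments.ConfigurationMetaNames.NAME: "churn-deployment",
        client.deployments.ConfigurationMetaNames.ONLINE: {}
    }
)
deployment_id = client.deployments.get_id(deployment)

# Test the endpoint with a sample record
print(client.deployments.score(deployment_id, {
    "input_data": [{
        "fields": ["age", "tenure", "monthly_charges"],
        "values": [[35, 12, 65.50]]
    }]
}))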

Deployment Types

Online Deployment

Creates a REST API endpoint for real-time predictions. Each request receives an immediate response.

  • Use case: Real-time scoring (fraud detection, recommendations, chatbots)
  • Latency: Milliseconds to seconds
  • Scaling: IBM manages autoscaling based on request volume
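Whether an online endpoint meets a latency budget (for example 200 ms per transaction) is best checked empirically from the client side. The sketch below is a minimal illustration under stated assumptions: `score` is a hypothetical callable standing in for your request to the deployment (such as the `requests.post` call in the REST API Integration section), and the 50-call sample size and the choice of the 95th percentile are arbitrary defaults, not WML settings.

```python
import statistics
import time
from typing import Callable, List, Sequence


def measure_latency_ms(score: Callable[[dict], object], payload: dict, n: int = 50) -> List[float]:
    """Call a (hypothetical) scoring function n times; return per-call latency in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        score(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile via statistics.quantiles (needs at least two samples)."""
    return statistics.quantiles(latencies_ms, n=100)[94]


def meets_budget(latencies_ms: Sequence[float], budget_ms: float) -> bool:
    """True if the 95th-percentile latency fits within the budget."""
    return p95(latencies_ms) <= budget_ms
```

Checking a percentile rather than the mean matters for real-time use cases: a few slow calls can violate a per-transaction budget even when the average looks fine. Measure from the network location your application actually calls from.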

Batch Deployment

Processes large datasets in bulk. Submit a job with input data and retrieve results when complete.

  • Use case: Scoring entire customer databases, nightly reports, bulk predictions
  • Cost: More cost-effective for large volumes than individual API calls
  • Data sources: Can read from Cloud Object Storage, databases, or inline data
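Because a batch job runs asynchronously, client code typically submits the job and then polls until it finishes. The sketch below shows only the polling loop. `get_state` is a hypothetical callable; with the WML Python client it might wrap something like `client.deployments.get_job_status(job_id)`, but check what your client version returns before relying on a particular key. The state names (`queued`, `running`, `completed`, `failed`, `canceled`) and the polling interval are assumptions.

```python
import time
from typing import Callable

# Assumed terminal job states; verify against your service's job status values
TERMINAL_STATES = {"completed", "failed", "canceled"}


def wait_for_job(get_state: Callable[[], str],
                 poll_seconds: float = 30.0,
                 timeout_seconds: float = 3600.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll a batch job until it reaches a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still '{state}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter is injectable so the loop can be tested without waiting. A nightly job would call `wait_for_job` and fetch results from the output location only when it returns `"completed"`.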

REST API Integration

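# Scoring with the deployed model via REST API
import requests

API_KEY = "your-api-key"

# Exchange the IBM Cloud API key for a short-lived IAM bearer token
token_response = requests.post(
    "https://iam.cloud.ibm.com/identity/token",
    data={"apikey": API_KEY, "grant_type": "urn:ibm:params:oauth:grant-type:apikey"},
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
token_response.raise_for_status()
token = token_response.json()["access_token"]

# Copy the full scoring URL from the deployment details (the v4 API expects a version query parameter)
scoring_url = "https://us-south.ml.cloud.ibm.com/..."
headers = {
    "Authorization": "Bearer " + token,
    "Content-Type": "application/json"
}

# One row per record; "values" columns follow the order given in "fields"
payload = {
    "input_data": [{
        "fields": ["age", "tenure", "monthly_charges"],
        "values": [[35, 12, 65.50], [55, 3, 85.00]]
    }]
}

response = requests.post(scoring_url, json=payload, headers=headers)
response.raise_for_status()
# A list of result blocks, each with its own "fields" and "values"
predictions = response.json()["predictions"]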
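Applications usually hold records as dictionaries, so a small adapter between that shape and the `fields`/`values` layout keeps scoring code tidy. The sketch below is illustrative: `build_payload` and `parse_predictions` are hypothetical helpers, and the response shape used in the usage note (a `predictions` block whose fields are `prediction` and `probability`) is an assumption typical of a scikit-learn classifier, so inspect your own deployment's output first.

```python
from typing import Any, Dict, List, Sequence


def build_payload(fields: Sequence[str], records: Sequence[Dict[str, Any]]) -> Dict[str, Any]:
    """Convert dict records into the input_data fields/values layout."""
    missing = {f for r in records for f in fields if f not in r}
    if missing:
        raise KeyError(f"records missing fields: {sorted(missing)}")
    return {
        "input_data": [{
            "fields": list(fields),
            # Each row lists values in the same order as "fields"
            "values": [[r[f] for f in fields] for r in records],
        }]
    }


def parse_predictions(response_json: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn the first predictions block into one dict per scored row."""
    block = response_json["predictions"][0]
    return [dict(zip(block["fields"], row)) for row in block["values"]]
```

For example, a response of `{"predictions": [{"fields": ["prediction", "probability"], "values": [[1, [0.3, 0.7]]]}]}` would parse to `[{"prediction": 1, "probability": [0.3, 0.7]}]`.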

Model Versioning and Lifecycle

  • Model versioning — Store multiple versions of the same model in the WML repository
  • A/B deployment — Route traffic between model versions to compare performance
  • Rollback — Quickly revert to a previous model version if the new one underperforms
  • Model lineage — Track which data and code produced each model version
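The lifecycle ideas above reduce to some simple bookkeeping. The sketch below is a hypothetical in-memory registry, not the WML repository API: it records lineage per version, keeps a promotion history so rollback means "return to the previously promoted version", and routes a configurable share of traffic to a challenger for A/B comparison. All names here are illustrative.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelRegistry:
    """Hypothetical registry illustrating versioning, lineage, and rollback (not the WML API)."""
    artifacts: Dict[str, str] = field(default_factory=dict)           # version -> artifact location
    lineage: Dict[str, Dict[str, str]] = field(default_factory=dict)  # version -> data/code references
    history: List[str] = field(default_factory=list)                  # promoted versions, oldest first

    def register(self, version: str, artifact: str, data_ref: str, code_ref: str) -> None:
        if version in self.artifacts:
            raise ValueError(f"version {version!r} already registered")
        self.artifacts[version] = artifact
        self.lineage[version] = {"data": data_ref, "code": code_ref}

    def promote(self, version: str) -> None:
        """Make a registered version the production version."""
        if version not in self.artifacts:
            raise KeyError(version)
        self.history.append(version)

    @property
    def production(self) -> Optional[str]:
        return self.history[-1] if self.history else None

    def rollback(self) -> str:
        """Revert production to the previously promoted version."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.history[-1]


def route(champion: str, challenger: str, challenger_share: float, rng: random.Random) -> str:
    """A/B split: send roughly challenger_share of requests to the challenger."""
    return challenger if rng.random() < challenger_share else champion
```

Keeping the full promotion history, rather than a single "current" pointer, is what makes rollback a constant-time operation that needs no redeployment decision at incident time.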

Model Monitoring

IBM Watson OpenScale (now part of watsonx.governance) monitors deployed models:

  • Quality monitoring — Track accuracy, precision, recall over time with feedback data
  • Drift monitoring — Detect when input data distributions shift from training data
  • Fairness monitoring — Check for bias across protected attributes (age, gender, race)
  • Explainability — Generate explanations for individual predictions (for example LIME, SHAP, or contrastive explanations)
Monitoring is not optional: Model performance tends to degrade over time as production data drifts away from the training data. Without monitoring, you will not know when a model starts making bad predictions. Set up automated alerts for quality drops and retrain when performance falls below your threshold.
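OpenScale computes its monitors for you; the sketch below only illustrates the kind of arithmetic behind quality, drift, and fairness checks, in plain Python. Two things are swapped in for illustration: drift is measured with the Population Stability Index (PSI), a generic statistic and not OpenScale's own drift method, and the thresholds mentioned below (PSI above about 0.25 as significant drift, disparate impact below 0.8 per the "four-fifths rule") are common conventions, not product defaults you should assume.

```python
import math
from bisect import bisect_right
from typing import List, Sequence, Tuple


def _bin_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values in each bin; `edges` are sorted interior cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between training (expected) and production (actual) values."""
    e = _bin_shares(expected, edges)
    a = _bin_shares(actual, edges)
    # eps keeps log() finite when a bin is empty in one of the samples
    return sum((ai - ei) * math.log((ai + eps) / (ei + eps)) for ei, ai in zip(e, a))


def disparate_impact(outcomes: Sequence, groups: Sequence, monitored, reference,
                     favorable=1) -> float:
    """Favorable-outcome rate of the monitored group divided by the reference group's rate."""
    def rate(group) -> float:
        picked = [o for o, g in zip(outcomes, groups) if g == group]
        if not picked:
            raise ValueError(f"no records for group {group!r}")
        return sum(o == favorable for o in picked) / len(picked)
    return rate(monitored) / rate(reference)


def quality_alert(y_true: Sequence, y_pred: Sequence, threshold: float) -> Tuple[float, bool]:
    """Accuracy on labeled feedback data, plus whether it fell below the retraining threshold."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    return accuracy, accuracy < threshold
```

In this sketch a drift alert would fire when `psi(...)` exceeds your chosen cutoff, a fairness alert when `disparate_impact(...)` drops below 0.8, and a retraining trigger when `quality_alert(...)` returns `True`.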

Practice Questions

Q1: A retail company needs to score 5 million customer records nightly to generate personalized offers. Which WML deployment type should they use?

A) Online deployment
B) Batch deployment
C) Edge deployment
D) Streaming deployment
Answer:

B) Batch deployment. Batch deployment is designed for processing large volumes of data in bulk. Scoring 5 million records individually through an online endpoint would be slow and expensive. Batch jobs process the entire dataset efficiently in a single run.

Q2: After deploying a model, you notice its accuracy drops from 92% to 78% over three months. The model has not changed. What is the most likely cause?

A) The REST API is failing
B) Data drift
C) The model is overfitting
D) The server ran out of memory
Answer:

B) Data drift. When a model's accuracy degrades over time without any changes to the model itself, the most likely cause is that the input data distribution has shifted away from the data the model was trained on. Retraining on recent, representative data usually restores accuracy.

Q3: Which step comes FIRST in the Watson ML deployment workflow?

A) Create a REST endpoint
B) Monitor for drift
C) Store the trained model in the WML repository
D) Send scoring requests
Answer:

C) Store the trained model in the WML repository. Before creating a deployment, you must first store the trained model in the Watson ML repository with metadata (name, type, software spec). Then you create a deployment from the stored model, and finally you can send scoring requests.

Q4: A fraud detection system needs to return a risk score within 200 milliseconds for each transaction. Which deployment type is appropriate?

A) Batch deployment
B) Online deployment
C) Scheduled job
D) Manual scoring
Answer:

B) Online deployment. Real-time fraud detection requires immediate responses. An online deployment creates a REST API endpoint that returns a prediction for each request, typically within milliseconds for a lightweight model, so it can meet the 200 ms latency requirement. Batch deployment processes data in bulk and is not suitable for real-time use cases.

Q5: Which monitoring capability checks if a deployed model's predictions are biased against specific demographic groups?

A) Quality monitoring
B) Drift monitoring
C) Fairness monitoring
D) Performance monitoring
Answer:

C) Fairness monitoring. Fairness monitoring in watsonx.governance checks model predictions across protected attributes (age, gender, race, etc.) to detect disparate impact or bias. It alerts when predictions show unfair patterns and suggests debiasing actions.