Serving & Monitoring
Model serving and monitoring together account for ~36% of the exam (Domains 4 and 6). This lesson covers online and batch prediction, endpoint management, traffic splitting, model monitoring, drift detection, and retraining strategies.
Online Prediction vs. Batch Prediction
This is one of the most fundamental decisions on the exam. Know when to use each:
| Feature | Online Prediction | Batch Prediction |
|---|---|---|
| Latency | Low (milliseconds) | High (minutes to hours) |
| Throughput | One request (or a small batch of instances) per call | Millions of records per job |
| Infrastructure | Always-on endpoint | Spun up per job, then shut down |
| Cost model | Pay for uptime (machine hours), even when idle | Pay for compute only while the job runs |
| Use cases | Real-time recommendations, fraud detection | Nightly reports, bulk scoring, email campaigns |
| GCP service | Vertex AI Endpoints | Vertex AI Batch Prediction |
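The cost-model row above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a single hypothetical rate of $0.20 per node-hour for both modes; the rate, replica counts, and job durations are illustrative numbers, not published pricing.

```python
HOURS_PER_MONTH = 730.0  # average hours in a month


def endpoint_monthly_cost(hourly_rate: float, replicas: int,
                          hours: float = HOURS_PER_MONTH) -> float:
    """Always-on endpoint: billed for every replica-hour, even with no traffic."""
    return hourly_rate * replicas * hours


def batch_monthly_cost(hourly_rate: float, workers: int,
                       job_hours: float, jobs_per_month: int) -> float:
    """Batch jobs: billed only while the job's workers are running."""
    return hourly_rate * workers * job_hours * jobs_per_month


RATE = 0.20  # hypothetical $ per node-hour, for illustration only

# Two replicas serving around the clock vs. a nightly 1.5-hour job on 4 workers.
online = endpoint_monthly_cost(RATE, replicas=2)
batch = batch_monthly_cost(RATE, workers=4, job_hours=1.5, jobs_per_month=30)

print(f"online endpoint: ${online:.2f}/month")
print(f"nightly batch:   ${batch:.2f}/month")
```

Under these assumed numbers the nightly job is far cheaper, which is why batch prediction is usually the better fit when nobody needs the result within seconds.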
Vertex AI Endpoints
Endpoints are the serving infrastructure for online predictions. Key concepts:
Endpoint Configuration
- Machine type: Choose based on model size and latency requirements (e.g., n1-standard machine types for small models, GPU-attached machines for deep learning models)
- Min/max replicas: Configure autoscaling based on traffic patterns
- Traffic split: Route traffic percentages to different model versions (for A/B testing and canary deployments)
- Private endpoints: Restrict access to VPC-internal traffic for security
Model Deployment Options
Pre-built Serving Containers
Use for standard frameworks (TF SavedModel, PyTorch, XGBoost, scikit-learn). Google maintains optimized serving containers with TF Serving and TorchServe.
Custom Serving Containers
Use when you need custom pre/post-processing, non-standard frameworks, or multi-model serving. The container must run an HTTP server that exposes a health check route and a prediction route.
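A minimal sketch of what such a container's server might look like, using only the Python standard library. It assumes the Vertex AI conventions of a JSON request body with an `instances` list, a JSON response with a `predictions` list, and the `AIP_HTTP_PORT`, `AIP_HEALTH_ROUTE`, and `AIP_PREDICT_ROUTE` environment variables; the fallback defaults and `toy_model` are hypothetical placeholders, and a production container would typically use a proper web framework.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Read the port and routes from the environment; the defaults are
# assumptions for local runs only.
PORT = int(os.environ.get("AIP_HTTP_PORT", "8080"))
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")


def toy_model(instance):
    """Hypothetical stand-in for a real model: sums the numeric features."""
    return sum(instance)


def handle_predict(body: dict) -> dict:
    """Map {"instances": [...]} to {"predictions": [...]}."""
    return {"predictions": [toy_model(x) for x in body["instances"]]}


class Handler(BaseHTTPRequestHandler):
    def _send(self, status: int, payload: dict) -> None:
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        # Health check: report ready once the model is loaded.
        if self.path == HEALTH_ROUTE:
            self._send(200, {"status": "ok"})
        else:
            self._send(404, {"error": "not found"})

    def do_POST(self):
        if self.path != PREDICT_ROUTE:
            self._send(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = json.loads(self.rfile.read(length))
            self._send(200, handle_predict(body))
        except (ValueError, KeyError, TypeError) as exc:
            self._send(400, {"error": str(exc)})


def serve() -> None:
    """Start the server; call this from the container's entrypoint."""
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```

Keeping `handle_predict` separate from the HTTP handler makes the pre/post-processing logic testable without starting a server, for example `handle_predict({"instances": [[1, 2], [3, 4]]})`.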
Model Garden
Deploy pre-trained foundation models (Gemini, PaLM, open-source LLMs) with one click. Handles infrastructure automatically.
Traffic Splitting and Canary Deployments
Vertex AI supports deploying multiple model versions to a single endpoint with traffic splitting:
- Canary deployment: Route 5–10% of traffic to the new model, monitor metrics, then gradually increase
- A/B testing: Split traffic 50/50 between two models to compare business metrics
- Blue-green deployment: Deploy new model to a separate endpoint, switch DNS when validated
- Shadow deployment: Mirror every request to both models, serve only the old model's predictions, and compare the new model's outputs offline
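As a rough sketch of the canary pattern above: a traffic split can be represented as a mapping from deployed-model ID to an integer percentage, with the percentages summing to 100. The IDs `"stable"` and `"canary"` and the ramp steps are hypothetical; in practice each step would be applied only after the monitored metrics look healthy.

```python
def canary_split(stable_id: str, canary_id: str, canary_pct: int) -> dict:
    """Build a traffic split that sends canary_pct percent to the new model."""
    if not 0 <= canary_pct <= 100:
        raise ValueError("canary_pct must be between 0 and 100")
    return {stable_id: 100 - canary_pct, canary_id: canary_pct}


def ramp_schedule(stable_id: str, canary_id: str,
                  steps=(5, 10, 25, 50, 100)) -> list:
    """Return the sequence of splits for a gradual rollout."""
    return [canary_split(stable_id, canary_id, pct) for pct in steps]


# Hypothetical deployed-model IDs, for illustration only.
for split in ramp_schedule("stable", "canary"):
    print(split)
```

An A/B test is the special case `canary_split("stable", "canary", 50)`, held constant for the duration of the experiment rather than ramped.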
Model Monitoring
Vertex AI Model Monitoring automatically detects data drift and prediction drift. This is Domain 6 (~18% of the exam).
Types of Drift
| Drift Type | What Changes | Detection Method | Example |
|---|---|---|---|
| Data drift (covariate shift) | Input feature distributions change | Jensen-Shannon divergence (numerical features), L-infinity distance (categorical features) | Average age of users shifts from 30 to 45 |
| Prediction drift | Model output distributions change | Compare prediction distributions over time | Model predicts "positive" for 80% of requests, up from 50% |
| Concept drift | The relationship between inputs and target changes | Monitor performance metrics (requires labels) | Consumer behavior changes post-pandemic |
| Feature attribution drift | Feature importance changes | Compare Shapley values over time | "Location" suddenly becomes the top predictor |
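The two distance measures named in the table can be computed directly from binned distributions. This is a minimal sketch in plain Python, assuming both distributions use the same bins and using log base 2 so that the Jensen-Shannon divergence falls between 0 and 1; the age-bucket counts are made up for illustration.

```python
import math


def normalize(counts):
    """Turn raw bin counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with log base 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def l_infinity(p, q):
    """L-infinity distance: the largest absolute difference in any one bin."""
    return max(abs(a - b) for a, b in zip(p, q))


# Hypothetical counts per age bucket: [18-29, 30-44, 45-59, 60+]
training = normalize([400, 350, 150, 100])
serving = normalize([150, 300, 350, 200])

print(f"JS divergence: {js_divergence(training, serving):.3f}")
print(f"L-infinity:    {l_infinity(training, serving):.3f}")
```

Identical distributions score 0 on both measures, and larger values mean the serving data has moved further from the training baseline.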
Vertex AI Model Monitoring Configuration
- Training baseline: The reference distribution from your training data
- Alert thresholds: Set per-feature drift thresholds (e.g., Jensen-Shannon divergence > 0.1)
- Sampling rate: Percentage of predictions to log for monitoring (balance cost vs. coverage)
- Monitoring frequency: Hourly, daily, or custom intervals
- Alert channels: Cloud Monitoring, email, Pub/Sub, or Cloud Functions for automated response
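To illustrate how the sampling rate and per-feature thresholds interact, here is a small sketch. The function names, the feature names, and the fallback threshold of 0.3 are assumptions for illustration and not a managed-service API; the drift scores would come from comparing logged samples against the training baseline.

```python
import random


def should_log(sampling_rate: float, rng: random.Random) -> bool:
    """Decide whether to log one prediction request for monitoring."""
    return rng.random() < sampling_rate


def drift_alerts(scores: dict, thresholds: dict,
                 default_threshold: float = 0.3) -> list:
    """Return the features whose drift score exceeds their threshold."""
    return sorted(
        feature for feature, score in scores.items()
        if score > thresholds.get(feature, default_threshold)
    )


rng = random.Random(0)  # seeded so the example is repeatable
logged = sum(should_log(0.2, rng) for _ in range(10_000))
print(f"logged {logged} of 10000 requests at a 20% sampling rate")

# Hypothetical drift scores from the latest monitoring window.
scores = {"age": 0.15, "country": 0.05, "device": 0.35}
thresholds = {"age": 0.1, "country": 0.1}  # "device" uses the default
print(drift_alerts(scores, thresholds))
```

A lower sampling rate reduces logging cost but makes each window's distribution estimate noisier, so very tight thresholds paired with very low sampling rates tend to produce spurious alerts.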
Logging and Observability
Complete ML observability on GCP involves multiple services:
- Cloud Logging: Prediction request/response logs, error logs, model server logs
- Cloud Monitoring: Latency, throughput, error rates, GPU utilization metrics
- Vertex AI Model Monitoring: Data drift, prediction drift, feature attribution
- BigQuery logging: Store prediction logs in BigQuery for long-term analysis and debugging
When to Retrain
The exam tests your ability to determine when and how to retrain models:
Scheduled Retraining
Retrain on a fixed schedule (daily, weekly, monthly). Simple and predictable. Best when data changes at a known pace (e.g., daily sales data).
Triggered Retraining
Retrain when monitoring detects drift above a threshold. More efficient than scheduled retraining, but requires monitoring infrastructure. Use a Pub/Sub message to trigger the retraining pipeline.
Continuous Training
Continuously train on new data as it arrives. This is the most resource-intensive option but keeps the model freshest. Best for rapidly changing domains (ad click prediction, stock trading).
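Scheduled and triggered retraining are often combined: retrain when drift crosses a threshold, and otherwise fall back to a maximum model age. A minimal sketch of that decision, with hypothetical names and values; in a real system the returned reason would be published (for example, as a Pub/Sub message) to start the training pipeline.

```python
from datetime import datetime, timedelta
from typing import Optional


def retrain_reason(now: datetime, last_trained: datetime, max_age: timedelta,
                   drift_score: float, drift_threshold: float) -> Optional[str]:
    """Return why to retrain ("drift" or "schedule"), or None to skip."""
    if drift_score > drift_threshold:
        return "drift"      # triggered retraining
    if now - last_trained >= max_age:
        return "schedule"   # scheduled retraining as a fallback
    return None


# Hypothetical values, for illustration only.
last_trained = datetime(2024, 1, 1)
weekly = timedelta(days=7)

print(retrain_reason(datetime(2024, 1, 3), last_trained, weekly, 0.25, 0.1))
print(retrain_reason(datetime(2024, 1, 9), last_trained, weekly, 0.02, 0.1))
print(retrain_reason(datetime(2024, 1, 3), last_trained, weekly, 0.02, 0.1))
```

Checking drift first means a sudden distribution change is handled immediately, while the age check guarantees the model is refreshed even when monitoring reports nothing.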
Practice Questions
A. Vertex AI online prediction endpoint with autoscaling
B. Vertex AI batch prediction job scheduled at 2 AM
C. Cloud Run with a custom Flask serving container
D. BigQuery ML PREDICT function
A. Canary deployment with 10% traffic split
B. A/B testing with 50/50 split
C. Shadow deployment (mirror traffic to new model)
D. Blue-green deployment with instant cutover
A. Ignore it since accuracy is still good
B. Immediately retrain the model with recent data
C. Investigate the cause, collect labeled data from the new distribution, and plan retraining
D. Increase the drift threshold to 0.5 to reduce alerts
A. Enable autoscaling with a lower CPU utilization target
B. Switch from online to batch prediction
C. Add GPU accelerators to the serving machines
D. Increase the monitoring sampling rate
E. Reduce model size using quantization or distillation