Advanced

Serving & Monitoring

Model serving and monitoring together account for ~36% of the exam (Domains 4 and 6). This lesson covers online and batch prediction, endpoint management, traffic splitting, model monitoring, drift detection, and retraining strategies.

Online Prediction vs. Batch Prediction

Choosing between online and batch prediction is one of the most fundamental decisions on the exam. Know when to use each:

Feature | Online Prediction | Batch Prediction
Latency | Low (milliseconds) | High (minutes to hours)
Throughput | One request at a time | Millions of records at once
Infrastructure | Always-on endpoint | Spun up per job, then shut down
Cost model | Pay for uptime (machine hours) | Pay per prediction job
Use cases | Real-time recommendations, fraud detection | Nightly reports, bulk scoring, email campaigns
GCP service | Vertex AI Endpoints | Vertex AI Batch Prediction
💡
Exam Decision Rule: If the question mentions "real-time," "low latency," or "user-facing," choose online prediction. If it mentions "nightly," "all customers," "bulk scoring," or "cost optimization," choose batch prediction.
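
The decision rule above can be written as a small function. This is only a sketch: `ServingRequirements`, `choose_serving_mode`, and the one-second cutoff are hypothetical names and values chosen for illustration, not part of any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass
class ServingRequirements:
    """Hypothetical summary of what an exam scenario asks for."""
    user_facing: bool       # a person or service is waiting on the answer
    max_latency_ms: float   # acceptable time to produce a prediction


def choose_serving_mode(req: ServingRequirements) -> str:
    """Return "online" or "batch" using the rule of thumb above."""
    # Assumed cutoff: anything that must answer within one second is real-time.
    if req.user_facing or req.max_latency_ms <= 1_000:
        return "online"
    return "batch"


# Fraud detection at checkout: user-facing and latency-sensitive.
fraud_check = ServingRequirements(user_facing=True, max_latency_ms=100)
# Nightly scoring of all customers: nobody waits on an individual result.
nightly_scores = ServingRequirements(user_facing=False, max_latency_ms=4 * 3600 * 1000)

print(choose_serving_mode(fraud_check))
print(choose_serving_mode(nightly_scores))
```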

Vertex AI Endpoints

Endpoints are the serving infrastructure for online predictions. Key concepts:

Endpoint Configuration

  • Machine type: Choose based on model size and latency requirements (n1-standard machine types for small models, GPU-attached machines for deep learning models)
  • Min/max replicas: Configure autoscaling based on traffic patterns
  • Traffic split: Route traffic percentages to different model versions (for A/B testing and canary deployments)
  • Private endpoints: Restrict access to VPC-internal traffic for security
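
The settings above can be collected and validated before deployment. The sketch below is plain Python with a hypothetical helper, `build_deploy_kwargs`. Its dictionary keys are intended to mirror the parameter names of `Model.deploy` in the `google-cloud-aiplatform` SDK, but treat that mapping as an assumption and check it against the SDK reference. The rule that at least one replica is required is also an assumption of this sketch.

```python
def build_deploy_kwargs(machine_type, min_replicas, max_replicas,
                        traffic_percentage, accelerator_type=None,
                        accelerator_count=0):
    """Validate endpoint settings and return them as keyword arguments."""
    if min_replicas < 1:
        raise ValueError("this sketch assumes an always-on endpoint (min_replicas >= 1)")
    if max_replicas < min_replicas:
        raise ValueError("max_replicas must be >= min_replicas")
    if not 0 <= traffic_percentage <= 100:
        raise ValueError("traffic_percentage must be between 0 and 100")

    kwargs = {
        "machine_type": machine_type,
        "min_replica_count": min_replicas,
        "max_replica_count": max_replicas,
        "traffic_percentage": traffic_percentage,
    }
    if accelerator_type is not None:
        if accelerator_count < 1:
            raise ValueError("accelerator_count must be >= 1 when a GPU is requested")
        kwargs["accelerator_type"] = accelerator_type
        kwargs["accelerator_count"] = accelerator_count
    return kwargs


# Small tabular model on CPU, receiving 10% of traffic as a canary.
cpu_settings = build_deploy_kwargs("n1-standard-4", 1, 5, traffic_percentage=10)

# Deep learning model with one GPU per replica, receiving all traffic.
gpu_settings = build_deploy_kwargs("n1-standard-8", 2, 10, traffic_percentage=100,
                                   accelerator_type="NVIDIA_TESLA_T4",
                                   accelerator_count=1)
print(cpu_settings)
print(gpu_settings)

# With the SDK installed and authenticated, these would be passed along as
# model.deploy(endpoint=endpoint, **cpu_settings)   # not executed here
```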

Model Deployment Options

Pre-built Serving Containers

Use for standard frameworks (TF SavedModel, PyTorch, XGBoost, scikit-learn). Google maintains optimized serving containers with TF Serving and TorchServe.

Custom Serving Containers

Use when you need custom pre/post-processing, non-standard frameworks, or multi-model serving. The container must run an HTTP server that exposes a health check route and a prediction route.
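
A minimal sketch of such a server, using only the Python standard library, might look like the following. It assumes the container reads the `AIP_HTTP_PORT`, `AIP_HEALTH_ROUTE`, and `AIP_PREDICT_ROUTE` environment variables and exchanges JSON bodies of the form `{"instances": [...]}` and `{"predictions": [...]}`. The `predict` function is a stand-in, not a real model, and the default routes are placeholders for local runs.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Values supplied by the serving platform; defaults are placeholders for local runs.
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
PORT = int(os.environ.get("AIP_HTTP_PORT", "8080"))


def predict(instances):
    """Stand-in model: returns the sum of each instance's features."""
    return [sum(features) for features in instances]


def handle_predict(raw_body: bytes) -> bytes:
    """Parse the request, call the model, and serialize the response."""
    instances = json.loads(raw_body)["instances"]
    return json.dumps({"predictions": predict(instances)}).encode("utf-8")


class Handler(BaseHTTPRequestHandler):
    def _send(self, status, body=b"{}"):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: 200 on the health route, 404 elsewhere.
        self._send(200 if self.path == HEALTH_ROUTE else 404)

    def do_POST(self):
        if self.path != PREDICT_ROUTE:
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        self._send(200, handle_predict(self.rfile.read(length)))


def serve():
    """Start the server; intended to be called from the container entrypoint."""
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```

Calling `serve()` starts the server. The prediction logic lives in `handle_predict` so it can be tested without opening a socket.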

Model Garden

Deploy pre-trained foundation models (Gemini, PaLM, open-source LLMs) with one click. Handles infrastructure automatically.

Traffic Splitting and Canary Deployments

Vertex AI supports deploying multiple model versions to a single endpoint with traffic splitting:

  • Canary deployment: Route 5–10% of traffic to the new model, monitor metrics, then gradually increase
  • A/B testing: Split traffic 50/50 between two models to compare business metrics
  • Blue-green deployment: Deploy new model to a separate endpoint, switch DNS when validated
  • Shadow deployment: Mirror all production traffic to both models, return only the old model's predictions to users, and compare the two sets of outputs offline
Exam Distinction: A/B testing compares models on business metrics (revenue, click-through rate). Model evaluation compares on ML metrics (accuracy, AUC). The exam tests whether you know when to use each. A/B testing requires live traffic; evaluation uses held-out test data.
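
To make the traffic-split mechanics concrete, the sketch below simulates a split as a mapping from model ID to integer percentage. The function names, the model IDs `current` and `candidate`, the ramp steps, and the hash-based routing are all hypothetical choices made for illustration. A managed endpoint does its own routing, so this shows only the arithmetic.

```python
import hashlib


def validate_traffic_split(split):
    """A split maps model IDs to non-negative percentages that sum to 100."""
    if any(pct < 0 for pct in split.values()):
        raise ValueError("percentages must be non-negative")
    if sum(split.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return split


def route(request_id, split):
    """Pick a model deterministically from a hash of the request ID."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    upper = 0
    for model_id, pct in split.items():
        upper += pct
        if bucket < upper:
            return model_id
    raise RuntimeError("unreachable when the split sums to 100")


def canary_schedule(steps=(5, 10, 25, 50, 100)):
    """Yield successive splits that shift traffic to the candidate model."""
    for pct in steps:
        yield validate_traffic_split({"current": 100 - pct, "candidate": pct})


# Simulate 10,000 requests against the first canary step (95/5).
first_step = next(canary_schedule())
counts = {"current": 0, "candidate": 0}
for i in range(10_000):
    counts[route(f"request-{i}", first_step)] += 1
print(first_step, counts)
```

An A/B test corresponds to a single `{"current": 50, "candidate": 50}` split held constant for the duration of the experiment.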

Model Monitoring

Vertex AI Model Monitoring automatically detects data drift and prediction drift. This is Domain 6 (~18% of the exam).

Types of Drift

Drift Type | What Changes | Detection Method | Example
Data drift (covariate shift) | Input feature distributions change | Jensen-Shannon divergence (numerical features), L-infinity distance (categorical features) | Average age of users shifts from 30 to 45
Prediction drift | Model output distributions change | Compare prediction distributions over time | Model suddenly predicts "positive" 80% of the time instead of 50%
Concept drift | The relationship between inputs and target changes | Monitor performance metrics (requires labels) | Consumer behavior changes post-pandemic
Feature attribution drift | Feature importance changes | Compare Shapley values over time | "Location" suddenly becomes the top predictor
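
The two distance measures named in the table can be computed by hand on binned distributions. The following is a sketch of the textbook definitions, not the monitoring service's internal implementation. The age buckets and counts are made-up illustration data, and base-2 logarithms are an assumed choice that keeps the Jensen-Shannon divergence between 0 and 1.

```python
import math


def normalize(counts):
    """Turn histogram counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs (result lies in [0, 1])."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def l_infinity(p, q):
    """Largest absolute difference between matching category probabilities."""
    return max(abs(x - y) for x, y in zip(p, q))


# Hypothetical user_age histogram over four buckets: <25, 25-34, 35-44, 45+.
training_age = normalize([40, 30, 20, 10])
serving_age = normalize([10, 20, 30, 40])
print("JS divergence:", js_divergence(training_age, serving_age))

# Hypothetical categorical feature: share of traffic per device type.
training_device = normalize([50, 50])
serving_device = normalize([80, 20])
print("L-infinity:", l_infinity(training_device, serving_device))
```

Identical distributions score 0 under both measures, and larger values indicate a larger shift away from the training baseline.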

Vertex AI Model Monitoring Configuration

  • Training baseline: The reference distribution from your training data
  • Alert thresholds: Set per-feature drift thresholds (e.g., Jensen-Shannon divergence > 0.1)
  • Sampling rate: Percentage of predictions to log for monitoring (balance cost vs. coverage)
  • Monitoring frequency: Hourly, daily, or custom intervals
  • Alert channels: Cloud Monitoring, email, Pub/Sub, or Cloud Functions for automated response
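
The interaction between per-feature thresholds and the sampling rate can be sketched in plain Python. `MonitoringConfig`, `should_log`, and `drift_alerts` are hypothetical names, and the default values (threshold 0.3, sampling rate 0.1, daily interval) are assumptions for illustration rather than documented service defaults.

```python
import random
from dataclasses import dataclass


@dataclass
class MonitoringConfig:
    thresholds: dict                # feature name -> drift threshold
    default_threshold: float = 0.3  # assumed fallback for unlisted features
    sampling_rate: float = 0.1      # fraction of prediction requests to log
    interval_hours: int = 24        # how often drift is evaluated


def should_log(config, rng):
    """Decide whether to log one request, given the sampling rate."""
    return rng.random() < config.sampling_rate


def drift_alerts(config, scores):
    """Return {feature: (score, threshold)} for features above threshold."""
    alerts = {}
    for feature, score in scores.items():
        threshold = config.thresholds.get(feature, config.default_threshold)
        if score > threshold:
            alerts[feature] = (score, threshold)
    return alerts


config = MonitoringConfig(thresholds={"user_age": 0.1})

# Roughly 10% of requests end up in the monitoring log.
rng = random.Random(0)
logged = sum(should_log(config, rng) for _ in range(10_000))
print("logged requests:", logged)

# Only user_age exceeds its threshold; country stays below the fallback.
print(drift_alerts(config, {"user_age": 0.3, "country": 0.05}))
```

A higher sampling rate gives the drift computation more data but increases logging and storage cost, which is the trade-off noted in the list above.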

Logging and Observability

Complete ML observability on GCP involves multiple services:

  • Cloud Logging: Prediction request/response logs, error logs, model server logs
  • Cloud Monitoring: Latency, throughput, error rates, GPU utilization metrics
  • Vertex AI Model Monitoring: Data drift, prediction drift, feature attribution drift
  • BigQuery logging: Store prediction logs in BigQuery for long-term analysis and debugging
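
As an illustration of what serving metrics are derived from, the sketch below summarizes latency and error rate from a list of prediction log entries. The entry fields (`latency_ms`, `status`) are hypothetical, and the nearest-rank percentile is one of several common definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(log_entries):
    """Compute p50 and p95 latency and the server error rate."""
    latencies = [entry["latency_ms"] for entry in log_entries]
    errors = sum(1 for entry in log_entries if entry["status"] >= 500)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_rate": errors / len(log_entries),
    }


# Ten made-up log entries with latencies 10, 20, ..., 100 ms and one server error.
entries = [{"latency_ms": 10 * (i + 1), "status": 200} for i in range(10)]
entries[9]["status"] = 500
print(summarize(entries))
```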

When to Retrain

The exam tests your ability to determine when and how to retrain models:

Scheduled Retraining

Retrain on a fixed schedule (daily, weekly, monthly). Simple and predictable. Best when data changes at a known pace (e.g., daily sales data).

Triggered Retraining

Retrain when monitoring detects drift above a threshold. More efficient than scheduled retraining, but requires monitoring infrastructure. Use a Pub/Sub message to trigger the retraining pipeline.

Continuous Training

Continuously train on new data as it arrives. Most resource-intensive, but keeps the model freshest. Best for rapidly changing domains (ad click prediction, stock trading).
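
Scheduled and triggered retraining can be combined in a single policy check. The sketch below uses a hypothetical function, `retrain_reason`, with an assumed seven-day maximum model age. In a real system the returned reason would be the point at which a pipeline run is started, for example in response to a Pub/Sub message.

```python
from datetime import datetime, timedelta


def retrain_reason(now, last_trained, drift_scores, thresholds,
                   max_age=timedelta(days=7)):
    """Return a short reason to retrain, or None if no retraining is due."""
    # Triggered retraining: any monitored feature above its threshold.
    drifted = sorted(
        feature for feature, score in drift_scores.items()
        if score > thresholds.get(feature, float("inf"))
    )
    if drifted:
        return "triggered: drift in " + ", ".join(drifted)

    # Scheduled retraining: the model has reached the maximum allowed age.
    if now - last_trained >= max_age:
        return f"scheduled: model age reached {max_age.days} days"
    return None


thresholds = {"user_age": 0.1}
now = datetime(2024, 1, 10)

print(retrain_reason(now, datetime(2024, 1, 8), {"user_age": 0.3}, thresholds))
print(retrain_reason(now, datetime(2024, 1, 1), {"user_age": 0.02}, thresholds))
print(retrain_reason(now, datetime(2024, 1, 8), {"user_age": 0.02}, thresholds))
```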

Practice Questions

📝
Question 1: An e-commerce company needs to generate product recommendations for all 10 million customers to send in a daily marketing email. Recommendations must be ready by 6 AM each day. Which serving approach should you use?

A. Vertex AI online prediction endpoint with autoscaling
B. Vertex AI batch prediction job scheduled at 2 AM
C. Cloud Run with a custom Flask serving container
D. BigQuery ML ML.PREDICT function
Answer: B. Batch prediction is designed for bulk scoring of millions of records. Scheduling the job at 2 AM leaves a four-hour window before the 6 AM deadline. Online prediction (A) would be inefficient and expensive for 10M individual requests. Cloud Run (C) with a Flask container is a request-serving setup, not managed batch-scoring infrastructure. BigQuery ML (D) could work if the model had been trained in BigQuery ML, but the question implies a Vertex AI model.
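
The timing in this answer can be checked with simple arithmetic. In the sketch below, the per-worker rate of 100 records per second and the 1.5x safety factor are made-up assumptions. Only the 10 million records and the 2 AM to 6 AM window come from the question.

```python
import math


def required_throughput(records, window_hours):
    """Records per second needed to finish within the window."""
    return records / (window_hours * 3600)


def workers_needed(records, window_hours, per_worker_rps, safety_factor=1.5):
    """Workers required, with headroom for startup time and retries."""
    needed = required_throughput(records, window_hours) * safety_factor
    return math.ceil(needed / per_worker_rps)


# 10 million customers, job starts at 2 AM, results due by 6 AM.
print(round(required_throughput(10_000_000, 4), 1))       # 694.4 records/second
print(workers_needed(10_000_000, 4, per_worker_rps=100))  # 11 workers
```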
📝
Question 2: You deployed a new fraud detection model and want to validate it with real traffic before fully replacing the old model. You want to compare both models' predictions but only serve the old model's responses to users. What deployment strategy should you use?

A. Canary deployment with 10% traffic split
B. A/B testing with 50/50 split
C. Shadow deployment (mirror traffic to new model)
D. Blue-green deployment with instant cutover
Answer: C. Shadow deployment sends a copy of all production traffic to the new model, but only returns the old model's predictions to users. This lets you compare both models' outputs on real data without any risk to users. Canary (A) and A/B (B) both serve the new model's predictions to some users, which is risky for fraud detection. Blue-green (D) is all-or-nothing.
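
The shadow pattern described in this answer can be sketched as a thin wrapper around two models. Everything here is hypothetical: the function names, the log structure, and the two threshold-based stand-in "models" exist only to show that users receive the old model's output while both outputs are recorded for offline comparison.

```python
def serve_with_shadow(request, primary, shadow, comparison_log):
    """Return the primary model's response; record the shadow's for later."""
    response = primary(request)
    try:
        shadow_response = shadow(request)
        comparison_log.append(
            {"request": request, "primary": response, "shadow": shadow_response}
        )
    except Exception as exc:
        # A failure in the shadow model must never affect the user.
        comparison_log.append(
            {"request": request, "primary": response, "shadow_error": repr(exc)}
        )
    return response


def disagreement_rate(comparison_log):
    """Fraction of compared requests where the two models disagree."""
    compared = [record for record in comparison_log if "shadow" in record]
    if not compared:
        return 0.0
    return sum(r["primary"] != r["shadow"] for r in compared) / len(compared)


def old_model(txn):
    """Stand-in for the current fraud model: flags large transactions."""
    return txn["amount"] > 1000


def new_model(txn):
    """Stand-in for the candidate fraud model: a stricter cutoff."""
    return txn["amount"] > 500


log = []
served = [serve_with_shadow({"amount": a}, old_model, new_model, log)
          for a in (100, 700, 2000)]
print(served)                  # only the old model's decisions reach users
print(disagreement_rate(log))  # share of requests where the models differ
```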
📝
Question 3: Your model monitoring dashboard shows that the "user_age" feature has a Jensen-Shannon divergence of 0.3 (threshold is 0.1) compared to the training data. The model's accuracy has not yet degraded. What should you do?

A. Ignore it since accuracy is still good
B. Immediately retrain the model with recent data
C. Investigate the cause, collect labeled data from the new distribution, and plan retraining
D. Increase the drift threshold to 0.5 to reduce alerts
Answer: C. Data drift is a leading indicator — accuracy degradation often follows. The correct approach is: (1) investigate why the distribution shifted (new user demographic? data pipeline bug?), (2) collect labels for the new distribution to measure actual impact, (3) plan retraining with the new data. Ignoring (A) is risky. Immediate retraining (B) without investigation may train on bad data. Raising thresholds (D) is sweeping the problem under the rug.
📝
Question 4: Your online prediction endpoint serves a TensorFlow model. During peak hours, latency increases from 50ms to 500ms. Which two actions would most effectively reduce latency? (Select two)

A. Enable autoscaling with a lower CPU utilization target
B. Switch from online to batch prediction
C. Add GPU accelerators to the serving machines
D. Increase the monitoring sampling rate
E. Reduce model size using quantization or distillation
Answer: A and E. Autoscaling (A) adds more replicas during peak hours to distribute load. Model optimization (E) through quantization (FP32 to INT8) or knowledge distillation directly reduces per-request latency. Batch prediction (B) changes the serving paradigm entirely. Monitoring (D) does not affect latency. GPUs (C) might help for large DL models but autoscaling is more directly effective for load-based latency.
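
A rough sketch of why options A and E help follows. The proportional scaling rule is a common autoscaling heuristic and is assumed here for illustration; it is not a documented description of how Vertex AI scales replicas. The replica counts, CPU figures, and parameter count are made-up numbers.

```python
import math


def replicas_needed(current_replicas, observed_cpu_pct, target_cpu_pct, max_replicas):
    """Proportional rule: scale replicas so average CPU returns to the target."""
    desired = math.ceil(current_replicas * observed_cpu_pct / target_cpu_pct)
    return min(max(desired, 1), max_replicas)


def model_size_mb(num_parameters, bits_per_parameter):
    """Approximate weight storage, ignoring metadata and activations."""
    return num_parameters * bits_per_parameter / 8 / 1e6


# Peak load: 2 replicas running at 90% CPU.
print(replicas_needed(2, 90, target_cpu_pct=60, max_replicas=10))  # 3
# A lower target makes the endpoint scale out sooner and further.
print(replicas_needed(2, 90, target_cpu_pct=40, max_replicas=10))  # 5

# Quantizing 100M parameters from FP32 to INT8 cuts weight storage by 4x.
print(model_size_mb(100_000_000, 32))  # 400.0
print(model_size_mb(100_000_000, 8))   # 100.0
```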