Model Serving Architecture
Model serving is where your ML model meets real users. The decisions you make here — synchronous vs. async, which inference server, how to deploy new versions — directly determine latency, cost, and reliability. This lesson covers production-tested patterns.
Synchronous vs. Asynchronous Inference
| Pattern | How It Works | Latency | Best For | Example |
|---|---|---|---|---|
| Synchronous | Client sends request, waits for response | Must be fast (< 500ms) | Real-time decisions | Search ranking, fraud detection, autocomplete |
| Async (webhook) | Client sends request, gets callback when done | Can be slow (seconds to minutes) | Batch/heavy processing | Image generation, document analysis, video processing |
| Streaming | Server sends partial results as they are generated | First token fast, total slow | LLM chat interfaces | ChatGPT-style responses, real-time transcription |
```python
# Async inference pattern with queue
# Good for: image generation, document processing, any task > 1 second
import json
import uuid

from fastapi import FastAPI
from pydantic import BaseModel
from redis import Redis

app = FastAPI()
redis = Redis(host="redis", port=6379)

class PredictRequest(BaseModel):
    """Request schema (fields depend on the model being served)."""
    input: dict

@app.post("/predict")
async def submit_prediction(request: PredictRequest):
    """Accept request immediately, process in background."""
    job_id = str(uuid.uuid4())
    redis.set(f"job:{job_id}:status", "queued")
    # Push to processing queue
    redis.rpush("inference_queue", json.dumps({
        "job_id": job_id,
        "input": request.dict(),
    }))
    return {"job_id": job_id, "status": "queued",
            "poll_url": f"/predict/{job_id}"}

@app.get("/predict/{job_id}")
async def get_prediction(job_id: str):
    """Poll for results."""
    status = redis.get(f"job:{job_id}:status")
    if status == b"completed":
        result = json.loads(redis.get(f"job:{job_id}:result"))
        return {"status": "completed", "result": result}
    return {"status": status.decode(), "retry_after_ms": 1000}

# Worker process (runs on GPU nodes)
def worker():
    """Pulls jobs from the queue and runs inference (blocking loop)."""
    model = load_model("s3://models/image-gen-v2/model.safetensors")
    while True:
        job_data = redis.blpop("inference_queue", timeout=30)
        if job_data:
            job = json.loads(job_data[1])
            redis.set(f"job:{job['job_id']}:status", "processing")
            result = model.generate(job["input"])
            redis.set(f"job:{job['job_id']}:result", json.dumps(result))
            redis.set(f"job:{job['job_id']}:status", "completed")
            redis.expire(f"job:{job['job_id']}:result", 3600)  # TTL: 1 hour
```
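The streaming pattern from the table can be sketched without a full server: yield tokens as they are produced and wrap each one in the Server-Sent Events (SSE) wire format. `fake_token_stream` is a stand-in for a real model's token generator, not an actual API.

```python
# Minimal sketch of streaming inference via Server-Sent Events.
# fake_token_stream simulates a model's incremental token output.
from typing import Iterator

def fake_token_stream(prompt: str) -> Iterator[str]:
    """Stand-in for a real model.generate_stream(prompt)."""
    for token in ["Hello", " ", "world", "!"]:
        yield token

def sse_events(prompt: str) -> Iterator[str]:
    """Wrap each token in the SSE format a browser EventSource expects."""
    for token in fake_token_stream(prompt):
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # sentinel so the client knows the stream ended
```

In FastAPI this generator would typically be returned as `StreamingResponse(sse_events(prompt), media_type="text/event-stream")`, so the client renders tokens as they arrive rather than waiting for the full completion.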
Model Server Comparison
Choosing the right inference server saves weeks of engineering and can cut inference costs by 50% or more through optimizations like dynamic batching and hardware-specific compilation.
| Server | Framework | Best For | Dynamic Batching | Multi-Model | GPU Optimization |
|---|---|---|---|---|---|
| TorchServe | PyTorch | PyTorch models, custom handlers | Yes | Yes | TorchScript, torch.compile |
| Triton | Any (ONNX, TF, PyTorch, TensorRT) | Multi-framework, max throughput | Yes (advanced) | Yes (ensemble) | TensorRT, CUDA graphs |
| vLLM | LLM-specific | LLM serving (Llama, Mistral, etc.) | Continuous batching | Single model | PagedAttention, speculative |
| TF Serving | TensorFlow | TensorFlow/Keras models | Yes | Yes | XLA compilation |
| BentoML | Any | Easy deployment, API-first | Adaptive batching | Yes | Via ONNX Runtime |
| Ray Serve | Any | Complex pipelines, multi-step | Custom | Yes (DAG) | Via backends |
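To see what the dynamic-batching column above refers to, here is a minimal sketch of the core loop: hold incoming requests briefly, then run one batched forward pass. The `predict_batch` function and the `max_batch_size`/`max_wait_ms` knobs are illustrative, not any particular server's real API.

```python
# Minimal dynamic-batching sketch: collect up to max_batch_size requests
# or wait at most max_wait_ms, then run a single batched inference call.
import time
from queue import Queue, Empty

def predict_batch(inputs: list[str]) -> list[str]:
    """Stand-in for a real batched model call."""
    return [f"pred({x})" for x in inputs]

def batching_loop(requests: Queue, max_batch_size: int = 8,
                  max_wait_ms: int = 10) -> list[str]:
    """Drain the request queue into one batch, bounded by size and time."""
    batch: list[str] = []
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch_size:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break  # waited long enough; serve what we have
        try:
            batch.append(requests.get(timeout=timeout))
        except Empty:
            break
    return predict_batch(batch)
```

The size/latency trade-off is the whole game: a larger `max_wait_ms` yields bigger batches and better GPU utilization at the cost of added tail latency, which is why production servers expose both knobs.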
Canary Deployments and A/B Testing for Models
Deploying a new model version is riskier than deploying new code. A model that tests well offline can perform poorly on live traffic due to distribution shift, missing features, or edge cases not in the test set.
Shadow Mode (Day 1–3)
Deploy the new model alongside the production model. Both receive the same requests, but only the production model's responses are served to users. Log the new model's predictions for offline comparison. This catches crashes, latency regressions, and obvious quality issues with zero user impact.
Canary (Day 3–7)
Route 5% of traffic to the new model. Monitor model-specific metrics (accuracy, latency, error rate) and business metrics (CTR, conversion). Automated rollback if any metric degrades beyond threshold.
Gradual Rollout (Day 7–14)
Increase traffic: 5% → 25% → 50% → 100%. At each step, validate metrics for 24+ hours before increasing. The full rollout takes 1–2 weeks for critical models.
A/B Test (Optional, 2–4 weeks)
For revenue-critical models, run a proper A/B test with statistical significance. Hold 50% on old model, 50% on new. Measure business outcomes with proper experiment design (randomization unit, minimum detectable effect, power analysis).
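The canary step above can be sketched in two pieces: a deterministic traffic split (so a given user always hits the same model version) and an automated rollback check. The specific metrics and thresholds here are illustrative assumptions, not fixed recommendations.

```python
# Sketch of canary routing plus an automated rollback check.
# Thresholds and metric names are illustrative.
import hashlib

def assign_model(user_id: str, canary_pct: float) -> str:
    """Hash-based split: the same user always lands in the same bucket."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_pct else "production"

def should_rollback(prod: dict, canary: dict) -> bool:
    """Roll back if the canary degrades beyond threshold on any metric."""
    return (
        canary["error_rate"] > prod["error_rate"] * 1.5        # +50% errors
        or canary["p99_latency_ms"] > prod["p99_latency_ms"] * 1.2  # +20% p99
        or canary["ctr"] < prod["ctr"] * 0.95                  # -5% CTR
    )
```

Hashing the user ID (rather than random sampling per request) matters for the A/B-test phase too: it keeps the randomization unit at the user level, so one person never sees a mix of both models' behavior.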
Multi-Model Serving and Model Routing
Production systems often serve multiple models simultaneously. A model router directs each request to the appropriate model based on the input characteristics.
```python
# Multi-model routing architecture
# Route requests to different models based on complexity/cost trade-offs
class ModelRouter:
    def __init__(self):
        self.models = {
            "fast": load_model("distilbert-base"),          # 2ms, cheap, 90% accuracy
            "balanced": load_model("bert-base"),            # 15ms, moderate, 95% accuracy
            "accurate": load_model("bert-large-ensemble"),  # 80ms, expensive, 98% accuracy
        }

    def route(self, request: PredictRequest) -> str:
        """Route based on request characteristics."""
        # High-value customers get the best model
        if request.customer_tier == "enterprise":
            return "accurate"
        # Short, simple inputs use the fast model
        if len(request.text) < 50 and request.confidence_threshold < 0.9:
            return "fast"
        # Default to balanced
        return "balanced"

    async def predict(self, request: PredictRequest):
        model_name = self.route(request)
        result = await self.models[model_name].predict(request)
        # If the fast model is uncertain, escalate to the balanced model
        if model_name == "fast" and result.confidence < 0.7:
            result = await self.models["balanced"].predict(request)
        return result

# Result: 70% of requests hit the fast model → massive cost savings
# Only 10% of requests need the expensive model
```
Production Kubernetes Deployment for Model Serving
Here is a production-ready Kubernetes deployment for a GPU-based model serving endpoint with health checks, resource limits, and auto-scaling.
```yaml
# model-serving-deployment.yaml
# Production deployment for Triton Inference Server on GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: recommendation-model
  labels:
    app: recommendation-model
    version: v2.3.1
spec:
  replicas: 3  # Minimum replicas
  selector:
    matchLabels:
      app: recommendation-model
  template:
    metadata:
      labels:
        app: recommendation-model
        version: v2.3.1
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args:
            - tritonserver
            - --model-repository=s3://models/recommendation/v2.3.1
            - --model-control-mode=explicit
            - --load-model=recommendation_model
            - --log-verbose=0
          ports:
            - containerPort: 8000  # HTTP
              name: http
            - containerPort: 8001  # gRPC
              name: grpc
            - containerPort: 8002  # Metrics
              name: metrics
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"  # Request 1 GPU
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 60  # Model loading takes time
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 90
            periodSeconds: 5
            failureThreshold: 5
          startupProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 30  # Allow 5 min for model download + load
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
# Horizontal Pod Autoscaler for GPU inference
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: recommendation-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: recommendation-model
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton_inference_queue_duration_us  # Custom metric
        target:
          type: AverageValue
          averageValue: "5000"  # Scale up when avg queue time > 5ms
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # Wait 1 min before scaling up
      policies:
        - type: Pods
          value: 2  # Add max 2 pods at a time
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1  # Remove 1 pod at a time
          periodSeconds: 300  # Every 5 minutes
```
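The HPA above consumes a custom per-pod metric, which Kubernetes cannot read directly from Triton; a metrics adapter must expose it through the custom metrics API. The sketch below shows the general shape of a prometheus-adapter rule; the Triton series name, label mappings, and query are assumptions to verify against your own Prometheus setup.

```yaml
# prometheus-adapter rule (sketch) — maps a Triton queue-time series to the
# custom metric name the HPA reads. Series and label names are assumptions.
rules:
  - seriesQuery: 'nv_inference_queue_duration_us{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      as: "triton_inference_queue_duration_us"
    metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
```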
GPU pods can take minutes to become ready (image pull, model download, warm-up), so set `scaleUp.stabilizationWindowSeconds` appropriately and keep enough warm replicas to handle traffic spikes while new pods start.