Model Serving Architecture
Model serving is where your ML model meets real users. The decisions you make here — synchronous vs. async, which inference server, how to deploy new versions — directly determine latency, cost, and reliability. This lesson covers production-tested patterns.
Synchronous vs. Asynchronous Inference
| Pattern | How It Works | Latency | Best For | Example |
|---|---|---|---|---|
| Synchronous | Client sends request, waits for response | Must be fast (< 500ms) | Real-time decisions | Search ranking, fraud detection, autocomplete |
| Async (webhook) | Client sends request, gets callback when done | Can be slow (seconds to minutes) | Batch/heavy processing | Image generation, document analysis, video processing |
| Streaming | Server sends partial results as they are generated | First token fast, total slow | LLM chat interfaces | ChatGPT-style responses, real-time transcription |
```python
# Async inference pattern with queue
# Good for: image generation, document processing, any task > 1 second
import json
import uuid

from fastapi import FastAPI
from pydantic import BaseModel
from redis import Redis

app = FastAPI()
redis = Redis(host="redis", port=6379)

class PredictRequest(BaseModel):
    """Request schema (fields depend on the model being served)."""
    input: dict

@app.post("/predict")
async def submit_prediction(request: PredictRequest):
    """Accept request immediately, process in background."""
    job_id = str(uuid.uuid4())
    redis.set(f"job:{job_id}:status", "queued")
    # Push to processing queue
    redis.rpush("inference_queue", json.dumps({
        "job_id": job_id,
        "input": request.dict(),
    }))
    return {"job_id": job_id, "status": "queued",
            "poll_url": f"/predict/{job_id}"}

@app.get("/predict/{job_id}")
async def get_prediction(job_id: str):
    """Poll for results."""
    status = redis.get(f"job:{job_id}:status")
    if status == b"completed":
        result = json.loads(redis.get(f"job:{job_id}:result"))
        return {"status": "completed", "result": result}
    return {"status": status.decode(), "retry_after_ms": 1000}

# Worker process (runs on GPU nodes)
def worker():
    """Pulls jobs from the queue and runs inference (blocking loop)."""
    model = load_model("s3://models/image-gen-v2/model.safetensors")
    while True:
        job_data = redis.blpop("inference_queue", timeout=30)
        if job_data:
            job = json.loads(job_data[1])
            redis.set(f"job:{job['job_id']}:status", "processing")
            result = model.generate(job["input"])
            redis.set(f"job:{job['job_id']}:result", json.dumps(result))
            redis.set(f"job:{job['job_id']}:status", "completed")
            redis.expire(f"job:{job['job_id']}:result", 3600)  # TTL: 1 hour
```
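The streaming pattern from the table can be sketched without a full server: yield tokens as they are produced and wrap each one in the Server-Sent Events (SSE) wire format. `fake_token_stream` is a stand-in for a real model's token generator, not an actual API.

```python
# Minimal sketch of streaming inference via Server-Sent Events.
# fake_token_stream simulates a model's incremental token output.
from typing import Iterator

def fake_token_stream(prompt: str) -> Iterator[str]:
    """Stand-in for a real model.generate_stream(prompt)."""
    for token in ["Hello", " ", "world", "!"]:
        yield token

def sse_events(prompt: str) -> Iterator[str]:
    """Wrap each token in the SSE format a browser EventSource expects."""
    for token in fake_token_stream(prompt):
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # sentinel so the client knows the stream ended
```

In FastAPI this generator would typically be returned as `StreamingResponse(sse_events(prompt), media_type="text/event-stream")`, so the client renders tokens as they arrive rather than waiting for the full completion.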
Model Server Comparison
Choosing the right inference server saves weeks of engineering and can cut inference costs by 50% or more through optimizations like dynamic batching and hardware-specific compilation.
| Server | Framework | Best For | Dynamic Batching | Multi-Model | GPU Optimization |
|---|---|---|---|---|---|
| TorchServe | PyTorch | PyTorch models, custom handlers | Yes | Yes | TorchScript, torch.compile |
| Triton | Any (ONNX, TF, PyTorch, TensorRT) | Multi-framework, max throughput | Yes (advanced) | Yes (ensemble) | TensorRT, CUDA graphs |
| vLLM | LLM-specific | LLM serving (Llama, Mistral, etc.) | Continuous batching | Single model | PagedAttention, speculative |
| TF Serving | TensorFlow | TensorFlow/Keras models | Yes | Yes | XLA compilation |
| BentoML | Any | Easy deployment, API-first | Adaptive batching | Yes | Via ONNX Runtime |
| Ray Serve | Any | Complex pipelines, multi-step | Custom | Yes (DAG) | Via backends |
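To see what the dynamic-batching column above refers to, here is a minimal sketch of the core loop: hold incoming requests briefly, then run one batched forward pass. The `predict_batch` function and the `max_batch_size`/`max_wait_ms` knobs are illustrative, not any particular server's real API.

```python
# Minimal dynamic-batching sketch: collect up to max_batch_size requests
# or wait at most max_wait_ms, then run a single batched inference call.
import time
from queue import Queue, Empty

def predict_batch(inputs: list[str]) -> list[str]:
    """Stand-in for a real batched model call."""
    return [f"pred({x})" for x in inputs]

def batching_loop(requests: Queue, max_batch_size: int = 8,
                  max_wait_ms: int = 10) -> list[str]:
    """Drain the request queue into one batch, bounded by size and time."""
    batch: list[str] = []
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch_size:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break  # waited long enough; serve what we have
        try:
            batch.append(requests.get(timeout=timeout))
        except Empty:
            break
    return predict_batch(batch)
```

The size/latency trade-off is the whole game: a larger `max_wait_ms` yields bigger batches and better GPU utilization at the cost of added tail latency, which is why production servers expose both knobs.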
Canary Deployments and A/B Testing for Models
Deploying a new model version is riskier than deploying new code. A model that tests well offline can perform poorly on live traffic due to distribution shift, missing features, or edge cases not in the test set.
Shadow Mode (Day 1–3)
Deploy the new model alongside the production model. Both receive the same requests, but only the production model's responses are served to users. Log the new model's predictions for offline comparison. This catches crashes, latency regressions, and obvious quality issues with zero user impact.
Canary (Day 3–7)
Route 5% of traffic to the new model. Monitor model-specific metrics (accuracy, latency, error rate) and business metrics (CTR, conversion). Automated rollback if any metric degrades beyond threshold.
Gradual Rollout (Day 7–14)
Increase traffic: 5% → 25% → 50% → 100%. At each step, validate metrics for 24+ hours before increasing. The full rollout takes 1–2 weeks for critical models.
A/B Test (Optional, 2–4 weeks)
For revenue-critical models, run a proper A/B test with statistical significance. Hold 50% on old model, 50% on new. Measure business outcomes with proper experiment design (randomization unit, minimum detectable effect, power analysis).
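The canary step above can be sketched in two pieces: a deterministic traffic split (so a given user always hits the same model version) and an automated rollback check. The specific metrics and thresholds here are illustrative assumptions, not fixed recommendations.

```python
# Sketch of canary routing plus an automated rollback check.
# Thresholds and metric names are illustrative.
import hashlib

def assign_model(user_id: str, canary_pct: float) -> str:
    """Hash-based split: the same user always lands in the same bucket."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_pct else "production"

def should_rollback(prod: dict, canary: dict) -> bool:
    """Roll back if the canary degrades beyond threshold on any metric."""
    return (
        canary["error_rate"] > prod["error_rate"] * 1.5        # +50% errors
        or canary["p99_latency_ms"] > prod["p99_latency_ms"] * 1.2  # +20% p99
        or canary["ctr"] < prod["ctr"] * 0.95                  # -5% CTR
    )
```

Hashing the user ID (rather than random sampling per request) matters for the A/B-test phase too: it keeps the randomization unit at the user level, so one person never sees a mix of both models' behavior.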
Multi-Model Serving and Model Routing
Production systems often serve multiple models simultaneously. A model router directs each request to the appropriate model based on the input characteristics.
```python
# Multi-model routing architecture
# Route requests to different models based on complexity/cost trade-offs
class ModelRouter:
    def __init__(self):
        self.models = {
            "fast": load_model("distilbert-base"),          # 2ms, cheap, 90% accuracy
            "balanced": load_model("bert-base"),            # 15ms, moderate, 95% accuracy
            "accurate": load_model("bert-large-ensemble"),  # 80ms, expensive, 98% accuracy
        }

    def route(self, request: PredictRequest) -> str:
        """Route based on request characteristics."""
        # High-value customers get the best model
        if request.customer_tier == "enterprise":
            return "accurate"
        # Short, simple inputs use the fast model
        if len(request.text) < 50 and request.confidence_threshold < 0.9:
            return "fast"
        # Default to balanced
        return "balanced"

    async def predict(self, request: PredictRequest):
        model_name = self.route(request)
        result = await self.models[model_name].predict(request)
        # If the fast model is uncertain, escalate to the balanced model
        if model_name == "fast" and result.confidence < 0.7:
            result = await self.models["balanced"].predict(request)
        return result

# Result: 70% of requests hit the fast model → massive cost savings
# Only 10% of requests need the expensive model
```
Production Kubernetes Deployment for Model Serving
Here is a production-ready Kubernetes deployment for a GPU-based model serving endpoint with health checks, resource limits, and auto-scaling.
```yaml
# model-serving-deployment.yaml
# Production deployment for Triton Inference Server on GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: recommendation-model
  labels:
    app: recommendation-model
    version: v2.3.1
spec:
  replicas: 3  # Minimum replicas
  selector:
    matchLabels:
      app: recommendation-model
  template:
    metadata:
      labels:
        app: recommendation-model
        version: v2.3.1
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args:
            - tritonserver
            - --model-repository=s3://models/recommendation/v2.3.1
            - --model-control-mode=explicit
            - --load-model=recommendation_model
            - --log-verbose=0
          ports:
            - containerPort: 8000  # HTTP
              name: http
            - containerPort: 8001  # gRPC
              name: grpc
            - containerPort: 8002  # Metrics
              name: metrics
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"  # Request 1 GPU
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 60  # Model loading takes time
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 90
            periodSeconds: 5
            failureThreshold: 5
          startupProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 30  # Allow 5 min for model download + load
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
# Horizontal Pod Autoscaler for GPU inference
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: recommendation-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: recommendation-model
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton_inference_queue_duration_us  # Custom metric
        target:
          type: AverageValue
          averageValue: "5000"  # Scale up when avg queue time > 5ms
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # Wait 1 min before scaling up
      policies:
        - type: Pods
          value: 2  # Add max 2 pods at a time
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1  # Remove 1 pod at a time
          periodSeconds: 300  # Every 5 minutes
```
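The HPA above consumes a custom per-pod metric, which Kubernetes cannot read directly from Triton; a metrics adapter must expose it through the custom metrics API. The sketch below shows the general shape of a prometheus-adapter rule; the Triton series name, label mappings, and query are assumptions to verify against your own Prometheus setup.

```yaml
# prometheus-adapter rule (sketch) — maps a Triton queue-time series to the
# custom metric name the HPA reads. Series and label names are assumptions.
rules:
  - seriesQuery: 'nv_inference_queue_duration_us{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      as: "triton_inference_queue_duration_us"
    metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
```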
GPU pods can take minutes to become ready (image pull, model download, warm-up), so set `scaleUp.stabilizationWindowSeconds` appropriately and keep enough warm replicas to handle traffic spikes while new pods start.