Advanced

Model Serving & Infrastructure

Multi-model applications need robust infrastructure for serving multiple AI models simultaneously. This lesson covers the full spectrum — from API-based services to self-hosted GPU inference, optimization techniques, and cost management.

API-Based vs Self-Hosted Serving

The first architectural decision is whether to use managed API services or host models yourself:

| Approach | Pros | Cons | Best For |
|---|---|---|---|
| API-based (OpenAI, Anthropic, Google) | Zero infrastructure, instant scaling, latest models | Per-token costs, data leaves your network, rate limits | Most production apps, prototyping |
| Self-hosted (vLLM, TGI, Triton) | Full control, data privacy, fixed costs at scale | GPU costs, ops complexity, model updates are manual | High-volume, regulated industries, custom models |
| Hybrid | Best of both: route by task complexity | More complex routing logic | Cost-optimized production systems |
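
The "fixed costs at scale" trade-off in the table can be made concrete with a rough break-even estimate. The sketch below is illustrative only: the helper name and the example prices are assumptions, not quotes from any provider, and it ignores ops overhead and idle capacity.

```python
def break_even_tokens_per_month(gpu_monthly_cost_usd: float,
                                api_price_per_1m_tokens_usd: float) -> float:
    """Monthly token volume above which a fixed-cost GPU deployment is cheaper
    than paying per token (hypothetical helper; ignores ops and idle capacity)."""
    if api_price_per_1m_tokens_usd <= 0:
        raise ValueError("API price must be positive")
    return gpu_monthly_cost_usd / api_price_per_1m_tokens_usd * 1_000_000


# Assumed example numbers: a $2,000/month GPU node vs. $3.00 per 1M API tokens.
volume = break_even_tokens_per_month(2000.0, 3.00)
print(f"Break-even at about {volume / 1e6:.0f}M tokens per month")
```

Below that volume the API is cheaper; above it, self-hosting starts to pay off, provided the GPU can actually sustain the load.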

Self-Hosted Inference Engines

vLLM

vLLM is one of the most widely used open-source LLM inference engines. Its key innovation is PagedAttention, which stores the KV cache in fixed-size blocks, much as an operating system pages virtual memory. This reduces memory fragmentation, so more requests fit in each batch and throughput rises.
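
To make the virtual-memory analogy concrete, here is a toy block allocator in the spirit of PagedAttention. It is a conceptual sketch, not vLLM's implementation: the class name, block size, and methods are all assumptions chosen for illustration.

```python
class PagedKVCache:
    """Toy model of paged KV-cache allocation (illustrative, not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                  # tokens per physical block
        self.free_blocks = list(range(num_blocks))    # pool of physical block ids
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical blocks
        self.lengths: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:             # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):            # a 20-token sequence needs ceil(20 / 16) = 2 blocks
    cache.append_token("req-1")
print(cache.block_tables["req-1"], len(cache.free_blocks))
```

Because blocks are handed out on demand instead of reserving the full context length per request, each sequence wastes at most one partially filled block.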

Python — Serving a Model with vLLM + FastAPI
# Install: pip install vllm fastapi uvicorn

# Option 1: Use vLLM's built-in OpenAI-compatible server
# Terminal: python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.1-8B-Instruct \
#   --tensor-parallel-size 1 \
#   --max-model-len 8192 \
#   --port 8000

# Option 2: Custom FastAPI wrapper for multi-model serving
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from typing import Optional

app = FastAPI(title="Multi-Model Inference API")

# Load multiple models on startup
models = {}

@app.on_event("startup")
async def load_models():
    # Small model for simple tasks (classification, extraction)
    models["small"] = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        tensor_parallel_size=1,
        max_model_len=4096,
        gpu_memory_utilization=0.4,  # Use 40% of GPU memory
    )
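    # Large model for complex reasoning
    models["large"] = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        tensor_parallel_size=4,      # Spread across 4 GPUs
        max_model_len=8192,
        # The small model already reserves 0.4 of one GPU, so fractions on a
        # shared GPU must sum to less than 1.0 (0.9 here would fail to allocate).
        # 0.5 assumes 80 GB GPUs: 4 x 40 GB covers ~140 GB of FP16 weights.
        gpu_memory_utilization=0.5,
    )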

class CompletionRequest(BaseModel):
    prompt: str
    model_size: str = "small"       # "small" or "large"
    max_tokens: int = 1024
    temperature: float = 0.7
    top_p: float = 0.95

class CompletionResponse(BaseModel):
    text: str
    model: str
    tokens_generated: int
    latency_ms: float

@app.post("/v1/completions", response_model=CompletionResponse)
async def generate(request: CompletionRequest):
    import time
    if request.model_size not in models:
        raise HTTPException(status_code=400, detail=f"Unknown model: {request.model_size}")

    sampling_params = SamplingParams(
        max_tokens=request.max_tokens,
        temperature=request.temperature,
        top_p=request.top_p,
    )
    # LLM.generate() is synchronous, so it blocks the event loop and requests are
    # handled one at a time. For concurrent traffic, prefer the built-in server
    # (Option 1), which batches requests continuously.
    start = time.perf_counter()  # Monotonic clock, suited to measuring durations
    outputs = models[request.model_size].generate([request.prompt], sampling_params)
    elapsed = (time.perf_counter() - start) * 1000

    completion = outputs[0].outputs[0]
    return CompletionResponse(
        text=completion.text,
        model=request.model_size,
        tokens_generated=len(completion.token_ids),
        latency_ms=round(elapsed, 2),
    )

# Run: uvicorn server:app --host 0.0.0.0 --port 8000

Text Generation Inference (TGI)

TGI by Hugging Face is a production-ready inference server with built-in support for continuous batching, token streaming, and quantization.

Bash — Running TGI with Docker
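# Pull and run TGI with NVIDIA GPU support.
# Llama 3.1 is a gated model, so pass a Hugging Face token that has access to it.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /data:/data \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 4096
# To serve 4-bit weights, point --model-id at an AWQ-quantized checkpoint and
# add --quantize awq; the flag expects weights that are already quantized.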

# Query the server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain PagedAttention in 3 sentences"}],
    "max_tokens": 256,
    "stream": true
  }'
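
With "stream": true, an OpenAI-compatible endpoint returns server-sent events rather than a single JSON body. The sketch below shows one way to turn those event lines into text. It assumes the usual OpenAI chunk shape (`choices[0].delta.content`) and a `[DONE]` sentinel; the function name and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                       # skip blank separators and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":            # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]


# Hand-written sample lines standing in for a real response body.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": "Paged"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Attention"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In a real client, the lines would come from the HTTP response, for example `response.iter_lines(decode_unicode=True)` when using the `requests` library with `stream=True`.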

BentoML Multi-Model Service

BentoML excels at packaging multiple models into a single deployable service with independent scaling:

Python — BentoML Multi-Model Service
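import bentoml
import numpy as np

# Assumption: EmbeddingService, LLMService, and RerankerService are separate
# @bentoml.service classes defined elsewhere, e.g.:
# from services import EmbeddingService, LLMService, RerankerService  # hypothetical module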

# service.py - Define a multi-model service
@bentoml.service(
    resources={"gpu": 1, "memory": "16Gi"},
    traffic={"timeout": 300},
)
class MultiModelService:
    # Embedding model for vector search
    embedding_model = bentoml.depends(EmbeddingService)
    # LLM for text generation
    llm = bentoml.depends(LLMService)
    # Reranker for improving retrieval quality
    reranker = bentoml.depends(RerankerService)

    @bentoml.api
    async def rag_query(self, query: str, documents: list[str]) -> dict:
        # Step 1: Embed the query
        query_embedding = await self.embedding_model.embed(query)

        # Step 2: Embed all documents
        doc_embeddings = await self.embedding_model.embed_batch(documents)

        # Step 3: Find top-k similar documents
        similarities = np.dot(doc_embeddings, query_embedding)
        top_indices = np.argsort(similarities)[-10:][::-1]
        candidates = [documents[i] for i in top_indices]

        # Step 4: Rerank candidates
        reranked = await self.reranker.rerank(query, candidates)
        top_docs = reranked[:3]

        # Step 5: Generate answer with LLM
        context = "\n".join(top_docs)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        answer = await self.llm.generate(prompt)

        return {"answer": answer, "sources": top_docs}

# bentofile.yaml
# service: "service:MultiModelService"
# include: ["*.py"]
# python:
#   packages: [torch, transformers, numpy, vllm]
# GPU requirements are declared via resources={"gpu": 1} in @bentoml.service above.

# Build and deploy:
# bentoml build
# bentoml serve .
# bentoml containerize multi_model_service:latest

Serving Solutions Comparison

| Solution | Type | GPU Support | Scaling | Cost | Best For |
|---|---|---|---|---|---|
| vLLM | Self-hosted | NVIDIA, AMD | Manual / K8s | GPU cost only | High-throughput LLM inference |
| TGI | Self-hosted | NVIDIA | Manual / K8s | GPU cost only | Hugging Face model deployment |
| Triton | Self-hosted | NVIDIA | K8s + Triton | GPU cost only | Multi-framework model serving |
| BentoML | Self-hosted / Cloud | NVIDIA | BentoCloud | Free / Cloud pricing | Multi-model packaging |
| Modal | Serverless GPU | NVIDIA (A100, H100) | Automatic | Per-second billing | Bursty workloads |
| Replicate | Serverless GPU | NVIDIA | Automatic | Per-prediction | Quick deployment of open models |
| Together AI | API | Managed | Automatic | Per-token | Open-model API access |
| RunPod | GPU Cloud | NVIDIA | Manual | Per-hour GPU | Cost-effective GPU rental |

Model Optimization Techniques

Running large models efficiently requires optimization. The main techniques reduce memory usage and increase throughput:

Quantization

Quantization reduces the precision of model weights from 16-bit to 8-bit or 4-bit, cutting weight memory to roughly one half or one quarter, usually at a small cost in output quality:

Python — Quantization Methods Comparison
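# Note: the example repo IDs in this block are illustrative. Confirm that a
# quantized checkpoint with that name exists on the Hugging Face Hub before loading it.

# GPTQ - Post-training quantization (calibration-based)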
# Best quality, slower quantization process
# Use with: vLLM, TGI, transformers
# Example: TheBloke/Llama-3.1-8B-Instruct-GPTQ

# AWQ - Activation-aware Weight Quantization
# Fast inference, good quality, smaller files
# Use with: vLLM, TGI
# Example: TheBloke/Llama-3.1-8B-Instruct-AWQ

# GGUF - Format for llama.cpp (CPU + GPU inference)
# Runs on consumer hardware, CPU offloading
# Use with: llama.cpp, Ollama, LM Studio
# Example: TheBloke/Llama-3.1-8B-Instruct-GGUF (Q4_K_M variant)

# Approximate memory for Llama 3.1 8B weights (KV cache and activations are extra):
# FP16:  ~16 GB VRAM
# INT8:  ~8 GB VRAM  (GPTQ-8bit)
# INT4:  ~4 GB VRAM  (AWQ-4bit / GPTQ-4bit)
# Q4_K_M: ~5 GB RAM  (GGUF, can run on CPU)

# Loading a GPTQ model with vLLM
from vllm import LLM

model = LLM(
    model="TheBloke/Llama-3.1-8B-Instruct-GPTQ",
    quantization="gptq",
    dtype="half",
    max_model_len=4096,
)

# Loading an AWQ model with vLLM
model_awq = LLM(
    model="TheBloke/Llama-3.1-8B-Instruct-AWQ",
    quantization="awq",
    max_model_len=4096,
)
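
The memory figures in the comparison above follow from simple arithmetic: parameter count times bits per weight. A minimal sketch, assuming weights dominate and treating 1 GB as 10^9 bytes (the function name is illustrative):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gb(8, bits):.0f} GB")
```

Real usage is higher than this estimate because the KV cache, activations, and quantization metadata such as scales and zero points also take memory.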

GPU Management

Efficient GPU utilization is critical for cost-effective multi-model serving:

  • Tensor Parallelism: Split a single large model across multiple GPUs. A 70B model in FP16 needs about 140 GB for weights alone, so it is commonly served on 4x A100 80GB GPUs to leave room for the KV cache.
  • GPU Sharing: Run multiple small models on one GPU by limiting gpu_memory_utilization per model (e.g., 0.3 each for 3 models).
  • NVIDIA MPS (Multi-Process Service): Allows multiple processes to share a single GPU with minimal overhead.
  • NVIDIA MIG (Multi-Instance GPU): Partition an A100/H100 into isolated GPU instances for different models.
Bash — Docker with NVIDIA GPU Support
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Run with specific GPU allocation
docker run --gpus '"device=0,1"' -p 8000:8000 my-model-server

# Check GPU utilization inside container
docker exec -it my-container nvidia-smi
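
The tensor-parallelism and GPU-sharing guidance above can be sanity-checked before any container is started. The helpers below are a sizing heuristic under stated assumptions, not a vLLM API: the 0.9 usable fraction, the 20% KV-cache reserve, and the power-of-two rounding (tensor-parallel sizes usually need to divide the number of attention heads) are all illustrative choices.

```python
import math


def min_tensor_parallel_size(weights_gb: float, gpu_memory_gb: float,
                             usable_fraction: float = 0.9,
                             kv_cache_reserve: float = 0.2) -> int:
    """Smallest power-of-two GPU count whose usable memory fits the weights
    while leaving a share of it for the KV cache (hypothetical heuristic)."""
    usable_per_gpu = gpu_memory_gb * usable_fraction * (1 - kv_cache_reserve)
    needed = math.ceil(weights_gb / usable_per_gpu)
    return 1 << (needed - 1).bit_length()    # round up to 1, 2, 4, 8, ...


def fits_on_one_gpu(fractions: list[float]) -> bool:
    """Check that per-model gpu_memory_utilization values can share one GPU."""
    return sum(fractions) <= 1.0


print(min_tensor_parallel_size(140, 80))   # 70B in FP16 on 80 GB GPUs
print(fits_on_one_gpu([0.3, 0.3, 0.3]))    # three small models on one GPU
print(fits_on_one_gpu([0.4, 0.9]))         # oversubscribed
```

Under these assumptions, a 70B FP16 model lands on four 80 GB GPUs, which matches the tensor_parallel_size=4 setting used earlier.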

Autoscaling Strategies

Different scaling triggers work best for different serving patterns:

  • Request-based: Scale up when requests per second exceed a threshold. Simple but can lag behind demand spikes.
  • Queue-based: Scale based on pending request queue depth. Better for bursty workloads — scale when queue grows, scale down when empty.
  • GPU utilization: Scale when GPU compute utilization exceeds 80%. Works well for steady workloads.
  • Latency-based: Scale up when P95 latency exceeds your SLA target. Directly tied to user experience.
YAML — Kubernetes HPA for GPU-Based Scaling
# Horizontal Pod Autoscaler for model serving
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-serving
  minReplicas: 1
  maxReplicas: 8
  metrics:
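    # Scale on a custom GPU utilization metric. This assumes a custom metrics
    # adapter (e.g., Prometheus Adapter fed by a GPU exporter) exposes this name.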
    - type: Pods
      pods:
        metric:
          name: gpu_utilization_percent
        target:
          type: AverageValue
          averageValue: "80"
    # Also scale on request queue depth
    - type: External
      external:
        metric:
          name: request_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
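
To see how the autoscaler turns these targets into a replica count, the sketch below applies the rule documented for the Kubernetes HPA: desired replicas = ceil(current replicas × current value / target value), taking the largest proposal when several metrics are configured. It deliberately ignores the tolerance band, stabilization windows, and the rate limits under `behavior`; the function names and sample readings are assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Core HPA rule: ceil(current * current_value / target), clamped to bounds."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


def desired_for_all_metrics(current_replicas: int,
                            metrics: list[tuple[float, float]]) -> int:
    """With several metrics, the largest per-metric proposal wins."""
    return max(desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics)


# Assumed readings: 2 pods at 95% average GPU utilization (target 80)
# and an average queue depth of 25 (target 10).
print(desired_for_all_metrics(2, [(95, 80), (25, 10)]))
```

With the scaleUp policy above (at most 2 pods per 60 seconds), the deployment would approach that count in steps rather than in a single jump.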

Caching Strategies

Caching is one of the highest-leverage optimizations for multi-model systems:

  • Prompt caching: Provider-level feature (Anthropic, OpenAI) that caches the processed prompt prefix, reducing latency and cost for repeated system prompts.
  • KV cache: GPU-level cache of key-value attention states. vLLM's PagedAttention optimizes this automatically.
  • Semantic caching: Cache responses by embedding similarity. If a new query is semantically similar to a cached one, return the cached response.
  • Response caching: Simple key-value cache (Redis) for exact-match queries. Eliminates model calls entirely for repeated questions.
Python — Semantic Cache with Redis + Embeddings
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)

SIMILARITY_THRESHOLD = 0.95  # How similar queries must be to hit cache
CACHE_TTL_SECONDS = 3600     # Entries expire after one hour

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_completion(query: str, system_prompt: str = "") -> str:
    # Step 1: Embed the query
    query_embedding = get_embedding(query)

    # Step 2: Check cache for similar queries.
    # scan_iter avoids blocking Redis the way KEYS does. This is still a linear
    # scan, so a vector index is a better fit once the cache grows large.
    for key in cache.scan_iter("semantic_cache:*"):
        raw = cache.get(key)
        if raw is None:  # Entry expired between SCAN and GET
            continue
        entry = json.loads(raw)
        if entry.get("system_prompt", "") != system_prompt:
            continue     # Never reuse an answer produced under a different system prompt
        similarity = cosine_similarity(query_embedding, entry["embedding"])
        if similarity >= SIMILARITY_THRESHOLD:
            print(f"Cache hit! Similarity: {similarity:.3f}")
            return entry["response"]

    # Step 3: Cache miss - call the model
    messages = [{"role": "user", "content": query}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    result = response.choices[0].message.content

    # Step 4: Store in cache with TTL.
    # Use a stable digest: the built-in hash() is salted per process for strings.
    digest = hashlib.sha256(f"{system_prompt}\n{query}".encode("utf-8")).hexdigest()
    cache.setex(f"semantic_cache:{digest}", CACHE_TTL_SECONDS, json.dumps({
        "query": query,
        "system_prompt": system_prompt,
        "embedding": query_embedding,
        "response": result,
    }))
    return result
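
Exact-match response caching, the last item in the list above, is simpler than semantic caching and is often checked first. The sketch below keeps entries in an in-memory dict as a stand-in for Redis. The class, the normalization rule (lowercasing and collapsing whitespace), and the fake model function are all assumptions for illustration.

```python
import hashlib
import json
from typing import Callable


class ResponseCache:
    """Exact-match response cache keyed by a stable hash of model + prompt."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}   # stand-in for Redis in this sketch
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model: str, prompt: str) -> str:
        # Normalize case and whitespace so trivially different prompts share a key.
        normalized = " ".join(prompt.lower().split())
        payload = json.dumps({"model": model, "prompt": normalized}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str,
                    call_model: Callable[[str], str]) -> str:
        key = self.make_key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = call_model(prompt)
        return self._store[key]


cache = ResponseCache()
fake_model = lambda prompt: f"answer to: {prompt}"   # placeholder for a real model call
cache.get_or_call("small", "What is vLLM?", fake_model)
cache.get_or_call("small", "  what is  vLLM? ", fake_model)   # same key after normalization
print(cache.hits, cache.misses)
```

Lowercasing is a judgment call: it raises the hit rate for FAQ-style traffic but would be wrong for case-sensitive prompts such as code.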

Cost Optimization: Smart Model Routing

A powerful cost optimization strategy is routing requests to different models based on complexity. Use a cheap, fast model for simple tasks and an expensive, capable model only when needed:

Python — Complexity-Based Model Router
from anthropic import Anthropic
from openai import OpenAI
from enum import Enum

class Complexity(Enum):
    SIMPLE = "simple"      # FAQ, classification, extraction
    MODERATE = "moderate"  # Summarization, translation
    COMPLEX = "complex"    # Analysis, reasoning, code generation

# Cost per 1M tokens (input/output) - approximate
MODEL_TIERS = {
    Complexity.SIMPLE: {
        "model": "gpt-4o-mini",
        "cost_per_1m_input": 0.15,
        "cost_per_1m_output": 0.60,
    },
    Complexity.MODERATE: {
        "model": "claude-sonnet-4-20250514",
        "cost_per_1m_input": 3.00,
        "cost_per_1m_output": 15.00,
    },
    Complexity.COMPLEX: {
        "model": "claude-opus-4-20250514",
        "cost_per_1m_input": 15.00,
        "cost_per_1m_output": 75.00,
    },
}

class SmartRouter:
    def __init__(self):
        self.openai = OpenAI()
        self.anthropic = Anthropic()

    def classify_complexity(self, query: str) -> Complexity:
        """Use a cheap model to classify query complexity."""
        response = self.openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": "Classify the complexity: SIMPLE, MODERATE, or COMPLEX. Reply with one word only.",
            }, {
                "role": "user", "content": query,
            }],
            max_tokens=10,
        )
        label = response.choices[0].message.content.strip().lower()
        try:
            return Complexity(label)
        except ValueError:
            # Unrecognized label: fall back to the middle tier
            return Complexity.MODERATE

    def route(self, query: str) -> str:
        complexity = self.classify_complexity(query)
        tier = MODEL_TIERS[complexity]
        model = tier["model"]
        print(f"Routing to {model} (complexity: {complexity.value})")

        if "claude" in model:
            resp = self.anthropic.messages.create(
                model=model, max_tokens=4096,
                messages=[{"role": "user", "content": query}],
            )
            return resp.content[0].text
        else:
            resp = self.openai.chat.completions.create(
                model=model, max_tokens=4096,
                messages=[{"role": "user", "content": query}],
            )
            return resp.choices[0].message.content

router = SmartRouter()
# Simple query -> gpt-4o-mini (~$0.15 per 1M input tokens)
router.route("What is the capital of France?")
# Complex query -> claude-opus-4 (~$15 per 1M input tokens)
router.route("Analyze the trade-offs between microservices and monolithic architecture for a real-time trading system.")
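
To see what routing saves, the per-request cost can be computed from token counts and the approximate prices in MODEL_TIERS. This is a small sketch: the helper name and the 500-input / 300-output token workload are assumptions.

```python
PRICES_PER_1M = {  # (input, output) in USD, copied from the tiers above
    "gpt-4o-mini": (0.15, 0.60),
    "claude-sonnet-4-20250514": (3.00, 15.00),
    "claude-opus-4-20250514": (15.00, 75.00),
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request given token counts and per-1M-token prices."""
    price_in, price_out = PRICES_PER_1M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Assumed workload: 500 input and 300 output tokens per request.
cheap = request_cost_usd("gpt-4o-mini", 500, 300)
costly = request_cost_usd("claude-opus-4-20250514", 500, 300)
print(f"mini: ${cheap:.6f}  opus: ${costly:.6f}  ratio: {costly / cheap:.0f}x")
```

The classification call adds a small fixed cost to every request, so routing pays off only when a meaningful share of traffic ends up on the cheaper tiers.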

Latency Budgets in Multi-Model Pipelines

When chaining multiple model calls, each step consumes part of your total latency budget. A typical pipeline might have a 5-second budget:

  • Embedding (50ms): Embed user query for retrieval
  • Vector search (20ms): Find relevant documents
  • Reranking (200ms): Rerank retrieved documents
  • LLM generation (3000ms): Generate the final response
  • Safety check (500ms): Validate output with a classifier
  • Buffer (1230ms): Network overhead, queue wait times

Optimization priority: Focus on the slowest step first. In most pipelines, LLM generation dominates latency. Streaming improves perceived latency: users see the first tokens as soon as they are generated instead of waiting for the full response.
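
The budget above can be checked mechanically so that a new pipeline stage cannot silently push the total past the limit. A minimal sketch using the stage allocations from the list; the function and stage key names are illustrative.

```python
BUDGET_MS = 5000

# Per-stage allocations from the list above (milliseconds).
stages = {
    "embedding": 50,
    "vector_search": 20,
    "reranking": 200,
    "llm_generation": 3000,
    "safety_check": 500,
    "buffer": 1230,
}


def check_budget(stages: dict[str, int], budget_ms: int) -> tuple[int, str]:
    """Return the total allocated time and the slowest non-buffer stage."""
    total = sum(stages.values())
    if total > budget_ms:
        raise ValueError(f"allocated {total} ms exceeds budget of {budget_ms} ms")
    slowest = max((name for name in stages if name != "buffer"),
                  key=stages.__getitem__)
    return total, slowest


total, slowest = check_budget(stages, BUDGET_MS)
print(f"{total} ms allocated; optimize '{slowest}' first")
```

In production, the same check could run against measured P95 latencies per stage instead of planned allocations.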