Model Serving & Infrastructure
Multi-model applications need robust infrastructure for serving multiple AI models simultaneously. This lesson covers the full spectrum — from API-based services to self-hosted GPU inference, optimization techniques, and cost management.
API-Based vs Self-Hosted Serving
The first architectural decision is whether to use managed API services or host models yourself:
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| API-based (OpenAI, Anthropic, Google) | Zero infrastructure, instant scaling, latest models | Per-token costs, data leaves your network, rate limits | Most production apps, prototyping |
| Self-hosted (vLLM, TGI, Triton) | Full control, data privacy, fixed costs at scale | GPU costs, ops complexity, model updates are manual | High-volume, regulated industries, custom models |
| Hybrid | Best of both — route by task complexity | More complex routing logic | Cost-optimized production systems |
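The "fixed costs at scale" trade-off in the table can be made concrete with a rough break-even estimate. The sketch below is only an illustration: the function name `breakeven_tokens_per_month` is hypothetical, and every price in it is a placeholder rather than a quote from any provider.

```python
def breakeven_tokens_per_month(gpu_hourly_usd: float,
                               num_gpus: int,
                               api_usd_per_1m_tokens: float,
                               hours_per_month: float = 730.0) -> float:
    """Monthly token volume at which self-hosting costs the same as an API.

    Ignores ops staffing, idle capacity, and GPU throughput limits.
    """
    if api_usd_per_1m_tokens <= 0:
        raise ValueError("API price must be positive")
    monthly_gpu_cost = gpu_hourly_usd * num_gpus * hours_per_month
    return monthly_gpu_cost / api_usd_per_1m_tokens * 1_000_000


# Placeholder numbers: one GPU at $2/hour vs an API at $1 per 1M tokens.
tokens = breakeven_tokens_per_month(gpu_hourly_usd=2.0, num_gpus=1,
                                    api_usd_per_1m_tokens=1.0)
print(f"Break-even: {tokens / 1e9:.2f}B tokens/month")
```

Below the break-even volume the API is cheaper; above it, self-hosting starts to pay off, provided the GPUs can actually sustain that throughput.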
Self-Hosted Inference Engines
vLLM
vLLM is one of the most widely used open-source LLM inference engines. Its key innovation is PagedAttention, which manages the KV cache in fixed-size blocks, much as an operating system pages virtual memory. Less GPU memory is lost to fragmentation, so more requests fit in a batch and throughput rises.
```python
# Install: pip install vllm fastapi uvicorn

# Option 1: Use vLLM's built-in OpenAI-compatible server
# Terminal:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-3.1-8B-Instruct \
#       --tensor-parallel-size 1 \
#       --max-model-len 8192 \
#       --port 8000

# Option 2: Custom FastAPI wrapper for multi-model serving
import time

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI(title="Multi-Model Inference API")

# Load multiple models on startup
models = {}


@app.on_event("startup")
async def load_models():
    # Small model for simple tasks (classification, extraction)
    models["small"] = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        tensor_parallel_size=1,
        max_model_len=4096,
        gpu_memory_utilization=0.4,  # Use 40% of GPU memory
    )
    # Large model for complex reasoning.
    # Note: if this model shares a GPU with the small one, the two
    # gpu_memory_utilization fractions must sum to less than 1.0.
    models["large"] = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        tensor_parallel_size=4,  # Spread across 4 GPUs
        max_model_len=8192,
        gpu_memory_utilization=0.9,
    )


class CompletionRequest(BaseModel):
    prompt: str
    model_size: str = "small"  # "small" or "large"
    max_tokens: int = 1024
    temperature: float = 0.7
    top_p: float = 0.95


class CompletionResponse(BaseModel):
    text: str
    model: str
    tokens_generated: int
    latency_ms: float


@app.post("/v1/completions", response_model=CompletionResponse)
async def generate(request: CompletionRequest):
    if request.model_size not in models:
        raise HTTPException(status_code=400, detail=f"Unknown model: {request.model_size}")

    start = time.perf_counter()
    sampling_params = SamplingParams(
        max_tokens=request.max_tokens,
        temperature=request.temperature,
        top_p=request.top_p,
    )
    # LLM.generate is a blocking call, so requests to this endpoint are
    # processed one at a time.
    outputs = models[request.model_size].generate([request.prompt], sampling_params)
    elapsed = (time.perf_counter() - start) * 1000

    completion = outputs[0].outputs[0]
    return CompletionResponse(
        text=completion.text,
        model=request.model_size,
        tokens_generated=len(completion.token_ids),
        latency_ms=round(elapsed, 2),
    )

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
```
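To make the paging analogy behind PagedAttention more tangible, here is a toy block allocator. It is a sketch of the general idea only, not vLLM's actual implementation; the class `PagedKVCache` and its methods are hypothetical names invented for this example.

```python
class PagedKVCache:
    """Toy block allocator: sequences receive fixed-size blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))      # physical block ids
        self.block_tables: dict[str, list[int]] = {}    # seq id -> block ids
        self.seq_lens: dict[str, int] = {}              # seq id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block if needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:               # no room in last block
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            block = self.free_blocks.pop()
            self.block_tables.setdefault(seq_id, []).append(block)
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):            # 20 tokens need 2 blocks of 16
    cache.append_token("a")
for _ in range(5):             # 5 tokens need 1 block
    cache.append_token("b")
blocks_in_use = sum(len(table) for table in cache.block_tables.values())
cache.free("a")                # sequence "a" finished; its blocks are reusable
```

Because blocks are allocated only as a sequence grows, memory is never reserved up front for the maximum possible length, and freed blocks can go to any other sequence.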
Text Generation Inference (TGI)
TGI by Hugging Face is a production-ready inference server with built-in support for continuous batching, token streaming, and quantization.
```bash
# Pull and run TGI with NVIDIA GPU support.
# Note: --quantize awq expects a checkpoint that was already quantized with AWQ;
# drop the flag or point --model-id at an AWQ checkpoint otherwise.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --quantize awq \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 4096

# Query the server (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain PagedAttention in 3 sentences"}],
    "max_tokens": 256,
    "stream": true
  }'
```
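With `"stream": true`, an OpenAI-compatible endpoint returns server-sent events rather than a single JSON body. The sketch below shows one way to pull the text out of such a stream. It assumes OpenAI-style chunks (`choices[0].delta.content`) and an optional `[DONE]` sentinel; the helper name `iter_stream_text` and the sample lines are hypothetical stand-ins for a real HTTP response.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                          # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":               # end-of-stream sentinel, if sent
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = (choices[0].get("delta") or {}).get("content")
        if delta:
            yield delta


# Hand-written sample lines standing in for a real response body.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": "Paged"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Attention"}}]}',
    "data: [DONE]",
]
text = "".join(iter_stream_text(sample))
print(text)
```

In a real client the `lines` argument would come from the HTTP library's line iterator over the streaming response.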
BentoML Multi-Model Service
BentoML excels at packaging multiple models into a single deployable service with independent scaling:
```python
# service.py - Define a multi-model service
import bentoml
import numpy as np

# EmbeddingService, LLMService, and RerankerService are assumed to be
# @bentoml.service classes defined or imported elsewhere in this project.


@bentoml.service(
    resources={"gpu": 1, "memory": "16Gi"},
    traffic={"timeout": 300},
)
class MultiModelService:
    # Embedding model for vector search
    embedding_model = bentoml.depends(EmbeddingService)
    # LLM for text generation
    llm = bentoml.depends(LLMService)
    # Reranker for improving retrieval quality
    reranker = bentoml.depends(RerankerService)

    @bentoml.api
    async def rag_query(self, query: str, documents: list[str]) -> dict:
        # Step 1: Embed the query
        query_embedding = await self.embedding_model.embed(query)

        # Step 2: Embed all documents
        doc_embeddings = await self.embedding_model.embed_batch(documents)

        # Step 3: Find top-k similar documents
        similarities = np.dot(np.asarray(doc_embeddings), np.asarray(query_embedding))
        top_indices = np.argsort(similarities)[-10:][::-1]
        candidates = [documents[i] for i in top_indices]

        # Step 4: Rerank candidates
        reranked = await self.reranker.rerank(query, candidates)
        top_docs = reranked[:3]

        # Step 5: Generate answer with LLM
        context = "\n".join(top_docs)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        answer = await self.llm.generate(prompt)
        return {"answer": answer, "sources": top_docs}


# bentofile.yaml
# service: "service:MultiModelService"
# include: ["*.py"]
# python:
#   packages: [torch, transformers, numpy, vllm]
# docker:
#   gpu: true

# Build and deploy:
#   bentoml build
#   bentoml serve .
#   bentoml containerize multi_model_service:latest
```
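Step 3 of `rag_query` ranks documents by raw dot product, which matches cosine similarity only when the embeddings are unit-normalized. The sketch below normalizes explicitly before ranking. The helper `top_k_cosine` and the two-dimensional vectors are made up for illustration, and zero-length vectors are not handled.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k: int = 3):
    """Return (indices, scores) of the k rows most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_matrix, dtype=float)
    q = q / np.linalg.norm(q)                          # unit-length query
    d = d / np.linalg.norm(d, axis=1, keepdims=True)   # unit-length rows
    scores = d @ q                                     # cosine similarity per row
    order = np.argsort(scores)[::-1][:k]               # highest score first
    return order.tolist(), scores[order].tolist()


docs = [[1.0, 0.0], [0.0, 1.0], [10.0, 1.0], [-1.0, 0.0]]
idx, scores = top_k_cosine([1.0, 0.0], docs, k=2)
print(idx, scores)
```

With an unnormalized dot product, the long vector `[10.0, 1.0]` would outrank the exact match `[1.0, 0.0]`; after normalization the exact match comes first.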
Serving Solutions Comparison
| Solution | Type | GPU Support | Scaling | Cost | Best For |
|---|---|---|---|---|---|
| vLLM | Self-hosted | NVIDIA, AMD | Manual / K8s | GPU cost only | High-throughput LLM inference |
| TGI | Self-hosted | NVIDIA, AMD | Manual / K8s | GPU cost only | Hugging Face model deployment |
| Triton | Self-hosted | NVIDIA | K8s + Triton | GPU cost only | Multi-framework model serving |
| BentoML | Self-hosted / Cloud | NVIDIA | BentoCloud | Free / Cloud pricing | Multi-model packaging |
| Modal | Serverless GPU | NVIDIA (A100, H100) | Automatic | Per-second billing | Bursty workloads |
| Replicate | Serverless GPU | NVIDIA | Automatic | Per-prediction | Quick deployment of open models |
| Together AI | API | Managed | Automatic | Per-token | Open-model API access |
| RunPod | GPU Cloud | NVIDIA | Manual | Per-hour GPU | Cost-effective GPU rental |
Model Optimization Techniques
Running large models efficiently requires optimization. The main techniques reduce memory usage and increase throughput:
Quantization
Quantization reduces the precision of model weights from 16-bit to 8-bit or 4-bit, cutting weight memory to roughly one half or one quarter:
```python
# Note: the repository IDs below are illustrative; check the Hugging Face Hub
# for a quantized checkpoint that actually exists for your model.

# GPTQ - Post-training quantization (calibration-based)
#   Best quality, slower quantization process
#   Use with: vLLM, TGI, transformers
#   Example: TheBloke/Llama-3.1-8B-Instruct-GPTQ

# AWQ - Activation-aware Weight Quantization
#   Fast inference, good quality, smaller files
#   Use with: vLLM, TGI
#   Example: TheBloke/Llama-3.1-8B-Instruct-AWQ

# GGUF - Format for llama.cpp (CPU + GPU inference)
#   Runs on consumer hardware, CPU offloading
#   Use with: llama.cpp, Ollama, LM Studio
#   Example: TheBloke/Llama-3.1-8B-Instruct-GGUF (Q4_K_M variant)

# Memory comparison for Llama 3.1 8B:
#   FP16:   ~16 GB VRAM
#   INT8:   ~8 GB VRAM (GPTQ-8bit)
#   INT4:   ~4 GB VRAM (AWQ-4bit / GPTQ-4bit)
#   Q4_K_M: ~5 GB RAM (GGUF, can run on CPU)

# Loading a GPTQ model with vLLM
from vllm import LLM

model = LLM(
    model="TheBloke/Llama-3.1-8B-Instruct-GPTQ",
    quantization="gptq",
    dtype="half",
    max_model_len=4096,
)

# Loading an AWQ model with vLLM
model_awq = LLM(
    model="TheBloke/Llama-3.1-8B-Instruct-AWQ",
    quantization="awq",
    max_model_len=4096,
)
```
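To show what "reducing precision" means at the tensor level, here is a toy symmetric int8 quantizer in NumPy. It is a deliberately naive sketch: GPTQ and AWQ use calibration data and per-group scales, which this example does not attempt. The function names and the random weight matrix are hypothetical.

```python
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Symmetric absmax quantization: map floats onto the range [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale) -> np.ndarray:
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w16 = rng.standard_normal((256, 256)).astype(np.float16)   # stand-in weights
w32 = w16.astype(np.float32)

q, scale = quantize_int8(w32)
max_err = float(np.abs(dequantize(q, scale) - w32).max())

print(f"FP16 bytes: {w16.nbytes}, INT8 bytes: {q.nbytes}")
print(f"max abs error: {max_err:.4f} (scale = {float(scale):.4f})")
```

The int8 copy takes half the bytes of the FP16 original, and each weight is off by at most about half a quantization step.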
GPU Management
Efficient GPU utilization is critical for cost-effective multi-model serving:
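- Tensor Parallelism: Split a single large model across multiple GPUs. A 70B model in FP16 holds roughly 140 GB of weights, so it needs at least two 80 GB GPUs; four A100 80GB GPUs is a common choice because it leaves room for the KV cache.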
- GPU Sharing: Run multiple small models on one GPU by limiting gpu_memory_utilization per model (e.g., 0.3 each for 3 models).
- NVIDIA MPS (Multi-Process Service): Allows multiple processes to share a single GPU with minimal overhead.
- NVIDIA MIG (Multi-Instance GPU): Partition an A100/H100 into isolated GPU instances for different models.
```bash
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Run with specific GPU allocation
docker run --gpus '"device=0,1"' -p 8000:8000 my-model-server

# Check GPU utilization inside container
docker exec -it my-container nvidia-smi
```
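The tensor-parallelism sizing mentioned in the list above comes down to simple arithmetic. The sketch below estimates a GPU count from weight size plus a KV-cache allowance. The function name, the 0.9 usable fraction, and the 40 GB KV-cache figure are all assumptions chosen for illustration. Rounding up to a power of two is a simplification that reflects the usual requirement that the tensor-parallel degree divide the model's attention head count evenly.

```python
import math


def estimate_tensor_parallel_size(num_params: float,
                                  bytes_per_param: float,
                                  gpu_memory_gb: float,
                                  kv_cache_gb: float = 0.0,
                                  usable_fraction: float = 0.9) -> int:
    """Estimate how many GPUs are needed to hold weights plus KV cache."""
    weights_gb = num_params * bytes_per_param / 1e9
    usable_per_gpu = gpu_memory_gb * usable_fraction
    needed = max(1, math.ceil((weights_gb + kv_cache_gb) / usable_per_gpu))
    return 1 << (needed - 1).bit_length()      # round up to a power of two


# 70B parameters in FP16 (2 bytes each) on 80 GB GPUs
weights_only = estimate_tensor_parallel_size(70e9, 2, 80)
with_kv_cache = estimate_tensor_parallel_size(70e9, 2, 80, kv_cache_gb=40)
print(weights_only, with_kv_cache)
```

Weights alone fit on two GPUs in this estimate; reserving space for the KV cache pushes the practical choice to four.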
Autoscaling Strategies
Different scaling triggers work best for different serving patterns:
- Request-based: Scale up when requests per second exceed a threshold. Simple but can lag behind demand spikes.
- Queue-based: Scale based on pending request queue depth. Better for bursty workloads — scale when queue grows, scale down when empty.
- GPU utilization: Scale when GPU compute utilization exceeds 80%. Works well for steady workloads.
- Latency-based: Scale up when P95 latency exceeds your SLA target. Directly tied to user experience.
```yaml
# Horizontal Pod Autoscaler for model serving
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-serving
  minReplicas: 1
  maxReplicas: 8
  metrics:
    # Scale on custom GPU utilization metric
    - type: Pods
      pods:
        metric:
          name: gpu_utilization_percent
        target:
          type: AverageValue
          averageValue: "80"
    # Also scale on request queue depth
    - type: External
      external:
        metric:
          name: request_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
```
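For a single metric, the Horizontal Pod Autoscaler computes its target as `ceil(currentReplicas * currentMetric / targetMetric)` and skips scaling when the ratio is within a small tolerance of 1. The sketch below reproduces that calculation with the replica bounds and the GPU target from the manifest above. It ignores stabilization windows and scaling policies, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 8,
                     tolerance: float = 0.1) -> int:
    """Replica count an HPA-style controller would aim for on one metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:          # close enough: do not scale
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


scale_up = desired_replicas(2, current_metric=95, target_metric=80)
scale_down = desired_replicas(4, current_metric=30, target_metric=80)
capped = desired_replicas(8, current_metric=160, target_metric=80)
steady = desired_replicas(3, current_metric=84, target_metric=80)
print(scale_up, scale_down, capped, steady)
```

When several metrics are configured, as in the manifest, the autoscaler evaluates each one and uses the largest resulting replica count.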
Caching Strategies
Caching is one of the highest-leverage optimizations for multi-model systems:
- Prompt caching: Provider-level feature (Anthropic, OpenAI) that caches the processed prompt prefix, reducing latency and cost for repeated system prompts.
- KV cache: GPU-level cache of key-value attention states. vLLM's PagedAttention optimizes this automatically.
- Semantic caching: Cache responses by embedding similarity. If a new query is semantically similar to a cached one, return the cached response.
- Response caching: Simple key-value cache (Redis) for exact-match queries. Eliminates model calls entirely for repeated questions.
```python
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)

SIMILARITY_THRESHOLD = 0.95  # How similar queries must be to hit cache


def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


def cosine_similarity(a: list, b: list) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def cached_completion(query: str, system_prompt: str = "") -> str:
    # Step 1: Embed the query
    query_embedding = get_embedding(query)

    # Step 2: Check cache for similar queries.
    # scan_iter iterates incrementally instead of blocking Redis like KEYS.
    # This is still a linear scan; large caches need a vector index.
    for key in cache.scan_iter("semantic_cache:*"):
        raw = cache.get(key)
        if raw is None:  # entry expired between the scan and the read
            continue
        entry = json.loads(raw)
        if entry.get("system_prompt", "") != system_prompt:
            continue  # never reuse an answer produced under another system prompt
        similarity = cosine_similarity(query_embedding, entry["embedding"])
        if similarity >= SIMILARITY_THRESHOLD:
            print(f"Cache hit! Similarity: {similarity:.3f}")
            return entry["response"]

    # Step 3: Cache miss - call the model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    result = response.choices[0].message.content

    # Step 4: Store in cache with TTL.
    # hashlib gives a key that is stable across processes; the built-in hash()
    # is randomized per interpreter run for strings.
    digest = hashlib.sha256(f"{system_prompt}\n{query}".encode("utf-8")).hexdigest()
    cache_key = f"semantic_cache:{digest}"
    cache.setex(cache_key, 3600, json.dumps({
        "query": query,
        "system_prompt": system_prompt,
        "embedding": query_embedding,
        "response": result,
    }))
    return result
```
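The exact-match response cache from the list above is simpler than the semantic variant, because it needs only a deterministic key. The sketch below uses a plain dictionary as a stand-in for Redis so that it runs without a server. The helper names, the normalization rule (lowercasing and collapsing whitespace), and the model name `model-a` are assumptions made for this example.

```python
import hashlib
import json

store: dict[str, str] = {}   # stand-in for Redis in this sketch


def response_cache_key(model: str, system_prompt: str, query: str) -> str:
    """Build a deterministic key from everything that affects the answer."""
    normalized = " ".join(query.lower().split())
    payload = json.dumps(
        {"model": model, "system": system_prompt, "query": normalized},
        sort_keys=True,
    )
    return "response_cache:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(model: str, system_prompt: str, query: str, generate) -> str:
    """Return the cached answer, calling generate(query) only on a miss."""
    key = response_cache_key(model, system_prompt, query)
    if key not in store:
        store[key] = generate(query)
    return store[key]


calls = []


def fake_generate(q: str) -> str:   # placeholder for a real model call
    calls.append(q)
    return f"answer to: {q}"


first = cached_call("model-a", "", "What is  vLLM?", fake_generate)
second = cached_call("model-a", "", "what is vllm?", fake_generate)
print(len(calls), first == second)
```

Including the model name and system prompt in the key prevents an answer generated under one configuration from being served under another.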
Cost Optimization: Smart Model Routing
A powerful cost optimization strategy is routing requests to different models based on complexity. Use a cheap, fast model for simple tasks and an expensive, capable model only when needed:
```python
from enum import Enum

from anthropic import Anthropic
from openai import OpenAI


class Complexity(Enum):
    SIMPLE = "simple"      # FAQ, classification, extraction
    MODERATE = "moderate"  # Summarization, translation
    COMPLEX = "complex"    # Analysis, reasoning, code generation


# Cost per 1M tokens (input/output) - approximate
MODEL_TIERS = {
    Complexity.SIMPLE: {
        "model": "gpt-4o-mini",
        "cost_per_1m_input": 0.15,
        "cost_per_1m_output": 0.60,
    },
    Complexity.MODERATE: {
        "model": "claude-sonnet-4-20250514",
        "cost_per_1m_input": 3.00,
        "cost_per_1m_output": 15.00,
    },
    Complexity.COMPLEX: {
        "model": "claude-opus-4-20250514",
        "cost_per_1m_input": 15.00,
        "cost_per_1m_output": 75.00,
    },
}


class SmartRouter:
    def __init__(self):
        self.openai = OpenAI()
        self.anthropic = Anthropic()

    def classify_complexity(self, query: str) -> Complexity:
        """Use a cheap model to classify query complexity."""
        response = self.openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Classify the complexity: SIMPLE, MODERATE, or COMPLEX. "
                               "Reply with one word only.",
                },
                {"role": "user", "content": query},
            ],
            max_tokens=10,
        )
        label = (response.choices[0].message.content or "").strip().lower()
        valid = {c.value for c in Complexity}
        # Fall back to MODERATE when the classifier returns an unexpected label
        return Complexity(label) if label in valid else Complexity.MODERATE

    def route(self, query: str) -> str:
        complexity = self.classify_complexity(query)
        tier = MODEL_TIERS[complexity]
        model = tier["model"]
        print(f"Routing to {model} (complexity: {complexity.value})")

        if "claude" in model:
            resp = self.anthropic.messages.create(
                model=model,
                max_tokens=4096,
                messages=[{"role": "user", "content": query}],
            )
            return resp.content[0].text

        resp = self.openai.chat.completions.create(
            model=model,
            max_tokens=4096,
            messages=[{"role": "user", "content": query}],
        )
        return resp.choices[0].message.content


router = SmartRouter()

# Simple query -> gpt-4o-mini (~$0.15/1M input tokens)
router.route("What is the capital of France?")

# Complex query -> claude-opus-4 (~$15/1M input tokens)
router.route("Analyze the trade-offs between microservices and monolithic architecture for a real-time trading system.")
```
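To see how much routing can save, the sketch below estimates the average cost per request for a given traffic mix, using the approximate prices from `MODEL_TIERS`. The 70/25/5 mix and the token counts are hypothetical, and the small cost of the classification call itself is left out.

```python
# USD per 1M tokens as (input, output), copied from the tiers above
PRICES = {
    "simple": (0.15, 0.60),
    "moderate": (3.00, 15.00),
    "complex": (15.00, 75.00),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request on the given tier."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def blended_cost(mix: dict[str, float], input_tokens: int, output_tokens: int) -> float:
    """Average cost per request when traffic is split across tiers by `mix`."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * request_cost(tier, input_tokens, output_tokens)
               for tier, share in mix.items())


mix = {"simple": 0.70, "moderate": 0.25, "complex": 0.05}   # assumed traffic mix
routed = blended_cost(mix, input_tokens=1000, output_tokens=500)
all_complex = request_cost("complex", input_tokens=1000, output_tokens=500)
print(f"routed: ${routed:.6f}  all-complex: ${all_complex:.6f}")
```

Under these assumptions, routing brings the average request cost to roughly a tenth of sending everything to the most capable model.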
Latency Budgets in Multi-Model Pipelines
When chaining multiple model calls, each step consumes part of your total latency budget. A typical pipeline might have a 5-second budget:
- Embedding (50ms): Embed user query for retrieval
- Vector search (20ms): Find relevant documents
- Reranking (200ms): Rerank retrieved documents
- LLM generation (3000ms): Generate the final response
- Safety check (500ms): Validate output with a classifier
- Buffer (1230ms): Network overhead, queue wait times
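The budget above can be encoded as data and checked against measured stage timings. The sketch below is a minimal illustration: the stage keys, the helper `check_budget`, and the sample measurements are hypothetical.

```python
STAGE_BUDGET_MS = {
    "embedding": 50,
    "vector_search": 20,
    "reranking": 200,
    "llm_generation": 3000,
    "safety_check": 500,
    "buffer": 1230,
}
TOTAL_BUDGET_MS = 5000


def check_budget(measured_ms: dict[str, float]) -> dict[str, float]:
    """Return {stage: overrun_ms} for every stage that exceeded its budget."""
    return {
        stage: elapsed - STAGE_BUDGET_MS[stage]
        for stage, elapsed in measured_ms.items()
        if elapsed > STAGE_BUDGET_MS[stage]
    }


# Hypothetical measurements for one request
measured = {
    "embedding": 45,
    "vector_search": 30,
    "reranking": 180,
    "llm_generation": 3400,
    "safety_check": 450,
}
overruns = check_budget(measured)
within_total = sum(measured.values()) <= TOTAL_BUDGET_MS
print(overruns, within_total)
```

A request can overrun individual stages and still meet the overall target, because the buffer absorbs the excess; tracking both views shows where the buffer is being spent.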