Scaling AI Systems
Scaling AI is not just "add more servers." GPUs are expensive, model loading is slow, and naive scaling leaves GPUs 80% idle while you keep paying for them. This lesson covers the patterns that let you handle 10x traffic without 10x cost.
Horizontal vs. Vertical Scaling for AI
| Strategy | What It Means for AI | When to Use | Cost Impact |
|---|---|---|---|
| Vertical (bigger GPU) | Move from A10G (24GB) to A100 (80GB) to H100 (80GB) | Model does not fit in memory, or need faster single-request latency | 2–5x cost per GPU, but may need fewer GPUs |
| Horizontal (more GPUs) | Add more replicas of the same model on identical GPUs | Need more throughput, model fits on single GPU | Linear cost increase with linear throughput increase |
| Model optimization | Quantize (FP16/INT8), distill, prune to make model smaller/faster | Always try this first — free throughput gains | Reduces cost per request by 2–4x |
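To make the horizontal-scaling arithmetic concrete, here is a minimal capacity-planning sketch. The per-GPU throughput, GPU hourly price, and the `plan_horizontal_scaling` function itself are illustrative assumptions, not benchmarks:

```python
import math

def plan_horizontal_scaling(target_qps: float,
                            per_gpu_qps: float,
                            gpu_hourly_cost: float,
                            headroom: float = 0.3) -> dict:
    """Estimate replica count and monthly cost for horizontal scaling.

    headroom reserves extra capacity for traffic spikes (30% by default),
    since running replicas at 100% of measured throughput leaves no slack.
    """
    required_qps = target_qps * (1 + headroom)
    replicas = math.ceil(required_qps / per_gpu_qps)
    monthly_cost = replicas * gpu_hourly_cost * 24 * 30  # 30-day month
    return {"replicas": replicas, "monthly_cost": round(monthly_cost, 2)}

# Illustrative numbers: 100 QPS target, 40 QPS per replica, $0.75/hr per GPU
plan = plan_horizontal_scaling(target_qps=100, per_gpu_qps=40,
                               gpu_hourly_cost=0.75)
```

Because cost grows linearly with replicas, every optimization that raises per-GPU throughput (batching, quantization, caching) divides this bill directly.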
Caching Strategies for AI Systems
Caching is the most cost-effective scaling strategy for AI. If you can serve 30% of requests from cache, you need 30% fewer GPUs.
Result Caching
- What: Cache the full model output for a given input.
- Hit rate: 10–60% depending on input diversity.
- Best for: recommendation APIs (same user, same context = same results for 5 min), classification of common inputs, FAQ-style chatbot queries.
- Implementation: hash the input, store the result in Redis with a TTL.
Embedding Caching
- What: Cache computed embeddings so they do not need to be recomputed.
- Hit rate: 80–95% for product/document embeddings.
- Best for: semantic search, RAG systems, recommendation candidate retrieval.
- Implementation: precompute all item embeddings, store in FAISS/Pinecone, update daily.
Prompt Caching
- What: Cache the KV-cache for common prompt prefixes in LLMs.
- Hit rate: 40–80% when system prompts are shared.
- Best for: LLM applications where many requests share the same system prompt.
- Implementation: vLLM prefix caching, or the Anthropic/OpenAI prompt caching APIs (OpenAI discounts cached prompt tokens automatically, roughly 50%; Anthropic requires explicit cache breakpoints and discounts cache reads more deeply).
Feature Caching
- What: Cache computed features to avoid recomputing them from raw data.
- Hit rate: 90%+ for user features.
- Best for: expensive feature computation (complex aggregations, joins).
- Implementation: feature store with an online serving layer (Redis).
```python
# Multi-layer caching for AI inference
import hashlib
import json

import numpy as np
from redis import Redis

class CachedInference:
    def __init__(self, model, redis_client: Redis, metrics):
        self.model = model
        self.cache = redis_client
        self.metrics = metrics       # any client with an increment(name) method
        self.RESULT_TTL = 300        # 5 minutes for full results
        self.EMBEDDING_TTL = 86400   # 24 hours for embeddings

    async def predict(self, request):
        # Layer 1: check result cache (exact match on the full request)
        cache_key = f"result:{self._hash(request)}"
        cached = self.cache.get(cache_key)
        if cached:
            self.metrics.increment("cache.result.hit")
            return json.loads(cached)

        # Layer 2: check embedding cache (avoid recomputation)
        embedding_key = f"emb:{request.item_id}"
        cached_embedding = self.cache.get(embedding_key)
        if cached_embedding:
            self.metrics.increment("cache.embedding.hit")
            embedding = np.frombuffer(cached_embedding, dtype=np.float32)
        else:
            self.metrics.increment("cache.embedding.miss")
            embedding = self.model.encode(request.text)
            self.cache.setex(embedding_key, self.EMBEDDING_TTL,
                             embedding.tobytes())

        # Layer 3: run inference with the (possibly cached) embedding
        result = self.model.rank(embedding, request.context)

        # Store in result cache
        self.cache.setex(cache_key, self.RESULT_TTL, json.dumps(result))
        return result

    def _hash(self, request) -> str:
        """Deterministic hash of the request for use as a cache key."""
        normalized = json.dumps(request.dict(), sort_keys=True)
        return hashlib.sha256(normalized.encode()).hexdigest()[:16]
```
Batching Inference Requests
GPUs are optimized for parallel computation. Processing one request at a time wastes 80%+ of GPU capacity. Dynamic batching groups incoming requests and processes them together.
| Batch Size | Latency per Request | Throughput (req/s) | GPU Utilization | Cost per 1K Requests |
|---|---|---|---|---|
| 1 (no batching) | 10ms | 100 | 15% | $0.0028 |
| 4 | 15ms | 267 | 45% | $0.0010 |
| 8 | 22ms | 364 | 65% | $0.0008 |
| 16 | 35ms | 457 | 80% | $0.0006 |
| 32 | 55ms | 582 | 90% | $0.0005 |
| 64 | 100ms | 640 | 93% | $0.0004 |
Notice the trade-off: larger batches give better throughput and lower cost per request, but increase latency. Configure your batch size based on your latency budget.
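The mechanics can be sketched with asyncio: requests accumulate until either the batch fills up or a timeout elapses, then the whole batch runs as one forward pass. The `DynamicBatcher` class and its parameter names below are illustrative, not any specific framework's API:

```python
import asyncio

class DynamicBatcher:
    def __init__(self, model_fn, max_batch_size=32, batch_timeout_ms=10):
        self.model_fn = model_fn              # runs a whole batch at once
        self.max_batch_size = max_batch_size
        self.batch_timeout = batch_timeout_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, item):
        """Called per request; resolves when the batch containing it runs."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Background loop: collect a batch, run it, resolve the futures."""
        while True:
            item, future = await self.queue.get()
            batch, futures = [item], [future]
            deadline = asyncio.get_running_loop().time() + self.batch_timeout
            # Keep collecting until the batch is full or the deadline passes
            while len(batch) < self.max_batch_size:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    item, future = await asyncio.wait_for(self.queue.get(),
                                                          timeout)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
                futures.append(future)
            # One forward pass for the whole batch
            for f, result in zip(futures, self.model_fn(batch)):
                f.set_result(result)
```

Note that `batch_timeout_ms` is the extra latency the first request in a batch can pay while waiting for companions, which is why it must stay well under your latency budget.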
Most serving frameworks expose settings such as max_batch_size=32 and batch_timeout_ms=10 to auto-batch without exceeding your latency budget.

Model Parallelism vs. Data Parallelism
When a single GPU is not enough — either the model does not fit in memory, or you need more throughput than one GPU can provide.
Data Parallelism
- How: The same model is copied to N GPUs; each GPU processes a different batch.
- Use when: the model fits on one GPU and you need more throughput.
- Scaling: near-linear up to 8–16 GPUs.
- Example: 4 GPUs with data parallelism ≈ 3.8x throughput (95% efficiency).
Tensor Parallelism
- How: Split individual layers across GPUs; each GPU computes part of every layer.
- Use when: the model is too large for one GPU, or you need low latency (all GPUs work on every request).
- Scaling: 2–8 GPUs (inter-GPU communication limits scaling).
- Example: Llama 70B across 4x A100s.
Pipeline Parallelism
- How: Split model layers sequentially across GPUs: GPU 1 runs layers 1–10, GPU 2 runs layers 11–20, and so on.
- Use when: training very large models.
- Scaling: good for training, less so for serving (pipeline bubbles leave GPUs idle between stages).
- Example: GPT-3-scale training across hundreds of GPUs.
Expert Parallelism
- How: For Mixture-of-Experts models, place different experts on different GPUs.
- Use when: serving MoE models such as Mixtral.
- Scaling: matches the number of experts to GPUs.
- Example: Mixtral 8x7B with 8 experts across 8 GPUs.
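The core idea of tensor parallelism can be shown with a toy column-parallel matrix multiply: each "GPU" owns a slice of the weight matrix's columns, computes its slice of the output, and the slices are concatenated. A real implementation would run each shard on its own device and use an all-gather; this pure-Python sketch is only illustrative:

```python
def matmul(x, w):
    """Plain matrix multiply: x is (m, k), w is (k, n), as nested lists."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_columns(w, num_shards):
    """Give each 'GPU' a contiguous slice of w's columns.

    Assumes the column count divides evenly by num_shards.
    """
    step = len(w[0]) // num_shards
    return [[row[s * step:(s + 1) * step] for row in w]
            for s in range(num_shards)]

def column_parallel_matmul(x, w, num_shards):
    shards = split_columns(w, num_shards)
    # Each 'GPU' computes its own output-column slice independently
    partials = [matmul(x, shard) for shard in shards]
    # "All-gather": concatenate the output slices row by row
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

The shards never need each other's results until the final concatenation, which is why this scheme parallelizes well but is bounded by that inter-GPU communication step.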
Queue-Based Architecture for Async Processing
Pattern: API → Queue → Worker Pool → Result Store. Benefits:
- Absorb traffic spikes without dropping requests
- Decouple API latency from inference latency
- Scale workers independently based on queue depth
- Retry failed jobs automatically
Architecture:

```
Client ──→ API Server ──→ Queue ──→ ┌──────────────┐ ──→ Result Store ──→ Client polls
           (fast,         (Redis/   │ GPU Worker 1 │     (Redis/S3)
            no GPU)        SQS/     │ GPU Worker 2 │
                           Kafka)   │ GPU Worker 3 │
                                    │     ...      │
                                    └──────────────┘
```
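A minimal in-process version of this pattern, using Python's standard-library `queue` and `threading` modules in place of Redis/SQS and real GPU workers (the `infer`, `submit`, and `results` names are illustrative):

```python
import queue
import threading

job_queue: queue.Queue = queue.Queue()
results = {}                        # stands in for the Redis/S3 result store
results_lock = threading.Lock()

def infer(payload):
    """Placeholder for the GPU-bound model call."""
    return {"score": len(payload)}

def worker():
    """GPU worker: pull jobs until a None 'poison pill' arrives."""
    while True:
        job = job_queue.get()
        if job is None:
            job_queue.task_done()
            break
        job_id, payload = job
        out = infer(payload)
        with results_lock:
            results[job_id] = out   # client later polls this store by job_id
        job_queue.task_done()

def submit(job_id, payload):
    """API server side: enqueue and return immediately (no GPU work here)."""
    job_queue.put((job_id, payload))
    return {"job_id": job_id, "status": "queued"}
```

Because `submit` only enqueues, the API responds in microseconds regardless of how long inference takes, and a spike simply deepens the queue instead of timing out requests.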
Auto-scaling rules for workers:
- Scale up: queue depth > 100 for 2 minutes → add 2 workers
- Scale down: queue depth = 0 for 10 minutes → remove 1 worker
- Min workers: 2 (always warm)
- Max workers: 50 (budget cap)
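These rules can be encoded directly. The sketch below only decides the worker delta; actually adding or removing workers is left to the orchestrator, and the function name and sampling cadence are assumptions:

```python
def autoscale_decision(queue_depth_samples, current_workers,
                       min_workers=2, max_workers=50):
    """Return the worker delta implied by recent queue-depth samples.

    queue_depth_samples: one sample per minute, most recent last.
    """
    # Scale up: depth > 100 for 2 consecutive minutes -> +2 workers
    if len(queue_depth_samples) >= 2 and all(
            d > 100 for d in queue_depth_samples[-2:]):
        return min(2, max_workers - current_workers)   # respect budget cap
    # Scale down: depth == 0 for 10 consecutive minutes -> -1 worker
    if len(queue_depth_samples) >= 10 and all(
            d == 0 for d in queue_depth_samples[-10:]):
        return max(-1, min_workers - current_workers)  # stay warm at the min
    return 0
```

The asymmetry is deliberate: scale up fast (2 workers after 2 minutes) because queued users are waiting, and scale down slowly (1 worker after 10 quiet minutes) because cold-starting a GPU worker is expensive.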
Queue monitoring alerts:
- WARNING: avg wait time > 30 seconds
- CRITICAL: avg wait time > 2 minutes
- CRITICAL: queue depth > 10,000
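Similarly, the alert thresholds reduce to a small function that reports the highest severity triggered (the function and the severity strings are illustrative):

```python
def queue_alert(avg_wait_seconds: float, queue_depth: int) -> str:
    """Map queue health to an alert level per the thresholds above."""
    if avg_wait_seconds > 120 or queue_depth > 10_000:
        return "CRITICAL"
    if avg_wait_seconds > 30:
        return "WARNING"
    return "OK"
```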
Real Numbers: Cost/Latency Trade-offs at Different Scales
| Scale | QPS | Architecture | GPUs | Monthly Cost | Cost per 1K Requests |
|---|---|---|---|---|---|
| Startup | 10 | Single GPU, no batching | 1x A10G | $540 | $0.021 |
| Growth | 100 | 3 GPUs + batching + caching | 3x A10G | $1,620 | $0.006 |
| Scale | 1,000 | 12 GPUs + batching + 40% cache hit | 12x A10G | $6,480 | $0.0025 |
| Enterprise | 10,000 | 50 GPUs + batching + 60% cache + multi-tier | 50x A10G | $27,000 | $0.0010 |
| Hyperscale | 100,000 | 200 GPUs + all optimizations + spot instances | 200x A10G | $72,000 | $0.0003 |
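Dividing monthly cost by monthly request volume shows the cost figures are per 1,000 requests. A quick check, assuming a 30-day month:

```python
def cost_per_1k_requests(monthly_cost: float, qps: float) -> float:
    """Cost per 1,000 requests, assuming a 30-day month of steady traffic."""
    requests_per_month = qps * 86_400 * 30
    return monthly_cost / requests_per_month * 1000

# Startup row: $540/month at 10 QPS -> about $0.021 per 1K requests
startup = cost_per_1k_requests(540, 10)
```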
Lilly Tech Systems