
Scaling AI Systems

Scaling AI is not just "add more servers." GPUs are expensive, model loading is slow, and naive scaling leaves GPUs 80% idle while still burning money. This lesson covers the patterns that let you handle 10x traffic without 10x cost.

Horizontal vs. Vertical Scaling for AI

| Strategy | What It Means for AI | When to Use | Cost Impact |
|---|---|---|---|
| Vertical (bigger GPU) | Move from A10G (24 GB) to A100 (80 GB) to H100 (80 GB) | Model does not fit in memory, or you need faster single-request latency | 2–5x cost per GPU, but may need fewer GPUs |
| Horizontal (more GPUs) | Add more replicas of the same model on identical GPUs | Need more throughput; model fits on a single GPU | Linear cost increase with linear throughput increase |
| Model optimization | Quantize (FP16/INT8), distill, or prune to make the model smaller/faster | Always try this first; free throughput gains | Reduces cost per request by 2–4x |
💡
Rule of thumb: Before adding more GPUs, try: (1) quantization (FP32 → FP16 = 2x throughput, free), (2) dynamic batching (2–5x throughput), (3) TensorRT/torch.compile optimization (1.5–3x). Only after these optimizations should you add more replicas.
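These multipliers compound, which is why optimization usually beats adding replicas. A minimal sketch of the arithmetic, using the rule-of-thumb midpoints above (illustrative figures, not measurements):

```python
import math

def combined_speedup(multipliers):
    """Multiply independent throughput gains together."""
    total = 1.0
    for m in multipliers:
        total *= m
    return total

def replicas_needed(target_qps, base_qps_per_gpu, speedup):
    """GPUs required after optimization, rounded up."""
    return math.ceil(target_qps / (base_qps_per_gpu * speedup))

# FP16 quantization (2x) * dynamic batching (3x) * torch.compile (1.5x)
speedup = combined_speedup([2.0, 3.0, 1.5])
print(speedup)                              # 9.0
print(replicas_needed(900, 100, speedup))   # 1 GPU instead of 9
```

At a baseline of 100 req/s per GPU, a 9x stack of optimizations turns a 9-GPU deployment into a single GPU, before any horizontal scaling.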

Caching Strategies for AI Systems

Caching is the most cost-effective scaling strategy for AI. If you can serve 30% of requests from cache, you need 30% fewer GPUs.

Result Caching

What: Cache the full model output for a given input. Hit rate: 10–60% depending on input diversity. Best for: Recommendation APIs (same user, same context = same results for 5 min), classification of common inputs, FAQ-style chatbot queries. Implementation: Hash the input, store result in Redis with TTL.

Embedding Caching

What: Cache computed embeddings so they do not need to be recomputed. Hit rate: 80–95% for product/document embeddings. Best for: Semantic search, RAG systems, recommendation candidate retrieval. Implementation: Precompute all item embeddings, store in FAISS/Pinecone, update daily.

Prompt Caching

What: Cache the KV-cache for common prompt prefixes in LLMs. Hit rate: 40–80% when system prompts are shared. Best for: LLM applications where many requests share the same system prompt. Implementation: vLLM prefix caching, or the Anthropic/OpenAI prompt caching APIs (cached input tokens are billed at a significant discount).
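A toy illustration of why shared system prompts drive the hit rate: requests keyed on the same prefix reuse the expensive prefill work. Real engines such as vLLM cache KV tensors per token block; this sketch just caches a placeholder to show the mechanism.

```python
# Toy prefix cache: requests sharing a system-prompt prefix reuse the
# cached "prefill" result for that prefix. `prefill` is a stand-in for
# the expensive KV-cache computation.
import hashlib

prefix_cache = {}

def prefill(prompt: str):
    """Pretend this is the expensive KV-cache computation."""
    return f"kv:{len(prompt)} chars"

def generate(system_prompt: str, user_message: str) -> bool:
    """Returns True if the prefix cache was hit."""
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    hit = key in prefix_cache
    if not hit:
        prefix_cache[key] = prefill(system_prompt)
    kv = prefix_cache[key]   # reused by every request sharing this prefix
    # ...only user_message still needs prefill + decode...
    return hit

first = generate("You are a helpful assistant.", "What is RAG?")
second = generate("You are a helpful assistant.", "Explain batching.")
print(first, second)   # False True — the second request hits the cache
```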

Feature Caching

What: Cache computed features to avoid recomputing from raw data. Hit rate: 90%+ for user features. Best for: When feature computation is expensive (complex aggregations, joins). Implementation: Feature store with online serving layer (Redis).

# Multi-layer caching for AI inference
import hashlib
import json

import numpy as np
from redis import Redis

# Assumes a `metrics` client (e.g., a statsd/Datadog wrapper) is available.

class CachedInference:
    def __init__(self, model, redis_client: Redis):
        self.model = model
        self.cache = redis_client
        self.RESULT_TTL = 300      # 5 minutes for full results
        self.EMBEDDING_TTL = 86400  # 24 hours for embeddings

    async def predict(self, request):
        # Layer 1: Check result cache (exact match)
        cache_key = f"result:{self._hash(request)}"
        cached = self.cache.get(cache_key)
        if cached:
            metrics.increment("cache.result.hit")
            return json.loads(cached)

        # Layer 2: Check embedding cache (avoid recomputation)
        embedding_key = f"emb:{request.item_id}"
        cached_embedding = self.cache.get(embedding_key)
        if cached_embedding:
            metrics.increment("cache.embedding.hit")
            embedding = np.frombuffer(cached_embedding, dtype=np.float32)
        else:
            metrics.increment("cache.embedding.miss")
            embedding = self.model.encode(request.text)
            self.cache.setex(embedding_key, self.EMBEDDING_TTL,
                             embedding.tobytes())

        # Layer 3: Run inference with cached embedding
        result = self.model.rank(embedding, request.context)

        # Store in result cache
        self.cache.setex(cache_key, self.RESULT_TTL, json.dumps(result))
        return result

    def _hash(self, request) -> str:
        """Deterministic hash of request for cache key."""
        normalized = json.dumps(request.dict(), sort_keys=True)
        return hashlib.sha256(normalized.encode()).hexdigest()[:16]

Batching Inference Requests

GPUs are optimized for parallel computation. Processing one request at a time wastes 80%+ of GPU capacity. Dynamic batching groups incoming requests and processes them together.

| Batch Size | Latency per Request | Throughput (req/s) | GPU Utilization | Cost per 1K Requests |
|---|---|---|---|---|
| 1 (no batching) | 10 ms | 100 | 15% | $0.0028 |
| 4 | 15 ms | 267 | 45% | $0.0010 |
| 8 | 22 ms | 364 | 65% | $0.0008 |
| 16 | 35 ms | 457 | 80% | $0.0006 |
| 32 | 55 ms | 582 | 90% | $0.0005 |
| 64 | 100 ms | 640 | 93% | $0.0004 |

Notice the trade-off: larger batches give better throughput and lower cost per request, but increase latency. Configure your batch size based on your latency budget.
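One simple policy for this trade-off: pick the largest batch whose per-request latency still fits your budget. A sketch, using the (batch size, latency) pairs from the benchmark table above:

```python
# (batch_size, latency_ms) pairs from the benchmark table above.
PROFILES = [(1, 10), (4, 15), (8, 22), (16, 35), (32, 55), (64, 100)]

def pick_batch_size(latency_budget_ms: float) -> int:
    """Largest batch size whose latency fits the budget; falls back to 1."""
    best = 1
    for batch, latency in PROFILES:
        if latency <= latency_budget_ms:
            best = batch
    return best

print(pick_batch_size(50))    # 16 (35 ms fits the budget; 55 ms does not)
print(pick_batch_size(100))   # 64
```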

💡
Dynamic batching: Instead of fixed batch sizes, use dynamic batching (available in Triton, TorchServe, vLLM). It waits up to N milliseconds to collect requests, then processes whatever has accumulated. Set max_batch_size=32 and batch_timeout_ms=10 to auto-batch without exceeding your latency budget.
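The wait-then-flush behavior can be sketched in a few dozen lines of asyncio. This is an illustration of the idea behind Triton/TorchServe/vLLM dynamic batching, not their actual implementation: block on the first request, then collect more until max_batch_size is reached or the timeout expires.

```python
# Minimal dynamic batcher sketch: collect requests until max_batch_size
# or batch_timeout, whichever comes first, then run them as one batch.
import asyncio

class DynamicBatcher:
    def __init__(self, infer_batch, max_batch_size=32, batch_timeout=0.010):
        self.infer_batch = infer_batch      # fn: list[input] -> list[output]
        self.max_batch_size = max_batch_size
        self.batch_timeout = batch_timeout  # seconds (10 ms default)
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Called by request handlers; resolves when the batch completes."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        while True:
            # Block until the first request arrives, then start the timer.
            batch = [await self.queue.get()]
            loop = asyncio.get_running_loop()
            deadline = loop.time() + self.batch_timeout
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(),
                                                        remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.infer_batch([item for item, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

async def main():
    # Fake "model" that doubles each input, batches of up to 4.
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(8)))
    worker.cancel()
    return results

print(asyncio.run(main()))   # [0, 2, 4, 6, 8, 10, 12, 14]
```

The eight concurrent requests are served as two batches of four, while each caller just awaits its own result.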

Model Parallelism vs. Data Parallelism

Use parallelism when a single GPU is not enough: either the model does not fit in memory, or you need more throughput than one GPU can provide.

Data Parallelism

How: Same model copied to N GPUs. Each GPU processes a different batch. Use when: Model fits on one GPU, need more throughput. Scaling: Near-linear up to 8–16 GPUs. Example: 4 GPUs with data parallelism = ~3.8x throughput (95% efficiency).

Tensor Parallelism

How: Split individual layers across GPUs, so each GPU computes part of every layer. Use when: The model is too large for one GPU, or you need low latency (all GPUs work on every request). Scaling: 2–8 GPUs (inter-GPU communication limits scaling). Example: Llama 70B across 4x A100s.
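A quick way to size tensor parallelism is memory arithmetic. A sketch, assuming FP16 weights (2 bytes per parameter) and roughly 40% extra memory for KV cache and activations; the overhead figure is an assumption and varies with sequence length and batch size:

```python
import math

def tensor_parallel_degree(params_billion, gpu_mem_gb,
                           bytes_per_param=2, overhead=0.4):
    """Smallest power-of-two GPU count that fits the model.
    Power of two because tensor parallelism must evenly split
    attention heads and hidden dimensions."""
    weights_gb = params_billion * bytes_per_param       # FP16 = 2 bytes/param
    total_gb = weights_gb * (1 + overhead)              # + KV cache, activations
    raw = math.ceil(total_gb / gpu_mem_gb)
    degree = 1
    while degree < raw:
        degree *= 2
    return degree

# Llama 70B on 80 GB A100s: 140 GB of weights plus overhead -> 4-way
print(tensor_parallel_degree(70, 80))   # 4
```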

Pipeline Parallelism

How: Split model layers sequentially across GPUs. GPU 1 runs layers 1–10, GPU 2 runs layers 11–20, etc. Use when: Training very large models. Scaling: Good for training, not ideal for serving (pipeline bubbles). Example: GPT-3 training across hundreds of GPUs.

Expert Parallelism

How: For Mixture-of-Experts models, place different experts on different GPUs. Use when: Serving MoE models (Mixtral). Scaling: Matches number of experts to GPUs. Example: Mixtral 8x7B with 8 experts across 8 GPUs.

Queue-Based Architecture for Async Processing

# Queue-based architecture for handling traffic spikes
# Pattern: API → Queue → Worker Pool → Result Store

# Benefits:
# 1. Absorb traffic spikes without dropping requests
# 2. Decouple API latency from inference latency
# 3. Scale workers independently based on queue depth
# 4. Retry failed jobs automatically

Architecture:
                                     ┌──────────────┐
  Client ──→ API Server ──→ Queue ──→│ GPU Worker 1 │──→ Result Store ──→ Client polls
              (fast,        (Redis/  │ GPU Worker 2 │    (Redis/S3)
               no GPU)      SQS/     │ GPU Worker 3 │
                            Kafka)   │     ...      │
                                     └──────────────┘

Auto-scaling rules for workers:
  Scale up:   queue depth > 100 for 2 minutes → add 2 workers
  Scale down: queue depth = 0 for 10 minutes → remove 1 worker
  Min workers: 2 (always warm)
  Max workers: 50 (budget cap)
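These rules reduce to a pure decision function over recent queue-depth samples. A sketch using the thresholds above, assuming the queue is polled every 30 seconds (so 4 samples cover 2 minutes and 20 samples cover 10 minutes; the polling interval is an assumption):

```python
def scaling_decision(queue_depth_history, current_workers,
                     min_workers=2, max_workers=50):
    """Return the worker delta: +2 to scale up, -1 to scale down, 0 to hold.
    History is one queue-depth sample per 30-second poll."""
    # Scale up: depth > 100 for the last 2 minutes (4 samples).
    if len(queue_depth_history) >= 4 and all(
            d > 100 for d in queue_depth_history[-4:]):
        return min(2, max_workers - current_workers)
    # Scale down: queue empty for the last 10 minutes (20 samples).
    if len(queue_depth_history) >= 20 and all(
            d == 0 for d in queue_depth_history[-20:]):
        return -1 if current_workers > min_workers else 0
    return 0

print(scaling_decision([150, 200, 180, 120], 10))   # 2
print(scaling_decision([0] * 20, 2))                # 0 (already at min)
print(scaling_decision([0] * 20, 5))                # -1
```

Keeping the decision pure (no side effects) makes the autoscaler trivial to unit-test against recorded queue-depth traces.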

Queue monitoring alerts:
  WARNING: avg wait time > 30 seconds
  CRITICAL: avg wait time > 2 minutes
  CRITICAL: queue depth > 10,000
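The whole pattern fits in a short sketch. Here `queue.Queue` and a dict stand in for Redis/SQS and the result store, and `str.upper` stands in for inference; in production each piece would be a separate service:

```python
# API -> Queue -> Worker Pool -> Result Store, in miniature.
import queue
import threading
import time
import uuid

jobs = queue.Queue()
results = {}          # job_id -> output (the "result store")

def submit(payload):
    """API server: enqueue and return a job id immediately (no GPU work)."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id

def worker(infer):
    """GPU worker: pull jobs, run inference, write results."""
    while True:
        job_id, payload = jobs.get()
        results[job_id] = infer(payload)
        jobs.task_done()

def poll(job_id, timeout=5.0, interval=0.01):
    """Client: poll the result store until the job completes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if job_id in results:
            return results[job_id]
        time.sleep(interval)
    raise TimeoutError(job_id)

# Two workers; the fake "inference" uppercases text.
for _ in range(2):
    threading.Thread(target=worker, args=(str.upper,), daemon=True).start()

job = submit("hello")
print(poll(job))   # HELLO
```

Note how `submit` returns before any inference runs: that decoupling is what lets the API stay fast during a traffic spike while the queue absorbs the backlog.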

Real Numbers: Cost/Latency Trade-offs at Different Scales

| Scale | QPS | Architecture | GPUs | Monthly Cost | Cost per 1K Requests |
|---|---|---|---|---|---|
| Startup | 10 | Single GPU, no batching | 1x A10G | $540 | $0.021 |
| Growth | 100 | 3 GPUs + batching + caching | 3x A10G | $1,620 | $0.006 |
| Scale | 1,000 | 12 GPUs + batching + 40% cache hit | 12x A10G | $6,480 | $0.0025 |
| Enterprise | 10,000 | 50 GPUs + batching + 60% cache + multi-tier | 50x A10G | $27,000 | $0.0010 |
| Hyperscale | 100,000 | 200 GPUs + all optimizations + spot instances | 200x A10G | $72,000 | $0.0003 |
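The cost column can be reproduced from QPS and monthly cost (assuming a 30-day month); the figures work out to cost per 1,000 requests:

```python
def cost_per_1k_requests(monthly_cost_usd, qps, days=30):
    """Derive the table's cost column: monthly cost / thousands of requests."""
    requests_per_month = qps * days * 86_400
    return monthly_cost_usd / (requests_per_month / 1_000)

print(round(cost_per_1k_requests(540, 10), 3))       # 0.021 (Startup row)
print(round(cost_per_1k_requests(6_480, 1_000), 4))  # 0.0025 (Scale row)
```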
💡
Apply at work tomorrow: Measure your current GPU utilization. If it is below 50%, you have free throughput available through better batching. If it is above 85%, you are at risk of latency spikes under load. Target 65–75% utilization as the sweet spot between cost efficiency and headroom.