
Scaling AI Systems

Scaling AI is not just "add more servers." GPUs are expensive, model loading is slow, and naive scaling leaves GPUs 80% idle while still burning money. This lesson covers the patterns that let you handle 10x traffic without 10x cost.

Horizontal vs. Vertical Scaling for AI

| Strategy | What It Means for AI | When to Use | Cost Impact |
|---|---|---|---|
| Vertical (bigger GPU) | Move from A10G (24 GB) to A100 (80 GB) to H100 (80 GB) | Model does not fit in memory, or you need faster single-request latency | 2–5x cost per GPU, but may need fewer GPUs |
| Horizontal (more GPUs) | Add more replicas of the same model on identical GPUs | Need more throughput; model fits on a single GPU | Linear cost increase with linear throughput increase |
| Model optimization | Quantize (FP16/INT8), distill, or prune to make the model smaller/faster | Always try this first; free throughput gains | Reduces cost per request by 2–4x |
💡
Rule of thumb: Before adding more GPUs, try: (1) quantization (FP32 → FP16 = 2x throughput, free), (2) dynamic batching (2–5x throughput), (3) TensorRT/torch.compile optimization (1.5–3x). Only after these optimizations should you add more replicas.
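These multipliers compound, which is why optimization usually beats adding replicas. A minimal sketch of the arithmetic, using the rule-of-thumb midpoints above (illustrative figures, not measurements):

```python
import math

def combined_speedup(multipliers):
    """Multiply independent throughput gains together."""
    total = 1.0
    for m in multipliers:
        total *= m
    return total

def replicas_needed(target_qps, base_qps_per_gpu, speedup):
    """GPUs required after optimization, rounded up."""
    return math.ceil(target_qps / (base_qps_per_gpu * speedup))

# FP16 quantization (2x) * dynamic batching (3x) * torch.compile (1.5x)
speedup = combined_speedup([2.0, 3.0, 1.5])
print(speedup)                              # 9.0
print(replicas_needed(900, 100, speedup))   # 1 GPU instead of 9
```

At a baseline of 100 req/s per GPU, a 9x stack of optimizations turns a 9-GPU deployment into a single GPU, before any horizontal scaling.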

Caching Strategies for AI Systems

Caching is the most cost-effective scaling strategy for AI. If you can serve 30% of requests from cache, you need 30% fewer GPUs.

Result Caching

What: Cache the full model output for a given input. Hit rate: 10–60% depending on input diversity. Best for: Recommendation APIs (same user, same context = same results for 5 min), classification of common inputs, FAQ-style chatbot queries. Implementation: Hash the input, store result in Redis with TTL.

Embedding Caching

What: Cache computed embeddings so they do not need to be recomputed. Hit rate: 80–95% for product/document embeddings. Best for: Semantic search, RAG systems, recommendation candidate retrieval. Implementation: Precompute all item embeddings, store in FAISS/Pinecone, update daily.

Prompt Caching

What: Cache the KV-cache for common prompt prefixes in LLMs. Hit rate: 40–80% when system prompts are shared. Best for: LLM applications where many requests share the same system prompt. Implementation: vLLM prefix caching, or the Anthropic/OpenAI prompt caching APIs (cached input tokens are billed at a significant discount).
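A toy illustration of why shared system prompts drive the hit rate: requests keyed on the same prefix reuse the expensive prefill work. Real engines such as vLLM cache KV tensors per token block; this sketch just caches a placeholder to show the mechanism.

```python
# Toy prefix cache: requests sharing a system-prompt prefix reuse the
# cached "prefill" result for that prefix. `prefill` is a stand-in for
# the expensive KV-cache computation.
import hashlib

prefix_cache = {}

def prefill(prompt: str):
    """Pretend this is the expensive KV-cache computation."""
    return f"kv:{len(prompt)} chars"

def generate(system_prompt: str, user_message: str) -> bool:
    """Returns True if the prefix cache was hit."""
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    hit = key in prefix_cache
    if not hit:
        prefix_cache[key] = prefill(system_prompt)
    kv = prefix_cache[key]   # reused by every request sharing this prefix
    # ...only user_message still needs prefill + decode...
    return hit

first = generate("You are a helpful assistant.", "What is RAG?")
second = generate("You are a helpful assistant.", "Explain batching.")
print(first, second)   # False True — the second request hits the cache
```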

Feature Caching

What: Cache computed features to avoid recomputing from raw data. Hit rate: 90%+ for user features. Best for: When feature computation is expensive (complex aggregations, joins). Implementation: Feature store with online serving layer (Redis).

# Multi-layer caching for AI inference
import hashlib
import json

import numpy as np
from redis import Redis

# Assumes a `metrics` client (e.g., a statsd/Datadog wrapper) is available.

class CachedInference:
    def __init__(self, model, redis_client: Redis):
        self.model = model
        self.cache = redis_client
        self.RESULT_TTL = 300      # 5 minutes for full results
        self.EMBEDDING_TTL = 86400  # 24 hours for embeddings

    async def predict(self, request):
        # Layer 1: Check result cache (exact match)
        cache_key = f"result:{self._hash(request)}"
        cached = self.cache.get(cache_key)
        if cached:
            metrics.increment("cache.result.hit")
            return json.loads(cached)

        # Layer 2: Check embedding cache (avoid recomputation)
        embedding_key = f"emb:{request.item_id}"
        cached_embedding = self.cache.get(embedding_key)
        if cached_embedding:
            metrics.increment("cache.embedding.hit")
            embedding = np.frombuffer(cached_embedding, dtype=np.float32)
        else:
            metrics.increment("cache.embedding.miss")
            embedding = self.model.encode(request.text)
            self.cache.setex(embedding_key, self.EMBEDDING_TTL,
                             embedding.tobytes())

        # Layer 3: Run inference with cached embedding
        result = self.model.rank(embedding, request.context)

        # Store in result cache
        self.cache.setex(cache_key, self.RESULT_TTL, json.dumps(result))
        return result

    def _hash(self, request) -> str:
        """Deterministic hash of request for cache key."""
        normalized = json.dumps(request.dict(), sort_keys=True)
        return hashlib.sha256(normalized.encode()).hexdigest()[:16]

Batching Inference Requests

GPUs are optimized for parallel computation. Processing one request at a time wastes 80%+ of GPU capacity. Dynamic batching groups incoming requests and processes them together.

| Batch Size | Latency per Request | Throughput (req/s) | GPU Utilization | Cost per 1K Requests |
|---|---|---|---|---|
| 1 (no batching) | 10 ms | 100 | 15% | $0.0028 |
| 4 | 15 ms | 267 | 45% | $0.0010 |
| 8 | 22 ms | 364 | 65% | $0.0008 |
| 16 | 35 ms | 457 | 80% | $0.0006 |
| 32 | 55 ms | 582 | 90% | $0.0005 |
| 64 | 100 ms | 640 | 93% | $0.0004 |

Notice the trade-off: larger batches give better throughput and lower cost per request, but increase latency. Configure your batch size based on your latency budget.
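One simple policy for this trade-off: pick the largest batch whose per-request latency still fits your budget. A sketch, using the (batch size, latency) pairs from the benchmark table above:

```python
# (batch_size, latency_ms) pairs from the benchmark table above.
PROFILES = [(1, 10), (4, 15), (8, 22), (16, 35), (32, 55), (64, 100)]

def pick_batch_size(latency_budget_ms: float) -> int:
    """Largest batch size whose latency fits the budget; falls back to 1."""
    best = 1
    for batch, latency in PROFILES:
        if latency <= latency_budget_ms:
            best = batch
    return best

print(pick_batch_size(50))    # 16 (35 ms fits the budget; 55 ms does not)
print(pick_batch_size(100))   # 64
```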

💡
Dynamic batching: Instead of fixed batch sizes, use dynamic batching (available in Triton, TorchServe, vLLM). It waits up to N milliseconds to collect requests, then processes whatever has accumulated. Set max_batch_size=32 and batch_timeout_ms=10 to auto-batch without exceeding your latency budget.
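The wait-then-flush behavior can be sketched in a few dozen lines of asyncio. This is an illustration of the idea behind Triton/TorchServe/vLLM dynamic batching, not their actual implementation: block on the first request, then collect more until max_batch_size is reached or the timeout expires.

```python
# Minimal dynamic batcher sketch: collect requests until max_batch_size
# or batch_timeout, whichever comes first, then run them as one batch.
import asyncio

class DynamicBatcher:
    def __init__(self, infer_batch, max_batch_size=32, batch_timeout=0.010):
        self.infer_batch = infer_batch      # fn: list[input] -> list[output]
        self.max_batch_size = max_batch_size
        self.batch_timeout = batch_timeout  # seconds (10 ms default)
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Called by request handlers; resolves when the batch completes."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        while True:
            # Block until the first request arrives, then start the timer.
            batch = [await self.queue.get()]
            loop = asyncio.get_running_loop()
            deadline = loop.time() + self.batch_timeout
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(),
                                                        remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.infer_batch([item for item, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

async def main():
    # Fake "model" that doubles each input, batches of up to 4.
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(8)))
    worker.cancel()
    return results

print(asyncio.run(main()))   # [0, 2, 4, 6, 8, 10, 12, 14]
```

The eight concurrent requests are served as two batches of four, while each caller just awaits its own result.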

Model Parallelism vs. Data Parallelism

Use parallelism when a single GPU is not enough: either the model does not fit in memory, or you need more throughput than one GPU can provide.

Data Parallelism

How: Same model copied to N GPUs. Each GPU processes a different batch. Use when: Model fits on one GPU, need more throughput. Scaling: Near-linear up to 8–16 GPUs. Example: 4 GPUs with data parallelism = ~3.8x throughput (95% efficiency).

Tensor Parallelism

How: Split individual layers across GPUs, so each GPU computes part of every layer. Use when: The model is too large for one GPU, or you need low latency (all GPUs work on every request). Scaling: 2–8 GPUs (inter-GPU communication limits scaling). Example: Llama 70B across 4x A100s.
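A quick way to size tensor parallelism is memory arithmetic. A sketch, assuming FP16 weights (2 bytes per parameter) and roughly 40% extra memory for KV cache and activations; the overhead figure is an assumption and varies with sequence length and batch size:

```python
import math

def tensor_parallel_degree(params_billion, gpu_mem_gb,
                           bytes_per_param=2, overhead=0.4):
    """Smallest power-of-two GPU count that fits the model.
    Power of two because tensor parallelism must evenly split
    attention heads and hidden dimensions."""
    weights_gb = params_billion * bytes_per_param       # FP16 = 2 bytes/param
    total_gb = weights_gb * (1 + overhead)              # + KV cache, activations
    raw = math.ceil(total_gb / gpu_mem_gb)
    degree = 1
    while degree < raw:
        degree *= 2
    return degree

# Llama 70B on 80 GB A100s: 140 GB of weights plus overhead -> 4-way
print(tensor_parallel_degree(70, 80))   # 4
```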

Pipeline Parallelism

How: Split model layers sequentially across GPUs. GPU 1 runs layers 1–10, GPU 2 runs layers 11–20, etc. Use when: Training very large models. Scaling: Good for training, not ideal for serving (pipeline bubbles). Example: GPT-3 training across hundreds of GPUs.

Expert Parallelism

How: For Mixture-of-Experts models, place different experts on different GPUs. Use when: Serving MoE models (Mixtral). Scaling: Matches number of experts to GPUs. Example: Mixtral 8x7B with 8 experts across 8 GPUs.

Queue-Based Architecture for Async Processing

# Queue-based architecture for handling traffic spikes
# Pattern: API → Queue → Worker Pool → Result Store

# Benefits:
# 1. Absorb traffic spikes without dropping requests
# 2. Decouple API latency from inference latency
# 3. Scale workers independently based on queue depth
# 4. Retry failed jobs automatically

Architecture:
                                     ┌──────────────┐
  Client ──→ API Server ──→ Queue ──→│ GPU Worker 1 │──→ Result Store ──→ Client polls
              (fast,        (Redis/  │ GPU Worker 2 │    (Redis/S3)
               no GPU)      SQS/     │ GPU Worker 3 │
                            Kafka)   │     ...      │
                                     └──────────────┘

Auto-scaling rules for workers:
  Scale up:   queue depth > 100 for 2 minutes → add 2 workers
  Scale down: queue depth = 0 for 10 minutes → remove 1 worker
  Min workers: 2 (always warm)
  Max workers: 50 (budget cap)
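These rules reduce to a pure decision function over recent queue-depth samples. A sketch using the thresholds above, assuming the queue is polled every 30 seconds (so 4 samples cover 2 minutes and 20 samples cover 10 minutes; the polling interval is an assumption):

```python
def scaling_decision(queue_depth_history, current_workers,
                     min_workers=2, max_workers=50):
    """Return the worker delta: +2 to scale up, -1 to scale down, 0 to hold.
    History is one queue-depth sample per 30-second poll."""
    # Scale up: depth > 100 for the last 2 minutes (4 samples).
    if len(queue_depth_history) >= 4 and all(
            d > 100 for d in queue_depth_history[-4:]):
        return min(2, max_workers - current_workers)
    # Scale down: queue empty for the last 10 minutes (20 samples).
    if len(queue_depth_history) >= 20 and all(
            d == 0 for d in queue_depth_history[-20:]):
        return -1 if current_workers > min_workers else 0
    return 0

print(scaling_decision([150, 200, 180, 120], 10))   # 2
print(scaling_decision([0] * 20, 2))                # 0 (already at min)
print(scaling_decision([0] * 20, 5))                # -1
```

Keeping the decision pure (no side effects) makes the autoscaler trivial to unit-test against recorded queue-depth traces.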

Queue monitoring alerts:
  WARNING: avg wait time > 30 seconds
  CRITICAL: avg wait time > 2 minutes
  CRITICAL: queue depth > 10,000
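The whole pattern fits in a short sketch. Here `queue.Queue` and a dict stand in for Redis/SQS and the result store, and `str.upper` stands in for inference; in production each piece would be a separate service:

```python
# API -> Queue -> Worker Pool -> Result Store, in miniature.
import queue
import threading
import time
import uuid

jobs = queue.Queue()
results = {}          # job_id -> output (the "result store")

def submit(payload):
    """API server: enqueue and return a job id immediately (no GPU work)."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, payload))
    return job_id

def worker(infer):
    """GPU worker: pull jobs, run inference, write results."""
    while True:
        job_id, payload = jobs.get()
        results[job_id] = infer(payload)
        jobs.task_done()

def poll(job_id, timeout=5.0, interval=0.01):
    """Client: poll the result store until the job completes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if job_id in results:
            return results[job_id]
        time.sleep(interval)
    raise TimeoutError(job_id)

# Two workers; the fake "inference" uppercases text.
for _ in range(2):
    threading.Thread(target=worker, args=(str.upper,), daemon=True).start()

job = submit("hello")
print(poll(job))   # HELLO
```

Note how `submit` returns before any inference runs: that decoupling is what lets the API stay fast during a traffic spike while the queue absorbs the backlog.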

Real Numbers: Cost/Latency Trade-offs at Different Scales

| Scale | QPS | Architecture | GPUs | Monthly Cost | Cost per 1K Requests |
|---|---|---|---|---|---|
| Startup | 10 | Single GPU, no batching | 1x A10G | $540 | $0.021 |
| Growth | 100 | 3 GPUs + batching + caching | 3x A10G | $1,620 | $0.006 |
| Scale | 1,000 | 12 GPUs + batching + 40% cache hit | 12x A10G | $6,480 | $0.0025 |
| Enterprise | 10,000 | 50 GPUs + batching + 60% cache + multi-tier | 50x A10G | $27,000 | $0.0010 |
| Hyperscale | 100,000 | 200 GPUs + all optimizations + spot instances | 200x A10G | $72,000 | $0.0003 |
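The cost column can be reproduced from QPS and monthly cost (assuming a 30-day month); the figures work out to cost per 1,000 requests:

```python
def cost_per_1k_requests(monthly_cost_usd, qps, days=30):
    """Derive the table's cost column: monthly cost / thousands of requests."""
    requests_per_month = qps * days * 86_400
    return monthly_cost_usd / (requests_per_month / 1_000)

print(round(cost_per_1k_requests(540, 10), 3))       # 0.021 (Startup row)
print(round(cost_per_1k_requests(6_480, 1_000), 4))  # 0.0025 (Scale row)
```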
💡
Apply at work tomorrow: Measure your current GPU utilization. If it is below 50%, you have free throughput available through better batching. If it is above 85%, you are at risk of latency spikes under load. Target 65–75% utilization as the sweet spot between cost efficiency and headroom.