Intermediate
Edge Caching for AI Inference
Cache inference results at CDN edge locations to deliver instant responses for repeated predictions and dramatically reduce GPU compute costs.
What Can Be Cached?
Not all inference is cacheable, but many AI workloads have significant cache-hit potential. The key factors are whether the same input always produces the same output (determinism) and how frequently identical inputs recur.
| Workload | Cacheability | Expected Hit Rate |
|---|---|---|
| Text embeddings | Excellent (deterministic) | 40-70% |
| Product recommendations | Good (per user segment) | 30-60% |
| Image classification | Good (same image = same result) | 20-50% |
| Search ranking | Moderate (query + context) | 15-40% |
| Chat/generation | Low (unique conversations) | 5-15% |
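The hit rates above translate directly into compute savings, since each hit replaces a GPU call with a cheap cache lookup. A rough way to estimate the blended per-request cost (the dollar figures below are illustrative, not from any specific provider):

```python
def effective_cost(gpu_cost_per_call, cache_cost_per_call, hit_rate):
    # Blended per-request cost: hits are served from cache, misses run on GPU
    return hit_rate * cache_cost_per_call + (1 - hit_rate) * gpu_cost_per_call

# Example: $0.002 per GPU inference, $0.00001 per cache hit, 50% hit rate
baseline = 0.002
blended = effective_cost(baseline, 0.00001, 0.5)
savings = 1 - blended / baseline  # fraction of inference spend avoided
```

With these numbers, a 50% hit rate cuts inference spend nearly in half, which is why even moderate-cacheability workloads like search ranking can be worth caching.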
Cache Key Design
Python - Inference Cache Implementation
```python
import hashlib
import json

import redis

cache = redis.Redis(host='edge-cache.example.com')

def get_cache_key(model_version, input_data):
    # Deterministic hash of model version + normalized input
    normalized = json.dumps(input_data, sort_keys=True)
    content = f"{model_version}:{normalized}"
    return f"inference:{hashlib.sha256(content.encode()).hexdigest()}"

def cached_inference(model_version, input_data):
    key = get_cache_key(model_version, input_data)
    cached = cache.get(key)
    if cached:
        return json.loads(cached)  # Cache hit
    result = run_model_inference(input_data)  # Cache miss: run the model
    cache.setex(key, 3600, json.dumps(result))  # TTL: 1 hour
    return result
```
Cache Invalidation Strategies
TTL-Based
Set time-to-live based on acceptable staleness. Embedding caches can live for hours; recommendation caches may need minutes.
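One way to encode this policy is a per-workload TTL table with a conservative fallback. The workload names and durations below are illustrative assumptions, not fixed recommendations:

```python
# Illustrative TTLs (seconds) per workload, tuned to acceptable staleness
TTL_BY_WORKLOAD = {
    "embedding": 6 * 3600,   # deterministic output, safe to keep for hours
    "recommendation": 300,   # underlying data changes within minutes
    "classification": 3600,
}

def ttl_for(workload):
    # Unknown workloads get a short, conservative default
    return TTL_BY_WORKLOAD.get(workload, 60)
```

The chosen TTL would then be passed as the expiry argument when writing the cache entry (e.g. the second argument to `setex` in the implementation above).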
Version-Based
Include model version in the cache key. When you deploy a new model, all old cache entries automatically become misses.
Event-Driven
Purge cache entries when underlying data changes. For example, invalidate product recommendation caches when inventory updates occur.
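Event-driven invalidation requires knowing which cache entries depend on which data. One common approach is to tag entries at write time and purge by tag when a change event arrives. Below is a minimal in-memory sketch of the idea (a production system would typically use Redis sets or key prefixes for the tag index; the class and names here are hypothetical):

```python
from collections import defaultdict

class TaggedCache:
    """In-memory sketch: entries carry tags naming the data they depend on,
    so a data-change event can purge exactly the affected entries."""
    def __init__(self):
        self._store = {}
        self._keys_by_tag = defaultdict(set)

    def set(self, key, value, tags=()):
        self._store[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key):
        return self._store.get(key)

    def purge_tag(self, tag):
        # Called from the data-change event handler (e.g. inventory update)
        for key in self._keys_by_tag.pop(tag, ()):
            self._store.pop(key, None)

cache = TaggedCache()
cache.set("rec:user42", ["sku1", "sku2"], tags=["product:sku1"])
# Inventory update for sku1 arrives -> purge every dependent entry
cache.purge_tag("product:sku1")
```

The design choice to track: the tag index must be updated on every cache write, which adds a small cost per write in exchange for precise, immediate invalidation instead of waiting out a TTL.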
Best practice: Always include the model version in your cache key. This ensures that cache entries from an old model are never served after a model update. It also lets you run A/B tests between model versions without cache cross-contamination.
Lilly Tech Systems