Edge Caching for AI Inference

Cache inference results at CDN edge locations to deliver instant responses for repeated predictions and dramatically reduce GPU compute costs.

What Can Be Cached?

Not all inference is cacheable, but many AI workloads have significant cache-hit potential. The key factor is whether the same input produces the same output (determinism) and how frequently identical inputs occur.

| Workload | Cacheability | Expected hit rate |
| --- | --- | --- |
| Text embeddings | Excellent (deterministic) | 40-70% |
| Product recommendations | Good (per user segment) | 30-60% |
| Image classification | Good (same image = same result) | 20-50% |
| Search ranking | Moderate (query + context) | 15-40% |
| Chat/generation | Low (unique conversations) | 5-15% |

Cache Key Design

Python - Inference Cache Implementation
import hashlib
import json

import redis

cache = redis.Redis(host='edge-cache.example.com')

def get_cache_key(model_version, input_data):
    # Deterministic hash of model version + normalized input
    normalized = json.dumps(input_data, sort_keys=True)
    content = f"{model_version}:{normalized}"
    return f"inference:{hashlib.sha256(content.encode()).hexdigest()}"

def cached_inference(model_version, input_data):
    key = get_cache_key(model_version, input_data)
    cached = cache.get(key)
    if cached:
        return json.loads(cached)  # Cache hit

    result = run_model_inference(input_data)  # Cache miss
    cache.setex(key, 3600, json.dumps(result))  # TTL: 1 hour
    return result
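The `sort_keys=True` normalization is what makes the key stable: two requests whose JSON fields arrive in different order must map to the same cache entry. A quick self-contained check (reusing the same key function, no Redis required) confirms this:

```python
import hashlib
import json

def get_cache_key(model_version, input_data):
    # Sorted keys make the serialized JSON byte-for-byte stable
    normalized = json.dumps(input_data, sort_keys=True)
    content = f"{model_version}:{normalized}"
    return f"inference:{hashlib.sha256(content.encode()).hexdigest()}"

# Same logical input, different dict ordering -> identical cache key
a = get_cache_key("v2", {"text": "hello", "lang": "en"})
b = get_cache_key("v2", {"lang": "en", "text": "hello"})
assert a == b
```

Without the normalization, otherwise-identical requests would scatter across distinct keys and the hit rate would drop accordingly.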

Cache Invalidation Strategies

🕑

TTL-Based

Set time-to-live based on acceptable staleness. Embedding caches can live for hours; recommendation caches may need minutes.
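One way to apply this is a per-workload TTL table instead of the single hard-coded 3600 seconds above. The values and workload names here are illustrative assumptions; tune them to your own staleness tolerance:

```python
# Illustrative TTLs per workload type (seconds) -- tune to your needs
TTL_SECONDS = {
    "embedding": 6 * 3600,    # deterministic output, safe for hours
    "recommendation": 300,    # user-facing freshness matters
    "classification": 3600,
}

def ttl_for(workload):
    # Conservative default for workloads not listed above
    return TTL_SECONDS.get(workload, 600)
```

The lookup then replaces the fixed TTL in the caching function, e.g. `cache.setex(key, ttl_for("embedding"), json.dumps(result))`.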

📄

Version-Based

Include model version in the cache key. When you deploy a new model, all old cache entries automatically become misses.

🚨

Event-Driven

Purge cache entries when underlying data changes. For example, invalidate product recommendation caches when inventory updates occur.
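Since the cache keys above are opaque hashes, purging "all entries that depend on product X" requires a secondary index from data items to cache keys. The sketch below shows the idea with a plain in-memory store; in Redis the same pattern maps to a SET of keys per tag that you delete together when an update event arrives:

```python
from collections import defaultdict

class TaggedCache:
    """In-memory sketch of event-driven invalidation via tags."""

    def __init__(self):
        self.store = {}
        self.tag_index = defaultdict(set)  # tag -> set of cache keys

    def set(self, key, value, tags=()):
        self.store[key] = value
        for tag in tags:
            self.tag_index[tag].add(key)

    def get(self, key):
        return self.store.get(key)

    def invalidate_tag(self, tag):
        # Called from an event handler, e.g. on an inventory update
        for key in self.tag_index.pop(tag, set()):
            self.store.pop(key, None)

cache = TaggedCache()
cache.set("rec:user42", ["sku1", "sku2"], tags=["product:sku1"])
cache.invalidate_tag("product:sku1")  # inventory changed for sku1
assert cache.get("rec:user42") is None
```

The trade-off is extra writes on every cache fill (one index update per tag) in exchange for precise, immediate purges instead of waiting out a TTL.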

Best practice: Always include the model version in your cache key. This ensures that cache entries from an old model are never served after a model update. It also lets you run A/B tests between model versions without cache cross-contamination.