Advanced Implementation Patterns
Effective prompt caching requires thoughtful prompt architecture. This lesson covers production-ready patterns for structuring prompts, warming caches, monitoring performance, and optimizing multi-turn conversations.
Prompt Structure for Maximum Caching
The golden rule: put static content first, dynamic content last.
Optimal Prompt Structure
```
# OPTIMAL ORDER (most cacheable to least):
1. System prompt (static)         ← Always cached
2. Tool definitions (static)      ← Always cached
3. Few-shot examples (static)     ← Always cached
4. Document context (semi-static) ← Often cached
5. Conversation history (growing) ← Partially cached
6. Current user message (dynamic) ← Never cached
```
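The ordering above can be sketched as a small request builder. This is a hypothetical helper (`build_request` and its parameters are illustrative, not part of any SDK); the key idea is that the `cache_control` breakpoint sits on the last semi-static block, so everything from the system prompt through the document context is covered by one cached prefix:

```python
def build_request(system_prompt, examples, context, history, user_msg):
    """Assemble a Messages API request with static content first.

    The cache breakpoint is placed on the last semi-static block, so the
    entire static prefix (system prompt, examples, context) is cached.
    """
    system_blocks = [
        {"type": "text", "text": system_prompt},   # 1. static system prompt
        {"type": "text", "text": examples},        # 3. static few-shot examples
        {"type": "text",                           # 4. semi-static context
         "text": context,
         "cache_control": {"type": "ephemeral"}},  # ← breakpoint ends the prefix
    ]
    # 5-6. growing history, then the never-cached current message
    messages = list(history) + [{"role": "user", "content": user_msg}]
    return {"system": system_blocks, "messages": messages}
```

Anything placed after the breakpoint (history and the current message) can change freely without invalidating the cached prefix.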
Cache Warming Strategy
For Anthropic's explicit caching, you can "warm" the cache by sending a lightweight request that creates the cache entry before real traffic arrives:
```python
import anthropic

client = anthropic.Anthropic()

def warm_cache(system_prompt):
    """Pre-warm the cache with a minimal request."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1,  # Minimal output
        system=[{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": "hi"}]
    )
    print(f"Cache warmed: {response.usage.cache_creation_input_tokens} tokens")

# Call this during app startup or on a schedule
warm_cache(my_long_system_prompt)
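Because the ephemeral cache expires after roughly five minutes of inactivity, warming once at startup may not be enough for bursty traffic. One option is to re-warm on a timer; the sketch below is an assumption-laden example (the `keep_cache_warm` helper and the 240-second default are illustrative, chosen to stay inside the roughly five-minute TTL):

```python
import threading

def keep_cache_warm(warm_fn, interval_s=240, stop_event=None):
    """Call warm_fn every interval_s seconds until stop_event is set.

    240s (4 minutes) is a sketch value intended to land inside the
    ~5-minute ephemeral cache lifetime, which is refreshed on each read.
    """
    stop_event = stop_event or threading.Event()

    def _loop():
        # Event.wait returns False on timeout, True once stop is requested
        while not stop_event.wait(interval_s):
            warm_fn()

    threading.Thread(target=_loop, daemon=True).start()
    return stop_event
```

In practice, steady production traffic refreshes the cache on its own; a re-warm loop mainly helps during quiet periods when you want the next real request to hit a warm cache.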
Monitoring Cache Performance
```python
class CacheMonitor:
    def __init__(self):
        self.total_requests = 0
        self.cache_hits = 0
        self.tokens_saved = 0

    def track(self, usage):
        self.total_requests += 1
        cached = getattr(usage, 'cache_read_input_tokens', 0)
        if cached > 0:
            self.cache_hits += 1
            self.tokens_saved += cached

    @property
    def hit_rate(self):
        if self.total_requests == 0:
            return 0
        return self.cache_hits / self.total_requests * 100

    def report(self):
        print(f"Cache hit rate: {self.hit_rate:.1f}%")
        print(f"Tokens saved: {self.tokens_saved:,}")

monitor = CacheMonitor()
```
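To turn those counters into dollars, a rough estimate can apply Anthropic's cache pricing multipliers: cache writes billed at about 1.25x the base input rate and cache reads at about 0.1x. Treat these multipliers, and the `estimate_savings` helper itself, as assumptions to verify against current pricing:

```python
def estimate_savings(cache_read_tokens, cache_write_tokens, base_price_per_mtok):
    """Estimated dollars saved vs. sending the same tokens uncached.

    Assumes cache writes cost ~1.25x and cache reads ~0.1x the base
    input price per million tokens (verify against current pricing).
    """
    total_tokens = cache_read_tokens + cache_write_tokens
    full_cost = total_tokens / 1e6 * base_price_per_mtok
    cache_cost = (cache_read_tokens * 0.1
                  + cache_write_tokens * 1.25) / 1e6 * base_price_per_mtok
    return full_cost - cache_cost
```

Note that savings can be negative when a prefix is written once and never read again, which is why tracking the hit rate alongside cost matters.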
Multi-Turn Conversation Pattern
For multi-turn conversations, use a "sliding cache window" that caches the growing conversation history:
```python
def build_messages_with_cache(history, new_message):
    """Build message list with cache breakpoint on history."""
    messages = []
    # Add all history messages
    for i, msg in enumerate(history):
        if i == len(history) - 1:
            # Cache breakpoint on last history message
            messages.append({
                "role": msg["role"],
                "content": [{
                    "type": "text",
                    "text": msg["content"],
                    "cache_control": {"type": "ephemeral"}
                }]
            })
        else:
            messages.append(msg)
    # Add new user message (not cached)
    messages.append({"role": "user", "content": new_message})
    return messages
```
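Between turns, the history has to grow so the breakpoint keeps sliding forward. A small hypothetical helper (`advance_history` is illustrative, not part of any SDK) makes that bookkeeping explicit:

```python
def advance_history(history, user_message, assistant_reply):
    """Fold a completed exchange back into the conversation history.

    On the next turn the cache breakpoint lands on the assistant reply,
    so the whole conversation up to this point becomes the cached prefix.
    """
    history = list(history)  # avoid mutating the caller's list
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": assistant_reply})
    return history
```

Each turn then reads the previous prefix from cache and writes only the newly appended exchange, which keeps per-turn cost roughly flat even as the conversation grows.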
Cache Invalidation: Any change to the cached prefix invalidates the cache. Even adding a single character to your system prompt creates a new cache entry. Use version-controlled prompts to avoid accidental invalidation.
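One lightweight safeguard is to fingerprint each prompt version with a content hash and log it alongside your cache metrics; an unexpected fingerprint change then explains a sudden drop in hit rate. The `prompt_version` helper below is a hypothetical sketch:

```python
import hashlib

def prompt_version(prompt_text):
    """Short content hash of a prompt.

    Log this with each request: if the fingerprint changes, the cached
    prefix has been invalidated, even if the edit was a single character.
    """
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]
```

Pinning prompts to a fingerprint in version control makes accidental edits (trailing whitespace, reordered sections) visible before they silently double your cache-write costs.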
Lilly Tech Systems