Advanced Implementation Patterns

Effective prompt caching requires thoughtful prompt architecture. This lesson covers production-ready patterns for structuring prompts, warming caches, monitoring performance, and optimizing multi-turn conversations.

Prompt Structure for Maximum Caching

The golden rule: put static content first, dynamic content last.

Optimal Prompt Structure
# OPTIMAL ORDER (most cacheable to least):
1. System prompt (static)          ← Always cached
2. Tool definitions (static)       ← Always cached
3. Few-shot examples (static)      ← Always cached
4. Document context (semi-static)  ← Often cached
5. Conversation history (growing)  ← Partially cached
6. Current user message (dynamic)  ← Never cached
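Assembled as an API request, that ordering looks like the sketch below. The prompt text, tool list, and document here are illustrative placeholders; the key point is that the `cache_control` breakpoint sits on the last static block, caching everything above it:

```python
# Sketch: a request body assembled in the optimal order above.
# SYSTEM_PROMPT, TOOLS, EXAMPLES, and DOCUMENT are placeholder values.

SYSTEM_PROMPT = "You are a support assistant."           # 1. static
TOOLS = [{"name": "lookup_order",                        # 2. static
          "description": "Look up an order by ID",
          "input_schema": {"type": "object", "properties": {}}}]
EXAMPLES = "Q: Where is my order?\nA: Let me check..."   # 3. static
DOCUMENT = "<product manual text>"                       # 4. semi-static

def build_request(history, user_message):
    """Static blocks first, a breakpoint on the last one; dynamic content last."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "tools": TOOLS,  # tool definitions are part of the cached prefix
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT},
            {"type": "text", "text": EXAMPLES},
            # Breakpoint here caches system prompt + examples + document.
            {"type": "text", "text": DOCUMENT,
             "cache_control": {"type": "ephemeral"}},
        ],
        # 5. conversation history (growing) + 6. current message (dynamic)
        "messages": history + [{"role": "user", "content": user_message}],
    }
```

The dict can be passed directly as `client.messages.create(**build_request(...))`; only the blocks before the breakpoint are eligible for cache reads.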

Cache Warming Strategy

For Anthropic's explicit caching, you can "warm" the cache by sending a lightweight request that creates the cache entry before real traffic arrives:

Python
import anthropic

client = anthropic.Anthropic()

def warm_cache(system_prompt):
    """Pre-warm the cache with a minimal request."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1,  # Minimal output
        system=[{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": "hi"}]
    )
    print(f"Cache warmed: {response.usage.cache_creation_input_tokens} tokens")

# Call this during app startup or on a schedule. If the cache is already
# warm, the response reports cache_read_input_tokens instead of a write.
warm_cache(my_long_system_prompt)
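Ephemeral cache entries expire after about five minutes of inactivity, so scheduled warming has to re-send before the TTL lapses. A minimal sketch of that scheduling decision follows; the 300-second TTL matches Anthropic's documented default, while the 60-second safety margin is an assumption to tune:

```python
import time

CACHE_TTL_SECONDS = 300      # Anthropic's default ephemeral cache TTL (~5 min)
SAFETY_MARGIN_SECONDS = 60   # re-warm well before expiry (assumed margin)

class CacheWarmer:
    """Tracks the last warming call and decides when to warm again."""

    def __init__(self):
        self.last_warmed = 0.0

    def should_warm(self, now=None):
        """True once we are within the safety margin of cache expiry."""
        now = time.time() if now is None else now
        return now - self.last_warmed >= CACHE_TTL_SECONDS - SAFETY_MARGIN_SECONDS

    def mark_warmed(self, now=None):
        """Record a successful warm_cache() call."""
        self.last_warmed = time.time() if now is None else now

# In a background loop:
#   if warmer.should_warm():
#       warm_cache(my_long_system_prompt)
#       warmer.mark_warmed()
```

Note that every real request hitting the cached prefix also refreshes the TTL, so scheduled warming only matters during idle gaps in traffic.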

Monitoring Cache Performance

Python
class CacheMonitor:
    def __init__(self):
        self.total_requests = 0
        self.cache_hits = 0
        self.tokens_saved = 0

    def track(self, usage):
        self.total_requests += 1
        # Field may be absent or None on SDK versions without caching support
        cached = getattr(usage, 'cache_read_input_tokens', 0) or 0
        if cached > 0:
            self.cache_hits += 1
            self.tokens_saved += cached

    @property
    def hit_rate(self):
        if self.total_requests == 0:
            return 0
        return self.cache_hits / self.total_requests * 100

    def report(self):
        print(f"Cache hit rate: {self.hit_rate:.1f}%")
        print(f"Tokens saved:   {self.tokens_saved:,}")

monitor = CacheMonitor()
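To turn those monitor numbers into a cost estimate, you can compare what the same tokens would have cost uncached. The sketch below assumes Anthropic's published multipliers for the 5-minute ephemeral tier (cache writes at 1.25x the base input rate, cache reads at 0.10x); the base price per million tokens is a placeholder to replace with your model's actual rate:

```python
def estimate_savings(usage_records, base_price_per_mtok=3.00):
    """Rough dollar savings from caching across a list of usage dicts.

    Assumes writes cost 1.25x and reads 0.10x the base input rate,
    per Anthropic's 5-minute ephemeral cache pricing.
    """
    uncached_cost = 0.0
    cached_cost = 0.0
    for u in usage_records:
        plain = u.get("input_tokens", 0)
        written = u.get("cache_creation_input_tokens", 0)
        read = u.get("cache_read_input_tokens", 0)
        total = plain + written + read
        # What every token would cost at the plain input rate
        uncached_cost += total * base_price_per_mtok / 1_000_000
        # What they actually cost with the cache multipliers applied
        cached_cost += ((plain * 1.00 + written * 1.25 + read * 0.10)
                        * base_price_per_mtok / 1_000_000)
    return uncached_cost - cached_cost
```

Savings go negative when a prefix is written but never read again, which is a useful signal that the cached content churns too often.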

Multi-Turn Conversation Pattern

For multi-turn conversations, use a "sliding cache window" that caches the growing conversation history:

Python
def build_messages_with_cache(history, new_message):
    """Build message list with cache breakpoint on history."""
    messages = []

    # Add all history messages
    for i, msg in enumerate(history):
        if i == len(history) - 1:
            # Cache breakpoint on last history message
            messages.append({
                "role": msg["role"],
                "content": [{
                    "type": "text",
                    "text": msg["content"],
                    "cache_control": {"type": "ephemeral"}
                }]
            })
        else:
            messages.append(msg)

    # Add new user message (not cached)
    messages.append({"role": "user", "content": new_message})
    return messages
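One caveat with the sliding window: dropping old messages to stay under a context budget changes the cached prefix and forces a full cache re-write. Trimming in larger chunks, rather than one message per turn, means the prefix only changes once per chunk of turns. A sketch of that idea; the message-count budget and chunk size are assumptions, and a production version would count tokens instead of messages:

```python
def trim_history(history, max_messages=40, chunk=10):
    """Drop the oldest messages in whole chunks of `chunk`.

    Trimming one message per turn would invalidate the cached prefix on
    every request; cutting `chunk` messages at once invalidates it only
    once per `chunk` turns, so the turns in between still hit the cache.
    """
    if len(history) <= max_messages:
        return history
    excess = len(history) - max_messages
    # Round the cut up to a whole chunk so the remaining prefix stays
    # stable for the next several turns.
    cut = ((excess + chunk - 1) // chunk) * chunk
    return history[cut:]
```

Calling `trim_history` before `build_messages_with_cache` on each turn keeps the window bounded while amortizing cache re-writes.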
Cache Invalidation: Any change to the cached prefix invalidates the cache. Even adding a single character to your system prompt creates a new cache entry. Use version-controlled prompts to avoid accidental invalidation.
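A lightweight guard for version-controlled prompts is to fingerprint the exact bytes the API will see and compare against the fingerprint you recorded at deploy time, so an accidental edit that would bust the cache is caught at startup. A sketch using a SHA-256 digest; the registry and version name are illustrative:

```python
import hashlib

def prompt_fingerprint(prompt: str) -> str:
    """Stable short fingerprint of the exact prompt bytes sent to the API."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

# Illustrative registry: record the fingerprint of each deployed prompt,
# then verify at startup that nothing changed it since.
DEPLOYED = {"support-v3": prompt_fingerprint("You are a support assistant.")}

def check_prompt(name: str, prompt: str) -> bool:
    """True if the prompt still matches its recorded fingerprint."""
    return DEPLOYED.get(name) == prompt_fingerprint(prompt)
```

Even a single added trailing space produces a different fingerprint, mirroring how the cache itself treats any byte-level change as a new prefix.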