Advanced Best Practices

This lesson consolidates production-ready guidelines for implementing prompt caching effectively. Follow these practices to maximize cache hit rates, minimize costs, and avoid common pitfalls.

Prompt Ordering Rules

The Caching Order Principle:
  1. Static content first (system prompt, tool definitions)
  2. Semi-static content next (retrieved documents, examples)
  3. Dynamic content last (conversation history, user message)
Any change to the prefix invalidates everything after it. Stable content must come before variable content.
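One way to apply this ordering is a small helper that assembles the request in prefix-stable order. This is a sketch against Anthropic's Messages API field names (`system`, `tools`, `messages`, `cache_control`); the model name and the `build_request` helper are illustrative, not part of any SDK.

```python
def build_request(system_prompt, tools, documents, history, user_message):
    """Order content from most to least stable so the prefix stays cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",  # illustrative model id
        "max_tokens": 1024,
        # 1. Static: tool definitions and system prompt.
        "tools": tools,
        "system": [
            {"type": "text", "text": system_prompt},
            # 2. Semi-static: retrieved documents. The cache breakpoint
            #    goes on the LAST stable block, caching everything above it.
            {
                "type": "text",
                "text": "\n\n".join(documents),
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # 3. Dynamic: conversation history and the new user message come last.
        "messages": history + [{"role": "user", "content": user_message}],
    }
```

Because the breakpoint sits after the documents, editing the user message or history never invalidates the cached prefix.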

Cache TTL Management

  • Keep caches alive: For Anthropic, each cache hit refreshes the 5-minute TTL. Design your system to send requests frequently enough to keep important caches warm.
  • Warm caches on startup: Send a lightweight request to create cache entries before real user traffic arrives.
  • Monitor expiry patterns: Track when cache misses spike to identify TTL-related issues.
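The keep-alive and warm-on-startup points above can be sketched as a small tracker that decides when a lightweight request is due. The class name and the 60-second safety margin are assumptions for illustration; only the 5-minute TTL comes from the text.

```python
import time

CACHE_TTL_SECONDS = 5 * 60   # Anthropic refreshes the 5-minute TTL on each hit
SAFETY_MARGIN_SECONDS = 60   # re-warm well before expiry (assumed margin)

class CacheKeeper:
    """Tracks the last cache hit and decides when a keep-alive request is due."""

    def __init__(self):
        self.last_hit = None

    def record_hit(self, now=None):
        self.last_hit = now if now is not None else time.time()

    def needs_warming(self, now=None):
        now = now if now is not None else time.time()
        if self.last_hit is None:
            return True  # cold start: warm before real traffic arrives
        return (now - self.last_hit) > (CACHE_TTL_SECONDS - SAFETY_MARGIN_SECONDS)
```

A background task can poll `needs_warming()` and send the lightweight warming request only when the cache is at risk of expiring.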

Production Checklist

Category      Action Item                                                    Priority
Architecture  Structure prompts with static content first                    Critical
Architecture  Version-control system prompts to prevent accidental changes   High
Monitoring    Track cache hit rate, tokens saved, and cost savings           High
Monitoring    Alert when cache hit rate drops below threshold                Medium
Cost          Calculate expected vs. actual savings weekly                   High
Reliability   Implement cache warming on application startup                 Medium
Testing       Verify cache behavior in staging before production             High
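For the weekly expected-vs-actual comparison, a sketch of the cost arithmetic follows. The multipliers are illustrative defaults (cache writes at roughly 1.25x base input price, cache reads at roughly 0.1x); verify them against your provider's current pricing before relying on the numbers.

```python
def caching_cost(usage, base_price_per_mtok, write_mult=1.25, read_mult=0.10):
    """Input-token cost with caching vs. what the same tokens would cost uncached.

    `usage` mirrors the API's usage fields: input_tokens,
    cache_creation_input_tokens, cache_read_input_tokens.
    """
    per_tok = base_price_per_mtok / 1_000_000
    cached = (
        usage["input_tokens"] * per_tok
        + usage["cache_creation_input_tokens"] * per_tok * write_mult
        + usage["cache_read_input_tokens"] * per_tok * read_mult
    )
    uncached = (
        usage["input_tokens"]
        + usage["cache_creation_input_tokens"]
        + usage["cache_read_input_tokens"]
    ) * per_tok
    return {"cached": cached, "uncached": uncached, "savings": uncached - cached}
```

Summing these per-request figures over a week gives the "expected vs. actual savings" number the checklist calls for.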

Common Mistakes

Timestamps in Prompts

Including "Current date: March 15, 2026" in your system prompt changes it every day, invalidating the cache. Move dynamic data to the user message.
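A minimal sketch of the fix: keep the system prompt constant and inject the date into the user turn instead. The prompt text and helper name are illustrative.

```python
from datetime import date

SYSTEM_PROMPT = "You are a helpful assistant."  # stable: no date, stays cacheable

def build_messages(user_text, today=None):
    """Inject the current date into the user turn so the system prompt never changes."""
    today = today or date.today().isoformat()
    return [{"role": "user", "content": f"Current date: {today}\n\n{user_text}"}]
```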

Randomized Examples

Shuffling few-shot examples changes the prefix each time. Use a fixed order for examples, or place them after a cache breakpoint.
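One way to guarantee a stable prefix is to serialize the examples in a deterministic order, regardless of how they arrive. A sketch, assuming each example is a dict with `input` and `output` keys:

```python
import json

def stable_example_block(examples):
    """Serialize few-shot examples deterministically so the prefix never shifts."""
    ordered = sorted(examples, key=lambda ex: (ex["input"], ex["output"]))
    return json.dumps(ordered, sort_keys=True)
```

Two requests built from the same example set now produce byte-identical text, even if upstream code shuffles the list.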

Below Minimum Threshold

If your cacheable content is below 1,024 tokens, caching will not activate. Combine short prompts or add relevant static context to meet the threshold.
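A rough pre-flight check can flag prompts that are too short to cache. The ~4 characters-per-token heuristic is an assumption for English text; the authoritative count always comes from the API's usage fields.

```python
def likely_cacheable(text, min_tokens=1024, chars_per_token=4):
    """Rough estimate (~4 chars/token heuristic) of whether `text` meets the
    minimum cacheable length. Only a pre-check, not an exact token count."""
    return len(text) / chars_per_token >= min_tokens
```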

Ignoring Cache Metrics

Not monitoring cache_read_input_tokens means you have no visibility into whether caching is actually working. Always log and track these values.
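A minimal accumulator for these values might look like the following; the `CacheMetrics` name is illustrative, but the usage field names match what Anthropic's API returns.

```python
class CacheMetrics:
    """Accumulates usage fields from responses to expose a cache hit rate."""

    def __init__(self):
        self.read = self.written = self.uncached = 0

    def record(self, usage):
        self.read += usage.get("cache_read_input_tokens", 0)
        self.written += usage.get("cache_creation_input_tokens", 0)
        self.uncached += usage.get("input_tokens", 0)

    @property
    def hit_rate(self):
        total = self.read + self.written + self.uncached
        return self.read / total if total else 0.0
```

Feeding every response's usage dict into `record()` gives a running hit rate you can log, dashboard, and alert on.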

Debugging Cache Misses

Debugging Checklist
If cache_read_input_tokens is 0:

1. Check: Is cacheable content above the minimum threshold?
   - Claude Sonnet/Opus: 1,024 tokens
   - Claude Haiku: 2,048 tokens

2. Check: Has the prompt prefix changed since last request?
   - Compare exact bytes of system prompt
   - Look for whitespace changes, encoding differences

3. Check: Has the cache expired?
   - TTL is 5 minutes for Anthropic
   - Ensure requests are frequent enough

4. Check: Are you using the same model?
   - Cache entries are model-specific

5. Check: Did you include cache_control? (Anthropic only)
   - Verify cache_control is on the right content block
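For step 2, comparing exact bytes by hand is tedious; fingerprinting the prefix makes drift obvious. A sketch, with assumed helper names:

```python
import hashlib

def prefix_fingerprint(prefix_text: str) -> str:
    """Hash the exact bytes of the prompt prefix; any change here explains a miss."""
    return hashlib.sha256(prefix_text.encode("utf-8")).hexdigest()

def explain_miss(prev_fp, prefix_text):
    """Return (fingerprint, diagnosis) for a request that reported zero cache reads."""
    fp = prefix_fingerprint(prefix_text)
    if prev_fp is not None and fp != prev_fp:
        return fp, "prefix changed since last request (check whitespace/encoding)"
    return fp, "prefix unchanged; check thresholds, TTL, model, and cache_control"
```

Logging the fingerprint alongside each request lets you pinpoint exactly which deploy or code path started mutating the prefix.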

Final Tip: Start simple. Add caching to your system prompt first, measure the savings, then expand to tool definitions and conversation history. Incremental adoption lets you validate savings at each step.