Advanced Best Practices
This lesson consolidates production-ready guidelines for implementing prompt caching effectively. Follow these practices to maximize cache hit rates, minimize costs, and avoid common pitfalls.
Prompt Ordering Rules
- Static content first (system prompt, tool definitions)
- Semi-static content next (retrieved documents, examples)
- Dynamic content last (conversation history, user message)
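The ordering above can be sketched as a request payload in the shape of Anthropic's Messages API. This is a minimal illustration, not a drop-in implementation: the tool list, system prompt, and helper name are placeholders, and the model string is an assumption. The key idea is that the last static block carries `cache_control`, so everything up to and including it is cached as one prefix, while the dynamic messages stay uncached at the end.

```python
# Hypothetical static content -- in a real app these would be your actual
# tool definitions and system prompt, kept byte-identical across requests.
STATIC_TOOLS = [{
    "name": "search_docs",
    "description": "Search the product documentation.",
    "input_schema": {"type": "object", "properties": {}},
}]
SYSTEM_PROMPT = "You are a support assistant for the Acme API."

def build_request(retrieved_docs: str, history: list, user_msg: str) -> dict:
    """Assemble a request with static content first and dynamic content last."""
    return {
        "model": "claude-sonnet-4-5",          # assumed model name
        "max_tokens": 1024,
        "tools": STATIC_TOOLS,                 # 1. static: tool definitions
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT},   # 1. static: system prompt
            {
                "type": "text",
                "text": retrieved_docs,                # 2. semi-static: documents
                # Cache breakpoint: everything up to here is the cached prefix.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # 3. dynamic: conversation history and the new user message go last.
        "messages": history + [{"role": "user", "content": user_msg}],
    }
```

Because the breakpoint sits after the retrieved documents, a new user message changes only the uncached tail of the request.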
Cache TTL Management
- Keep caches alive: For Anthropic, each cache hit refreshes the 5-minute TTL. Design your system to send requests frequently enough to keep important caches warm.
- Warm caches on startup: Send a lightweight request to create cache entries before real user traffic arrives.
- Monitor expiry patterns: Track when cache misses spike to identify TTL-related issues.
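A simple way to combine the first two points is a background keep-warm loop. The sketch below assumes a client object exposing `messages.create(**kwargs)` (as the Anthropic Python SDK does) and re-sends the cached prefix with a 1-token completion every four minutes, comfortably inside the 5-minute TTL. Interval and request shape are illustrative assumptions.

```python
import threading

def warm_cache(client, warm_request: dict, interval_s: float = 240.0) -> threading.Timer:
    """Send a minimal request now, then reschedule itself every interval_s
    seconds so the cache entry's 5-minute TTL never lapses.

    warm_request should contain the same cached prefix (system, tools) as
    production traffic; max_tokens is forced to 1 to keep the call cheap.
    """
    request = {**warm_request, "max_tokens": 1}   # full prefix, minimal output
    client.messages.create(**request)
    timer = threading.Timer(interval_s, warm_cache, args=(client, warm_request, interval_s))
    timer.daemon = True   # don't block process shutdown
    timer.start()
    return timer          # call .cancel() on the returned timer to stop warming
```

In production you would likely also catch and log API errors inside the loop so one failed warming call does not stop the schedule.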
Production Checklist
| Category | Action Item | Priority |
|---|---|---|
| Architecture | Structure prompts with static content first | Critical |
| Architecture | Version-control system prompts to prevent accidental changes | High |
| Monitoring | Track cache hit rate, tokens saved, and cost savings | High |
| Monitoring | Alert when cache hit rate drops below threshold | Medium |
| Cost | Calculate expected vs actual savings weekly | High |
| Reliability | Implement cache warming on application startup | Medium |
| Testing | Verify cache behavior in staging before production | High |
Common Mistakes
Timestamps in Prompts
Including "Current date: March 15, 2026" in your system prompt changes it every day, invalidating the cache. Move dynamic data to the user message.
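A minimal sketch of the fix: keep the system prompt byte-identical and interpolate the date into the user turn, which is never part of the cached prefix. The prompt text here is a placeholder.

```python
from datetime import date

# No timestamp here, so this string is byte-identical every day.
STABLE_SYSTEM = "You are a helpful assistant."

def build_messages(question: str) -> list:
    # Dynamic data lives in the (uncached) user turn, not the system prompt.
    today = date.today().isoformat()
    return [{"role": "user", "content": f"Current date: {today}\n\n{question}"}]
```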
Randomized Examples
Shuffling few-shot examples changes the prefix each time. Use a fixed order for examples, or place them after a cache breakpoint.
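One way to guarantee a stable prefix is to render examples in a deterministic order, e.g. sorted by an id field, so the same example set always produces the same bytes regardless of how it arrived. The field names (`id`, `q`, `a`) are assumptions for illustration.

```python
def render_examples(examples: list) -> str:
    """Render few-shot examples in a deterministic order.

    Sorting by a stable key (instead of random.shuffle or dict/set ordering)
    keeps the rendered prefix byte-identical across requests, so it can cache.
    """
    ordered = sorted(examples, key=lambda e: e["id"])
    return "\n\n".join(f"Q: {e['q']}\nA: {e['a']}" for e in ordered)
```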
Below Minimum Threshold
If your cacheable content is below 1,024 tokens, caching will not activate. Combine short prompts or add relevant static context to meet the threshold.
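Before adding a cache breakpoint, it can help to estimate whether the content clears the threshold at all. The sketch below uses a rough ~4-characters-per-token heuristic for English text; for exact counts you would use a real tokenizer or the token counts the API returns.

```python
def likely_cacheable(text: str, min_tokens: int = 1024, chars_per_token: float = 4.0) -> bool:
    """Rough pre-check: is this content probably above the caching minimum?

    ~4 chars/token is a heuristic for English prose; treat the result as a
    hint, not a guarantee, and confirm with real token counts.
    """
    return len(text) / chars_per_token >= min_tokens
```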
Ignoring Cache Metrics
Not monitoring cache_read_input_tokens means you have no visibility into whether caching is actually working. Always log and track these values.
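As a starting point for that tracking, a small helper can turn the usage fields Anthropic returns (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`) into a hit-rate metric you can log or alert on. The function and metric names are illustrative.

```python
def cache_stats(usage: dict) -> dict:
    """Summarize one response's usage fields into cache metrics.

    usage is the dict-like usage object from an Anthropic API response;
    missing fields default to 0 so uncached responses are handled too.
    """
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + created + uncached
    return {
        "cache_hit_rate": read / total if total else 0.0,
        "tokens_read_from_cache": read,
        "total_input_tokens": total,
    }
```

Aggregating these per-request stats over a time window gives the hit-rate figure the monitoring checklist items above call for.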
Debugging Cache Misses
If cache_read_input_tokens is 0, work through these checks:
1. Check: Is cacheable content above minimum threshold?
- Claude Opus/Sonnet: 1,024 tokens
- Claude Haiku: 2,048 tokens
2. Check: Has the prompt prefix changed since last request?
- Compare exact bytes of system prompt
- Look for whitespace changes, encoding differences
3. Check: Has the cache expired?
- TTL is 5 minutes for Anthropic
- Ensure requests are frequent enough
4. Check: Are you using the same model?
- Cache entries are model-specific
5. Check: Did you include cache_control? (Anthropic only)
- Verify cache_control is on the right content block
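The checklist above can be partially automated. This sketch compares two request snapshots (a hypothetical `{"model", "prefix", "sent_at"}` shape you would record yourself) and reports the likely causes it can detect; it assumes Anthropic's documented minimums (1,024 tokens for Opus/Sonnet, 2,048 for Haiku), the 5-minute TTL, and the ~4-chars-per-token heuristic. Checks that need API-side knowledge, like cache_control placement, still have to be verified by hand.

```python
MIN_TOKENS = {"opus": 1024, "sonnet": 1024, "haiku": 2048}

def diagnose_miss(prev: dict, curr: dict) -> list:
    """Return likely reasons cache_read_input_tokens came back 0.

    prev/curr are snapshots of consecutive requests:
    {"model": str, "prefix": str (the cacheable bytes), "sent_at": float (unix seconds)}
    """
    reasons = []
    family = next((f for f in MIN_TOKENS if f in curr["model"]), "sonnet")
    if len(curr["prefix"]) / 4 < MIN_TOKENS[family]:        # ~4 chars/token heuristic
        reasons.append("prefix below minimum cacheable length")
    if curr["prefix"] != prev["prefix"]:
        reasons.append("prefix bytes changed since last request")
    if curr["sent_at"] - prev["sent_at"] > 300:             # 5-minute TTL
        reasons.append("previous cache entry likely expired")
    if curr["model"] != prev["model"]:
        reasons.append("model changed; cache entries are model-specific")
    return reasons
```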