Prompt caching, explained.
25 April 2026
Prompt caching is the highest-ROI optimization in the playbook: it's provider-native, needs almost no code change, and discounts the most expensive token type on your bill, long repeated input. Done right, it cuts 30–60% off the input spend of cache-eligible endpoints within a week.
How it works (in one paragraph)
The model's KV cache — the intermediate state computed when reading your prompt — is normally thrown away after each request. With prompt caching, the provider keeps that state for a few minutes and reuses it when your next request shares an identical prefix. You pay a small write fee the first time and a steeply discounted read fee on every subsequent hit. The output is identical to an uncached call.
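The "identical prefix" rule is literal: only the bytes up to the first difference between two requests can be reused. A minimal sketch (plain Python, no provider SDK) makes the boundary visible:

```python
# Two requests that share a long, stable system prompt but differ in the user turn.
SYSTEM = "You are a support agent for Acme. Follow the policy below.\n" * 50
req_a = SYSTEM + "User: How do I reset my password?"
req_b = SYSTEM + "User: Where is my invoice?"

def shared_prefix_len(a: str, b: str) -> int:
    # Length of the common prefix -- the only part a prompt cache can reuse.
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

# Everything through "User: " is cache-reusable; the questions are not.
print(shared_prefix_len(req_a, req_b) == len(SYSTEM) + len("User: "))  # True
```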
What each provider charges
| Provider · cache type | Pricing & mechanics |
|---|---|
| OpenAI · cache-read | ~50% off input price · 5–10 min TTL · auto on prompts ≥ 1024 tokens |
| Anthropic · ephemeral cache | ~90% off input · 5 min TTL · 25% write premium · explicit `cache_control` block |
| Anthropic · extended cache | ~90% off input · 1 hour TTL · 100% write premium · explicit beta header |
| AWS Bedrock (Anthropic on Bedrock) | matches Anthropic terms · region availability varies |
| Google Vertex (Gemini) | context cache · per-token-per-hour storage fee + ~75% read discount · explicit cache resource |
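Using the Anthropic ephemeral numbers from the table (1.25× base price to write, 0.10× to read), the break-even point falls out of simple arithmetic. A sketch with an assumed base price of $3 per million input tokens:

```python
# Illustrative cost model for a shared prefix, using the multipliers above.
BASE = 3.00 / 1_000_000   # assumed $3 / M input tokens; substitute your model's rate
WRITE_MULT, READ_MULT = 1.25, 0.10

def cost_cached(prefix_tokens: int, hits: int) -> float:
    # One cache write, then `hits` discounted reads within the TTL.
    return prefix_tokens * BASE * (WRITE_MULT + hits * READ_MULT)

def cost_uncached(prefix_tokens: int, requests: int) -> float:
    # Full input price on every request.
    return prefix_tokens * BASE * requests

# With a 10k-token prefix, caching already wins by the second request:
# 1.25x + 0.10x = 1.35x total vs. 2.00x uncached.
print(cost_cached(10_000, hits=1) < cost_uncached(10_000, requests=2))  # True
```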
How to structure prompts so the cache hits
- Static content first, dynamic content last. The cache matches a prefix. Anything before the first byte of difference is cacheable; everything after is not.
- Move tools, system prompt, and few-shot examples into the prefix. These are the largest static blocks in most production prompts.
- Pin the user message, retrieved docs, and timestamps to the suffix. They change per request anyway.
- Don't interpolate the current date or timestamp into the system prompt. One unstable byte near the top of the prompt invalidates every byte after it.
- Anthropic-only: add `cache_control: {type: "ephemeral"}` to the last block you want cached. Up to 4 break-points.
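Putting these rules together, an Anthropic-style request body looks roughly like this (a sketch: the model id, prompt text, and variable names are placeholders, not prescribed values):

```python
# Static blocks first; cache_control on the last static block; dynamic content last.
LONG_SYSTEM_PROMPT = "You are a support agent for Acme. Policy: ..."  # stable across requests
user_question = "How do I reset my password?"                          # changes per request

request = {
    "model": "claude-sonnet-4-20250514",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # no dates, no user IDs -- keep it byte-stable
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    "messages": [
        # Dynamic suffix: never part of the cached prefix.
        {"role": "user", "content": user_question},
    ],
}
```

Tools and few-shot examples, if present, would sit before the `cache_control` breakpoint for the same reason.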
Common mistakes that silently disable caching
- Trailing whitespace from a templating engine breaking byte-equality.
- Random tool order when serializing a `tools` array — sort it.
- JSON key order drift from a non-deterministic serializer.
- User ID or session ID embedded in the system prompt for "personalization." Move it to the user message.
- Agent loops that rebuild the conversation from scratch each turn instead of appending to a byte-stable prefix. Check your SDK.
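Most of these failures come down to byte-equality, so they yield to the same fix: canonicalize before you serialize. A defensive sketch (the tool definitions are illustrative):

```python
import json

def canonical_tools(tools: list[dict]) -> list[dict]:
    # Fix the serialization order: sort by tool name.
    return sorted(tools, key=lambda t: t["name"])

def canonical_json(obj) -> str:
    # sort_keys removes key-order drift; fixed separators remove stray whitespace.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

tools = [
    {"name": "search_docs", "description": "Search the knowledge base."},
    {"name": "get_order", "description": "Look up an order by id."},
]

# Same tools, different in-memory order -> byte-identical serialization.
a = canonical_json(canonical_tools(tools))
b = canonical_json(canonical_tools(list(reversed(tools))))
print(a == b)  # True
```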
The cache-read token baseline trap
Cache-read tokens are billed and reported separately from input tokens on Anthropic. Many teams set their pre-engagement baseline by summing all input-tokens lines and miss the cache-read line entirely — then "savings" mysteriously appear that are really just a column shift. Always reconcile against the raw provider invoice with cache-read as a distinct line item. We wrote up the trap in detail in "Cache-read tokens: the baseline trap."
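The reconciliation is mechanical once you know the fields. A sketch assuming Anthropic's usage fields (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`) with illustrative numbers:

```python
# Usage lines as they might appear across a billing period (invented numbers).
usage_lines = [
    {"input_tokens": 2_000, "cache_creation_input_tokens": 8_000, "cache_read_input_tokens": 0},
    {"input_tokens": 2_000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8_000},
    {"input_tokens": 2_000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8_000},
]

# The trap: summing only input_tokens undercounts what the model actually read.
naive_baseline = sum(u["input_tokens"] for u in usage_lines)

# The fix: count every token the model processed, cached or not, as its own line item.
true_baseline = sum(
    u["input_tokens"] + u["cache_creation_input_tokens"] + u["cache_read_input_tokens"]
    for u in usage_lines
)

print(naive_baseline, true_baseline)  # 6000 30000 -- a 5x column shift, not savings
```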
When prompt caching doesn't help
- Single-shot prompts with no shared prefix between users.
- Hot-reloading the system prompt on every deploy. Pin it.
- Prompts under the minimum token threshold (OpenAI: 1024 tokens; Anthropic: 1024 for Sonnet, 2048 for Haiku, 1024 for Opus).
- Workloads bursting then idling for hours. Use Anthropic extended (1-hr) cache or accept the cold-start cost.
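For the threshold case, a small guard avoids paying a write premium on a prefix the provider will refuse to cache anyway. A sketch, assuming you already have a token count for the prefix (counting is model-specific):

```python
# Minimum cacheable prefix sizes, per the provider notes above.
MIN_CACHEABLE = {
    "openai": 1024,
    "anthropic-sonnet": 1024,
    "anthropic-haiku": 2048,
    "anthropic-opus": 1024,
}

def worth_caching(prefix_tokens: int, target: str) -> bool:
    # Below the threshold the provider won't cache at all, so skip the attempt.
    return prefix_tokens >= MIN_CACHEABLE[target]

print(worth_caching(900, "openai"), worth_caching(3000, "anthropic-haiku"))  # False True
```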