Prompt caching, explained.
25 April 2026
Prompt caching is the highest-ROI optimization in the playbook: it's provider-native, needs almost no code change, and discounts the most expensive token type on your bill, long repeated input. Done right, it cuts 30–60% off the input spend of cache-eligible endpoints within a week.
How it works (in one paragraph)
The model's KV cache — the intermediate state computed when reading your prompt — is normally thrown away after each request. With prompt caching, the provider keeps that state for a few minutes and reuses it when your next request shares an identical prefix. You pay a small write fee the first time and a steeply discounted read fee on every subsequent hit. The output is identical to an uncached call.
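The "identical prefix" rule is literal: only the bytes up to the first difference between two requests can be reused. A minimal sketch (plain Python, no provider SDK) makes the boundary visible:

```python
# Two requests that share a long, stable system prompt but differ in the user turn.
SYSTEM = "You are a support agent for Acme. Follow the policy below.\n" * 50
req_a = SYSTEM + "User: How do I reset my password?"
req_b = SYSTEM + "User: Where is my invoice?"

def shared_prefix_len(a: str, b: str) -> int:
    # Length of the common prefix -- the only part a prompt cache can reuse.
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

# Everything through "User: " is cache-reusable; the questions are not.
print(shared_prefix_len(req_a, req_b) == len(SYSTEM) + len("User: "))  # True
```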
What each provider charges
| Provider · cache type | Pricing & mechanics |
|---|---|
| OpenAI · cache-read | ~50% off input price · 5–10 min TTL · auto on prompts ≥ 1024 tokens |
| Anthropic · ephemeral cache | ~90% off input · 5 min TTL · 25% write premium · explicit `cache_control` block |
| Anthropic · extended cache | ~90% off input · 1 hour TTL · 100% write premium · explicit beta header |
| AWS Bedrock (Anthropic on Bedrock) | matches Anthropic terms · region availability varies |
| Google Vertex (Gemini) | context cache · per-token-per-hour storage fee + ~75% read discount · explicit cache resource |
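Using the Anthropic ephemeral numbers from the table (1.25× base price to write, 0.10× to read), the break-even point falls out of simple arithmetic. A sketch with an assumed base price of $3 per million input tokens:

```python
# Illustrative cost model for a shared prefix, using the multipliers above.
BASE = 3.00 / 1_000_000   # assumed $3 / M input tokens; substitute your model's rate
WRITE_MULT, READ_MULT = 1.25, 0.10

def cost_cached(prefix_tokens: int, hits: int) -> float:
    # One cache write, then `hits` discounted reads within the TTL.
    return prefix_tokens * BASE * (WRITE_MULT + hits * READ_MULT)

def cost_uncached(prefix_tokens: int, requests: int) -> float:
    # Full input price on every request.
    return prefix_tokens * BASE * requests

# With a 10k-token prefix, caching already wins by the second request:
# 1.25x + 0.10x = 1.35x total vs. 2.00x uncached.
print(cost_cached(10_000, hits=1) < cost_uncached(10_000, requests=2))  # True
```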
How to structure prompts so the cache hits
- Static content first, dynamic content last. The cache matches a prefix. Anything before the first byte of difference is cacheable; everything after is not.
- Move tools, system prompt, and few-shot examples into the prefix. These are the largest static blocks in most production prompts.
- Pin the user message, retrieved docs, and timestamps to the suffix. They change per request anyway.
- Don't interpolate the current date or timestamp into the system prompt. One unstable byte near the top of the prompt invalidates every byte after it.
- Anthropic-only: add `cache_control: {type: "ephemeral"}` to the last block you want cached. Up to 4 break-points.
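Putting these rules together, an Anthropic-style request body looks roughly like this (a sketch: the model id, prompt text, and variable names are placeholders, not prescribed values):

```python
# Static blocks first; cache_control on the last static block; dynamic content last.
LONG_SYSTEM_PROMPT = "You are a support agent for Acme. Policy: ..."  # stable across requests
user_question = "How do I reset my password?"                          # changes per request

request = {
    "model": "claude-sonnet-4-20250514",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # no dates, no user IDs -- keep it byte-stable
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    "messages": [
        # Dynamic suffix: never part of the cached prefix.
        {"role": "user", "content": user_question},
    ],
}
```

Tools and few-shot examples, if present, would sit before the `cache_control` breakpoint for the same reason.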
Common mistakes that silently disable caching
- Trailing whitespace from a templating engine breaking byte-equality.
- Random tool order when serializing a `tools` array — sort it.
- JSON key order drift from a non-deterministic serializer.
- User ID or session ID embedded in the system prompt for "personalization." Move it to the user message.
- Agent loops that rebuild the conversation from scratch each turn instead of appending to a byte-stable prefix. Check your SDK.
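Most of these failures come down to byte-equality, so they yield to the same fix: canonicalize before you serialize. A defensive sketch (the tool definitions are illustrative):

```python
import json

def canonical_tools(tools: list[dict]) -> list[dict]:
    # Fix the serialization order: sort by tool name.
    return sorted(tools, key=lambda t: t["name"])

def canonical_json(obj) -> str:
    # sort_keys removes key-order drift; fixed separators remove stray whitespace.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

tools = [
    {"name": "search_docs", "description": "Search the knowledge base."},
    {"name": "get_order", "description": "Look up an order by id."},
]

# Same tools, different in-memory order -> byte-identical serialization.
a = canonical_json(canonical_tools(tools))
b = canonical_json(canonical_tools(list(reversed(tools))))
print(a == b)  # True
```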
The cache-read token baseline trap
Cache-read tokens are billed and reported separately from input tokens on Anthropic. Many teams set their pre-engagement baseline by summing all input-tokens lines and miss the cache-read line entirely — then "savings" mysteriously appear that are really just a column shift. Always reconcile against the raw provider invoice with cache-read as a distinct line item. We wrote up the trap in detail in "Cache-read tokens: the baseline trap."
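The reconciliation is mechanical once you know the fields. A sketch assuming Anthropic's usage fields (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`) with illustrative numbers:

```python
# Usage lines as they might appear across a billing period (invented numbers).
usage_lines = [
    {"input_tokens": 2_000, "cache_creation_input_tokens": 8_000, "cache_read_input_tokens": 0},
    {"input_tokens": 2_000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8_000},
    {"input_tokens": 2_000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8_000},
]

# The trap: summing only input_tokens undercounts what the model actually read.
naive_baseline = sum(u["input_tokens"] for u in usage_lines)

# The fix: count every token the model processed, cached or not, as its own line item.
true_baseline = sum(
    u["input_tokens"] + u["cache_creation_input_tokens"] + u["cache_read_input_tokens"]
    for u in usage_lines
)

print(naive_baseline, true_baseline)  # 6000 30000 -- a 5x column shift, not savings
```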
When prompt caching doesn't help
- Single-shot prompts with no shared prefix between users.
- Hot-reloading the system prompt on every deploy. Pin it.
- Prompts under the minimum token threshold (OpenAI: 1024 tokens; Anthropic: 1024 for Sonnet, 2048 for Haiku, 1024 for Opus).
- Workloads bursting then idling for hours. Use Anthropic extended (1-hr) cache or accept the cold-start cost.
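For the threshold case, a small guard avoids paying a write premium on a prefix the provider will refuse to cache anyway. A sketch, assuming you already have a token count for the prefix (counting is model-specific):

```python
# Minimum cacheable prefix sizes, per the provider notes above.
MIN_CACHEABLE = {
    "openai": 1024,
    "anthropic-sonnet": 1024,
    "anthropic-haiku": 2048,
    "anthropic-opus": 1024,
}

def worth_caching(prefix_tokens: int, target: str) -> bool:
    # Below the threshold the provider won't cache at all, so skip the attempt.
    return prefix_tokens >= MIN_CACHEABLE[target]

print(worth_caching(900, "openai"), worth_caching(3000, "anthropic-haiku"))  # False True
```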