RESEARCH · TECHNIQUE

Prompt caching, explained.

25 April 2026

By the LLM CFO team

Prompt caching is the highest-ROI optimization in the playbook because it's provider-native, requires almost no code change, and discounts the most expensive token type on your bill: long, repeated input. Done right, you can cut 30–60% of spend on cache-eligible endpoints within a week.

How it works (in one paragraph)

The model's KV cache — the intermediate state computed when reading your prompt — is normally thrown away after each request. With prompt caching, the provider keeps that state for a few minutes and reuses it when your next request shares an identical prefix. You pay a small write fee the first time and a steeply discounted read fee on every subsequent hit. The output is identical to an uncached call.
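The write-fee/read-fee trade-off above can be sketched as a back-of-envelope cost model. The prices and the 25%/90% premium/discount figures below are illustrative assumptions (loosely modeled on the Anthropic terms in the table that follows), not authoritative rates:

```python
# Back-of-envelope cost model for cached vs. uncached prompts.
# All rates below are illustrative assumptions, not real provider prices.
PRICE_PER_TOKEN = 3.00 / 1e6   # $3 per million input tokens (assumed)
WRITE_PREMIUM = 1.25           # e.g. a 25% cache-write premium
READ_RATE = 0.10               # cache reads at ~10% of the input price

def cost_uncached(prefix_tokens: int, suffix_tokens: int, requests: int) -> float:
    """Every request pays full input price on the whole prompt."""
    return (prefix_tokens + suffix_tokens) * requests * PRICE_PER_TOKEN

def cost_cached(prefix_tokens: int, suffix_tokens: int, requests: int) -> float:
    """First request writes the prefix at a premium; later requests
    (assumed to land inside the TTL) read it at the discounted rate.
    The dynamic suffix is never cached and always pays full price."""
    write = prefix_tokens * WRITE_PREMIUM * PRICE_PER_TOKEN
    reads = prefix_tokens * (requests - 1) * READ_RATE * PRICE_PER_TOKEN
    suffix = suffix_tokens * requests * PRICE_PER_TOKEN
    return write + reads + suffix

# 8k-token static prefix, 500-token dynamic suffix, 1000 requests in the window:
print(round(cost_uncached(8000, 500, 1000), 2))  # 25.5
print(round(cost_cached(8000, 500, 1000), 2))    # 3.93
```

The shape of the result is the point: savings scale with how large the static prefix is relative to the dynamic suffix, and with how many requests hit the cache before the TTL expires.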

What each provider charges

| Provider cache | Mechanics |
| --- | --- |
| OpenAI · cache-read | ~50% off input price · 5–10 min TTL · auto on prompts ≥ 1024 tokens |
| Anthropic · ephemeral cache | ~90% off input · 5 min TTL · 25% write premium · explicit `cache_control` block |
| Anthropic · extended cache | ~90% off input · 1 hour TTL · 100% write premium · explicit beta header |
| AWS Bedrock (Anthropic on Bedrock) | matches Anthropic terms · region availability varies |
| Google Vertex (Gemini) | context cache · per-token-per-hour storage fee + ~75% read discount · explicit cache resource |

How to structure prompts so the cache hits

  1. Static content first, dynamic content last. The cache matches a prefix. Anything before the first byte of difference is cacheable; everything after is not.
  2. Move tools, system prompt, and few-shot examples into the prefix. These are the largest static blocks in most production prompts.
  3. Pin the user message, retrieved docs, and timestamps to the suffix. They change per request anyway.
  4. Don't interpolate the date into the system prompt. One unstable byte per day kills a year of cache hits.
  5. Anthropic-only: add `cache_control: {type: "ephemeral"}` to the last block you want cached. Up to 4 break-points.
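Rules 1–5 can be sketched as a request builder. This is a minimal sketch of the Anthropic Messages API request shape; `SYSTEM_PROMPT`, `EXAMPLES`, and the model id are hypothetical placeholders, not the author's production values:

```python
# Sketch of a cache-friendly Anthropic request body.
# SYSTEM_PROMPT and EXAMPLES are hypothetical placeholders.
SYSTEM_PROMPT = "You are a billing-analysis assistant."
EXAMPLES = "Q: example question\nA: example answer"

def build_request(user_query: str, retrieved_docs: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # assumed model id
        "max_tokens": 1024,
        # Static prefix: system prompt + few-shot examples.
        # Note: no date interpolation here (rule 4).
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT + "\n\n" + EXAMPLES,
                # Cache break-point: everything up to and including
                # this block is eligible for caching (rule 5).
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic suffix: retrieved docs and the user message change
        # per request, so they stay after the break-point (rule 3).
        "messages": [
            {"role": "user", "content": f"{retrieved_docs}\n\n{user_query}"}
        ],
    }

req = build_request("What drove April spend?", "<doc>invoice excerpt</doc>")
```

The key property is that the bytes before the `cache_control` break-point are identical across requests, so every call after the first reads the prefix from cache.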

Common mistakes that silently disable caching

The cache-read token baseline trap

Cache-read tokens are billed and reported separately from input tokens on Anthropic. Many teams set their pre-engagement baseline by summing all input-tokens lines and miss the cache-read line entirely — then "savings" mysteriously appear that are really just a column shift. Always reconcile against the raw provider invoice with cache-read as a distinct line item. We wrote up the trap in detail: cache-read tokens: the baseline trap.
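The reconciliation can be sketched in a few lines. The field names follow Anthropic's per-request usage object (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); the rows themselves are made-up example data:

```python
# Baseline reconciliation sketch: sum ALL input-side token columns,
# not just `input_tokens`. Rows are fabricated example usage records.
usage_rows = [
    {"input_tokens": 500, "cache_creation_input_tokens": 8000, "cache_read_input_tokens": 0},
    {"input_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8000},
    {"input_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8000},
]

# The trap: this is what a naive baseline sums.
naive_baseline = sum(r["input_tokens"] for r in usage_rows)

# The correct baseline counts every token the model actually read.
true_input = sum(
    r["input_tokens"]
    + r["cache_creation_input_tokens"]
    + r["cache_read_input_tokens"]
    for r in usage_rows
)

print(naive_baseline, true_input)  # 1500 25500
```

On this toy data the naive baseline undercounts input by ~17×, which is exactly the "column shift" that later gets misread as savings.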

When prompt caching doesn't help
