← home
RESEARCH · CACHING

Cache invalidation cost.

Published · 21 May 2026

By the LLM CFO team

Prompt caching dashboards love to show one number: hit rate. A green 85% bar, a friendly downward-trending cost line. What they almost never show is the cost of every miss that a deploy caused. Every system-prompt edit, every tool-schema rename, every model swap throws away the warm prefix and forces the next wave of traffic to pay full input price. That rebuild cost is the missing line item — and for teams shipping weekly, it can quietly eat most of the savings the cache was supposed to deliver.

Why invalidation cost is the missing line item

The standard way to report on a prompt cache is steady-state. You pick a 7-day or 30-day window, divide cache-read tokens by total input tokens, and call the resulting ratio the hit rate. It is a useful number. It is not the whole story.

What it hides is the shape of the miss curve over time. A cache that runs at 85% hit rate on average can spend Monday at 40% because a prompt change went out Friday afternoon and the warm prefix is being rebuilt across every tenant. Average that with a calm weekend and the dashboard still looks fine. The bill does not.

The asymmetry is the whole point of caching: a hit costs a fraction of a miss. On OpenAI, cache-read tokens are billed at roughly half the input price; on Anthropic, cache reads are closer to a tenth of base input. When that ratio flips for a few hours, you are not slightly more expensive — you are two to ten times more expensive on every affected request. If the dashboard does not separate "stable" traffic from "post-deploy" traffic, you cannot see when that flip happened or how long it lasted.

Three invalidation triggers that cost real money

Every major provider's prompt cache keys on the exact byte sequence of the prefix plus the model and a few request-level parameters. Change any of those and the cache turns over. In practice, three triggers do almost all the damage.

System-prompt edits. The most common cause and the most underestimated. A one-line tweak to a system prompt — a new instruction, a fixed typo, a tightened refusal — invalidates the cached prefix for every tenant that uses it. The next request from each tenant pays full input price for the entire system prompt plus any few-shot examples that sit before the user message. Teams that hot-edit prompts daily are paying this tax daily.

Schema and tool-definition changes. Tool-using agents are especially sensitive. The tool definitions are usually serialized at the top of the prompt, just below the system message. Rename a parameter, add an enum value, reorder the tools array — any of these changes the byte sequence and breaks the cache. Agents with twenty or thirty tools can carry several thousand tokens of definition; a schema bump can be as expensive as a system-prompt rewrite.

Model swaps. Prompt caches are scoped per-model on every major provider. Moving from a mid-tier model to a frontier model — or rolling a minor version forward — starts you cold. The same is true in reverse: routing a slice of traffic to a cheaper model to save money will, on day one, look more expensive than it should because the cache for that model is empty.

The rebuild cost math

A small example makes the size of the bill concrete. Numbers are illustrative; plug in your own prefix size and traffic.

Suppose your stable prefix is 8,000 tokens — a 2,500-token system prompt, a 3,500-token tool catalogue, and 2,000 tokens of canonical few-shot examples. You run 1,000 requests per day against it. At a representative OpenAI input price of $2.50 per million tokens and a cache-read price of $1.25 per million, the steady-state cost of the prefix is:

1,000 requests × 8,000 tokens × $1.25 / 1,000,000 = $10/day on the prefix.

Now you ship a system-prompt edit. The first request after deploy pays the full input price to write the prefix back into the cache. Provider caches are not strictly "one request rebuilds it for everyone" — most providers let the warm prefix amortize across concurrent users, but each fresh cache key still needs at least one priming request, and traffic in the first few minutes after deploy often catches the cold path before the cache propagates across replicas.

If, conservatively, the first 200 requests after deploy land cold (a realistic spread across regions, replicas, and tenants), those 200 requests pay $2.50/M instead of $1.25/M for 8,000 tokens each:

200 × 8,000 × ($2.50 − $1.25) / 1,000,000 = $2.00 extra, per deploy.

$2 looks trivial. Now multiply. Five system-prompt edits per week is $10. Add a tool-schema change every other week and a model swap a few times a year, plus the same math at multi-tenant scale where you have not one prefix but fifty (one per customer), and the same teams are looking at hundreds to low thousands of dollars per month of pure rebuild tax. Below that, increase the prefix size or the traffic and the number scales linearly. A 30,000-token agent prefix at 10,000 requests/day is in a different league.

None of this shows up as a separate row on the invoice. It shows up as input tokens that "should have been" cache-read tokens.

How often this hits in practice

The rough cadence we see across engagements:

Add them up and a team shipping at a normal pace is invalidating its prompt cache somewhere between fifty and two hundred times per year. That cadence is the thing your dashboard should be measuring against, not the average hit rate.

How to amortize the cost

You cannot eliminate invalidation — change is the point of shipping software. You can make each event cheaper.

Batch system-prompt edits. Five small edits in one release is one invalidation. Five edits across five days is five. Hold non-urgent prompt tweaks for a weekly release window and the cost drops by 4–5x without any code change.

Feature-flag and warm before flipping. When a new prompt is ready, write a tiny "warmer" path that issues a single priming request per region and per tenant with the new prompt, then flip the flag. The first real user request now hits a warm cache. Same trick for model swaps: send a synthetic priming request to the new model before you route real traffic.

Isolate volatile parts of the prompt. The cache rewards stable prefixes. Move the parts that change frequently — date stamps, A/B test instructions, per-request hints — to the suffix or to the user message. Keep the system message, tool definitions, and few-shot block byte-stable. A common pattern: a 95%-stable prefix with a small "today's instructions" block at the bottom that lives outside the cached region.

Treat the cache as a deploy artifact. Add a step to your release pipeline that warms the cache for every active tenant and every active model after a prompt change. It is the same idea as warming a CDN after a deploy. The cost of the warmer is a known, bounded line item; the cost of an unmanaged cold start is not.

Version tool definitions deliberately. If you change a tool schema, change all of them you were planning to change in the same release. Tool-catalogue churn is the most expensive form of invalidation per byte because the catalogue is usually large.

What to put on the dashboard

If you measure only steady-state hit rate, you cannot see the rebuild cost. Two panels are worth adding.

Post-deploy cache miss rate. Take each deploy as t=0 and plot the miss rate for the following 60 minutes. Compare to the rolling miss rate from the prior day. The gap is the rebuild cost in tokens. Aggregate across deploys and you have a monthly "cache tax from change" number.

Cache-read share by prefix version. Tag each prompt prefix with a version identifier and report cache-read share per version. A version that never climbs out of the cold zone is either being churned too often or has a structural problem (a non-deterministic block somewhere in the prefix) that is preventing the cache from ever taking hold.

Both panels live in the same place as the steady-state numbers, not on a separate page. The whole point is that they are part of the same cost story.

What to do next week

None of these are heavy lifts. The combined effect is usually a measurable drop in monthly LLM spend that no amount of staring at a steady-state hit rate would have surfaced.

Related

← Back to research