Cache-read tokens: the baseline trap.
25 April 2026
Cache-read tokens are billed separately from input tokens on Anthropic — and reported separately on the invoice, the API response, and most third-party dashboards. If you set a pre-engagement baseline by summing only the input-token line, you will appear to "save" 30–60% the moment caching turns on. Those savings are not real. They are a column shift. This is the single most common audit mistake we see.
How the line items lay out
An Anthropic invoice (and the `usage` block on each API response) breaks token consumption into four kinds:
| Counter | Billing meaning |
|---|---|
| input_tokens | uncached input — full price |
| cache_creation_input_tokens | cache writes — 1.25× input price (5-min) or 2× (1-hour) |
| cache_read_input_tokens | cache hits — ~0.1× input price |
| output_tokens | generated output — full output price |
Before caching is enabled, `cache_read_input_tokens` is zero and every input token is in the `input_tokens` bucket. After caching is enabled, the same prompt traffic is split: the static prefix shows up under `cache_read_input_tokens` (or `cache_creation_input_tokens` on first hit), and only the dynamic suffix is in `input_tokens`.
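As an illustration (counter names as they appear in the `usage` block; the token counts are hypothetical), the same 8,500-token prompt reports very differently before and after a warm cache hit:

```python
# Illustrative usage blocks for one request with an 8,000-token static
# prefix and a 500-token dynamic suffix.
usage_before = {               # caching off: everything is full-price input
    "input_tokens": 8500,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 0,
    "output_tokens": 300,
}
usage_after = {                # caching on, warm hit
    "input_tokens": 500,       # only the dynamic suffix
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 8000,   # the static prefix, at ~0.1x price
    "output_tokens": 300,
}

def total_prompt_tokens(u):
    """The model processed the same prompt either way; only billing shifts."""
    return (u["input_tokens"]
            + u["cache_creation_input_tokens"]
            + u["cache_read_input_tokens"])

assert total_prompt_tokens(usage_before) == total_prompt_tokens(usage_after) == 8500
```

The assertion is the whole point: total prompt tokens are identical, so any drop in `input_tokens` alone is a column shift, not a saving.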
The trap
Pre-engagement, an analyst exports usage and writes:
baseline_spend = input_tokens × input_price + output_tokens × output_price
That formula is correct only when caching is off. The moment caching is enabled — by anyone, including a default in the SDK or a colleague's experiment — `input_tokens` drops, the formula returns a smaller number, and the dashboard reports "savings." The actual cash going to Anthropic may be unchanged or even higher (cache writes carry a premium).
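A minimal sketch of the wrong and right formulas, using Sonnet 3.5 list prices ($3/M input, $15/M output) and the 5-minute cache-write multiplier; the helper names are mine:

```python
INPUT_P, OUTPUT_P = 3.00, 15.00          # dollars per million tokens
CACHE_WRITE_MULT, CACHE_READ_MULT = 1.25, 0.10

def naive_cost(u):
    """The analyst's formula. Correct only while caching is off."""
    return (u["input_tokens"] * INPUT_P
            + u["output_tokens"] * OUTPUT_P) / 1e6

def true_cost(u):
    """Every token bucket times its own price."""
    return (u["input_tokens"] * INPUT_P
            + u["cache_creation_input_tokens"] * INPUT_P * CACHE_WRITE_MULT
            + u["cache_read_input_tokens"] * INPUT_P * CACHE_READ_MULT
            + u["output_tokens"] * OUTPUT_P) / 1e6
```

The two functions agree exactly while the cache counters are zero, which is why the bug stays invisible until caching turns on.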
Concrete example
Endpoint: a support bot, 10,000 requests/day, 8,000-token system+tools prefix, ~500-token user message. Output ~300 tokens. Sonnet 3.5 list price: $3 / M input, $15 / M output.
Day 1 — caching off:
| Uncached line item | Cost |
|---|---|
| input_tokens | 10,000 × 8,500 = 85,000,000 → $255.00 |
| output_tokens | 10,000 × 300 = 3,000,000 → $45.00 |
| total | $300.00 |
Day 2 — caching on, 95% cache-read hit on the 8,000-token prefix:
| Cached line item | Cost |
|---|---|
| input_tokens (dynamic suffix only) | 10,000 × 500 = 5,000,000 → $15.00 |
| cache_creation_input_tokens (cold prefix writes) | 500 × 8,000 = 4,000,000 × 1.25 → $15.00 |
| cache_read_input_tokens (warm prefix hits) | 9,500 × 8,000 = 76,000,000 × 0.1 → $22.80 |
| output_tokens | 3,000,000 → $45.00 |
| true total | $97.80 |
Note that the cold prefix lands only in `cache_creation_input_tokens`: cache writes are billed in full at 1.25×, not as regular input plus a surcharge, so `input_tokens` holds nothing but the dynamic suffixes. The real saving is ~67%. But if the analyst's dashboard only tracks `input_tokens × input_price + output_tokens × output_price`, they see $60.00 and report an 80% saving. They are double-counting the discount because they never added the cache lines back in. When the CFO asks why the invoice doesn't match the dashboard, the engagement loses credibility.
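The whole worked example fits in a few lines of arithmetic (a sketch; every constant is a figure from the tables above):

```python
# 10,000 requests/day; 8,000-token prefix; 500-token suffix; 300-token
# output; 95% warm-hit rate on the prefix; Sonnet 3.5 list prices.
REQS, PREFIX, SUFFIX, OUT = 10_000, 8_000, 500, 300
HIT, IN_P, OUT_P = 0.95, 3.00, 15.00     # prices in $/M tokens

# Day 1, caching off: the entire prompt lands in input_tokens.
day1 = (REQS * (PREFIX + SUFFIX) * IN_P + REQS * OUT * OUT_P) / 1e6

# Day 2, caching on: the prefix splits into warm reads and cold writes.
inp   = REQS * SUFFIX                 # suffixes, always full price
write = REQS * (1 - HIT) * PREFIX     # cold prefixes, 1.25x
read  = REQS * HIT * PREFIX           # warm prefixes, ~0.1x
day2 = (inp * IN_P + write * IN_P * 1.25 + read * IN_P * 0.10
        + REQS * OUT * OUT_P) / 1e6

print(f"day 1: ${day1:.2f}  day 2: ${day2:.2f}  saving: {1 - day2/day1:.0%}")
```

Running it prints `day 1: $300.00  day 2: $97.80  saving: 67%`, matching the invoice rather than the broken dashboard.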
How to reconcile properly
- Pull from the provider invoice or Cost Console, not from a derived table. Anthropic's console shows all four line items.
- Sum every token bucket × its own price. Not just input + output.
- Pin the baseline to a fixed time window before caching was enabled anywhere in the codebase. Audit the SDK initialization across all services — `cache_control` blocks may be live in code you didn't write.
- Recompute the baseline cost-per-task (cost ÷ business-meaningful unit: per ticket, per query, per generated row). Cost-per-task is invariant to traffic and the only number worth reporting up.
- Snapshot the four-bucket breakdown weekly during the engagement. Movement between buckets is the audit trail for why the bill changed.
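The checklist above can be sketched as a small helper; `WeeklySnapshot` and its fields are hypothetical names, and the default prices are the Sonnet 3.5 figures from the example:

```python
from dataclasses import dataclass

@dataclass
class WeeklySnapshot:
    """One row of the four-bucket audit trail. Token counts come from the
    invoice or Cost Console; the task count from your own business metrics."""
    input_tokens: int
    cache_creation_input_tokens: int
    cache_read_input_tokens: int
    output_tokens: int
    tasks: int                          # tickets, queries, generated rows

    def cost(self, in_p=3.00, out_p=15.00, write_mult=1.25, read_mult=0.10):
        """Sum every bucket times its own price, in dollars."""
        return (self.input_tokens * in_p
                + self.cache_creation_input_tokens * in_p * write_mult
                + self.cache_read_input_tokens * in_p * read_mult
                + self.output_tokens * out_p) / 1e6

    def cost_per_task(self, **prices):
        """The traffic-invariant number worth reporting up."""
        return self.cost(**prices) / self.tasks
```

Diffing the four token fields between consecutive snapshots shows exactly which bucket moved, which is the audit trail the last bullet asks for.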
OpenAI's version is subtler
OpenAI also reports cached input separately (`prompt_tokens_details.cached_tokens` on the API; a "cached input" line on the bill). The discount is ~50%, not ~90%, and the trap is smaller — but the same arithmetic applies. If your dashboard sums `prompt_tokens × input_price`, you are over-counting cost when caching is on, because `prompt_tokens` includes the cached portion at full price. Subtract `cached_tokens × input_price × 0.5` to get the real number.
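That correction can be sketched against an OpenAI chat-completions `usage` dict; the 0.5 discount is the rate quoted above, so treat it as a parameter and check the current price sheet for your model:

```python
def openai_true_cost(usage, input_price, output_price, cached_discount=0.5):
    """True request cost in dollars; prices are $/M tokens.

    `prompt_tokens` already includes the cached portion at full price,
    so subtract the discount on `cached_tokens` rather than adding a
    separate cache line as on Anthropic invoices.
    """
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return (usage["prompt_tokens"] * input_price
            - cached * input_price * cached_discount
            + usage["completion_tokens"] * output_price) / 1e6
```

Note the sign difference from the Anthropic case: there you add cache lines back in, here you subtract a discount out, but both reconcile to the same rule of pricing every token at the rate it was actually billed.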