RESEARCH · TECHNIQUE

Semantic caching.

25 April 2026

Semantic caching embeds an incoming request, looks up the nearest prior request above a similarity threshold, and returns that prior response instead of calling the model. When it works, it cuts spend 20–40% on the affected endpoint with zero added latency. When it doesn't, it returns wrong answers — and you get told about it on Twitter.
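The lookup described above can be sketched in a few lines. This is a minimal in-memory version: `toy_embed`, the linear scan, and the threshold value are illustrative assumptions; a real deployment would call an embedding API and use a vector index instead of scanning a list.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold  # minimum similarity to serve a cached response
        self.entries = []           # list of (embedding, response) pairs

    def get(self, prompt):
        # Embed the incoming request and scan for the nearest prior request.
        v = self.embed(prompt)
        best_sim, best_resp = -1.0, None
        for emb, resp in self.entries:
            sim = cosine(v, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        # Serve the cached response only above the similarity threshold.
        return best_resp if best_sim >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

# Toy bag-of-words embedding for the demo below -- an assumption,
# standing in for a real embedding model.
def toy_embed(text):
    vocab = ["refund", "policy", "what", "is", "the", "weather"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("what is the refund policy", "Refunds are accepted within 30 days.")
hit = cache.get("what is the refund policy")   # identical request: served from cache
miss = cache.get("what is the weather today")  # below threshold: falls through to the model
```

In production the linear scan is replaced by an approximate-nearest-neighbor index, but the threshold check at the end is the same.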

Where it works

Where it breaks

Picking a similarity threshold

This is the lever that destroys quality if you set it wrong. Defaults we use as a starting point (cosine similarity on `text-embedding-3-small`):

| Endpoint | Threshold |
| --- | --- |
| RAG retrieval | 0.95 |
| Classification | 0.92 |
| Customer support | 0.97 |
| Boilerplate | 0.90 |

Tune by sampling 200 cache hits per endpoint and judging each (input, served-from-cache response) pair with an LLM-as-judge plus a human spot-check. If precision drops below 95%, raise the threshold.
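That tuning loop can be sketched as a periodic job. Here `judge` stands in for the LLM-as-judge call, and the 0.01 step size is an assumption, not a value from this article:

```python
import random

def tune_threshold(hits, judge, threshold,
                   target_precision=0.95, step=0.01, sample_size=200):
    """Raise the threshold if cache-hit precision falls below target.

    hits:  list of (input, served_from_cache_response) pairs collected
           at the current threshold.
    judge: callable (input, response) -> bool; True if the cached
           response is acceptable for the input (LLM-as-judge stand-in).
    """
    sample = random.sample(hits, min(sample_size, len(hits)))
    good = sum(judge(inp, resp) for inp, resp in sample)
    precision = good / len(sample)
    if precision < target_precision:
        # One step up per tuning pass; never exceed exact-match similarity.
        threshold = min(1.0, threshold + step)
    return threshold

# Demo with a deterministic stand-in judge: 100 hits, the "strict"
# judge rejects every tenth pair (90% precision), the "lenient" judge
# accepts everything.
hits = [(str(i), str(i)) for i in range(100)]
t_ok  = tune_threshold(hits, lambda inp, resp: True, threshold=0.92)
t_bad = tune_threshold(hits, lambda inp, resp: int(inp) % 10 != 0, threshold=0.92)
```

Raising in small steps and re-measuring beats jumping straight to a high threshold, since each step also lowers the hit rate.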

Implementation cost

Embedding tokens are not free. At `$0.02 / 1M tokens` for `text-embedding-3-small`, the embedding cost is negligible compared with a frontier-model call — but if you embed every request and your hit rate is 3%, you're paying for embeddings and cache infrastructure without the savings to justify them. Measure hit rate before scaling.
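The break-even arithmetic is simple to write down. The numbers below are illustrative assumptions (500-token average prompts, a $0.01 model call), not figures from this article:

```python
def breakeven_hit_rate(embed_price_per_mtok, avg_prompt_tokens, model_call_cost):
    """Hit rate at which embedding spend equals model-call savings.

    Per request: you always pay to embed; you save model_call_cost
    only on a hit. Break-even is where the two are equal.
    """
    embed_cost_per_request = embed_price_per_mtok * avg_prompt_tokens / 1_000_000
    return embed_cost_per_request / model_call_cost

# $0.02 / 1M embedding tokens, 500-token prompts, $0.01 per model call.
rate = breakeven_hit_rate(0.02, 500, 0.01)  # -> 0.001, i.e. 0.1%
```

The raw break-even rate is tiny, which is exactly why hit rate, not embedding price, is the number to measure first: at a low hit rate the embedding bill isn't what kills you, the unjustified operational complexity is.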

Stack we typically use

