REFERENCE
Glossary.
25 April 2026
Plain-English definitions of the terms that show up on AI bills and in optimization conversations. Where a term has a deeper write-up, the link goes there.
A
- AI FinOps
- The practice of governing AI / LLM spend the way cloud FinOps governs cloud spend: visibility, attribution, baselines, optimization, and accountability. See the cost optimization guide.
- Anthropic
- Maker of the Claude family of models (Haiku, Sonnet, Opus). First-party API and also resold via AWS Bedrock and Google Vertex AI. Bills cache writes, cache reads, input, and output as separate line items — see the baseline trap.
- Azure OpenAI
- Microsoft's enterprise reseller of OpenAI models. Same weights as the OpenAI API, but with Enterprise Agreement (EA) pricing, regional residency, and Microsoft commercial terms. Sometimes a meaningful price win over OpenAI direct, depending on your EA.
B
- Batch API
- An async inference mode (OpenAI, Anthropic, Bedrock, Vertex) that discounts input and output by ~50% in exchange for a 24-hour completion SLA. The right home for evals, enrichment, and nightly jobs. See batch API routing.
- Bedrock
- AWS's managed inference service. Hosts Anthropic, Meta Llama, Mistral, Cohere, Amazon Nova, and others. Same-VPC inference, IAM auth, and reserved capacity make it attractive for AWS-resident workloads. See provider arbitrage.
C
- Cache-read tokens
- Input tokens served from a prompt cache hit. Discounted ~50% (OpenAI) or ~90% (Anthropic) versus full-price input. Reported as a separate invoice line — easy to miscount when setting a baseline. See the baseline trap.
- Cache-write tokens
- Input tokens that populated a cache entry on the first request. On Anthropic, billed at 1.25× (5-min ephemeral) or 2× (1-hour extended) the input rate. The premium pays back fast on any reused prefix. See prompt caching.
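The payback arithmetic is worth seeing once. A minimal sketch with normalized prices, assuming the Anthropic-style rates from this entry (1.25× write premium, ~90% read discount); the constants are illustrative, not quoted prices:

```python
# Break-even sketch for prompt caching. Prices are normalized so that
# one uncached input token costs 1.0; multipliers match the glossary entry.
INPUT_RATE = 1.0       # normalized cost per uncached input token
WRITE_PREMIUM = 1.25   # 5-minute ephemeral cache write multiplier
READ_RATE = 0.10       # cache-read multiplier (~90% discount)

def cached_cost(prefix_tokens: int, reuses: int) -> float:
    """One cache write plus `reuses` cache reads of the same prefix."""
    return prefix_tokens * (WRITE_PREMIUM + reuses * READ_RATE) * INPUT_RATE

def uncached_cost(prefix_tokens: int, reuses: int) -> float:
    """The same prefix paid at full input price every time."""
    return prefix_tokens * (1 + reuses) * INPUT_RATE

# A single reuse already beats full price: 1.35 vs 2.0 per prefix token.
print(cached_cost(1, 1), uncached_cost(1, 1))
```

With these rates the 25% write premium is recovered on the first cache hit, which is why the entry says the premium "pays back fast on any reused prefix."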
- Cascade routing
- A model-routing pattern: try a cheap model first; only escalate to an expensive model if a cheap classifier or self-confidence check says the cheap answer is unreliable. See model routing.
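The pattern above reduces to a few lines of control flow. A hypothetical sketch: `call_cheap`, `call_expensive`, and the `confidence` field are stand-ins for real provider SDK calls and a real confidence signal (e.g. a logprob threshold or a small classifier), not any actual API:

```python
from dataclasses import dataclass

@dataclass
class Reply:
    text: str
    confidence: float  # e.g. mean token probability mapped to [0, 1]

# Stubbed model calls -- placeholders for real provider SDK calls.
def call_cheap(prompt: str) -> Reply:
    return Reply(text=f"cheap answer to {prompt!r}", confidence=0.62)

def call_expensive(prompt: str) -> Reply:
    return Reply(text=f"expensive answer to {prompt!r}", confidence=0.95)

def cascade(prompt: str, threshold: float = 0.8) -> Reply:
    """Try the cheap model first; escalate only when its confidence is low."""
    draft = call_cheap(prompt)
    if draft.confidence >= threshold:
        return draft
    return call_expensive(prompt)
```

The threshold is the cost/quality knob: raise it and more traffic escalates; lower it and more traffic stays on the cheap model.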
- Context caching (Vertex)
- Google's prompt-cache equivalent on Vertex Gemini. Charges a per-token-per-hour storage fee plus a discounted read rate. Slightly different mental model than OpenAI/Anthropic — you provision a cache resource explicitly.
- Cosine similarity
- A measure of how close two embedding vectors are, in the range -1 to 1. The standard threshold knob in semantic caching: hits above the threshold reuse a prior response. See semantic caching.
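The formula is the dot product of the two vectors divided by the product of their magnitudes. A self-contained sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product over the product of vector magnitudes; range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

In a semantic cache, a request whose embedding scores above the threshold (often somewhere around 0.9, tuned per workload) against a stored entry is treated as a hit.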
- Cost per task
- Spend divided by a business-meaningful unit (per ticket resolved, per row enriched, per query answered). The only KPI worth reporting up — invariant to traffic and aligned to value.
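Why this KPI is "invariant to traffic" is easiest to see with numbers. The figures below are illustrative, not from any real bill:

```python
# Cost per task: spend divided by a business-meaningful unit of work.
def cost_per_task(monthly_spend_usd: float, tasks_completed: int) -> float:
    return monthly_spend_usd / tasks_completed

# Doubling traffic doubles spend but leaves the KPI flat:
print(cost_per_task(12_000, 40_000))  # $0.30 per ticket
print(cost_per_task(24_000, 80_000))  # $0.30 per ticket
```

A raw spend chart makes growth look like a cost problem; cost per task only moves when efficiency or unit economics actually change.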
D
- Distillation
- Training a smaller "student" model to mimic a larger "teacher" model on a specific task. Done well, you get most of the quality at a fraction of the inference cost. The cost of training and eval is non-trivial — only worth it for high-volume, narrow tasks.
E
- Embedding
- A dense vector representation of a piece of text. The basis for retrieval, semantic search, and semantic caching. Embedding model spend is usually a small fraction of total LLM spend but underpins a lot of the optimization stack.
- Ephemeral cache
- Anthropic's short-TTL prompt cache (~5 minutes). Default option. Cheaper write premium (1.25×). Right choice for most workloads. See prompt caching.
- Extended cache
- Anthropic's long-TTL prompt cache (~1 hour). Higher write premium (2×). Right choice for bursty workloads with long idle gaps.
F
- Fine-tuning
- Adapting a base model's weights on your data. Lowers per-call cost on smaller fine-tuned models and can lift quality on narrow tasks; adds training cost, eval cost, and lifecycle risk. Often less attractive than prompt engineering plus caching for the first ~$100k of spend.
H
- Helicone
- A logging proxy for LLM API calls. Drop-in via base-URL swap; surfaces request history, cost attribution, and a built-in exact-match cache. See the comparison.
I
- Inference cost
- The cost of running a model to produce an output, billed in tokens (input + output, with caching variants). Distinct from training cost. The vast majority of an AI product's bill is inference.
K
- KV cache
- Key/Value cache — the intermediate attention state a transformer builds while reading the prompt. Provider prompt caches expose this internal optimization as a billable, reusable resource. See prompt caching.
L
- Langfuse
- An open-source LLM observability platform. Traces, evals, prompt management, dataset curation. Not a gateway. See the comparison.
- LiteLLM
- An open-source multi-provider gateway / SDK that unifies ~100 providers under the OpenAI Chat Completions schema. Routing, fallback, virtual keys, budgets. See the comparison.
- LLM-as-judge
- Using one LLM to score the outputs of another, typically against a rubric. The default eval methodology when there's no exact-match ground truth. Cheap, noisy, and reasonable when the judge model is stronger than the candidate.
- Logprobs
- Log-probabilities the model assigned to each generated token. Useful for confidence-based routing, classification thresholds, and debugging structured-output failures.
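One common way to turn per-token logprobs into a single confidence number is the geometric mean of the token probabilities. A sketch, assuming the logprobs arrive as a plain list of floats (one per generated token), as in OpenAI-style responses:

```python
import math

def mean_token_confidence(logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp of the mean logprob.
    1.0 means every token was certain; values near 0 mean low confidence."""
    return math.exp(sum(logprobs) / len(logprobs))

print(mean_token_confidence([0.0, 0.0]))              # fully certain: 1.0
print(mean_token_confidence([math.log(0.5)] * 3))     # uniformly 50%: 0.5
```

This is the kind of signal a cascade router or classification threshold can test against; it is crude (it ignores which tokens matter) but cheap and already paid for.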
M
- Model router
- A component that decides which model handles each request, usually based on task type, expected difficulty, or a confidence signal. The biggest single cost lever after caching. See model routing.
O
- OpenAI
- Maker of the GPT family. First-party API; also resold as Azure OpenAI. Cache-read tokens reported under prompt_tokens_details.cached_tokens; ~50% discount on cached input.
- OpenRouter
- A meta-router and price-discovery layer in front of dozens of model hosts. Useful for benchmarking the going rate on any given open-weights model. See provider arbitrage.
P
- Prefill
- The first phase of inference: reading the prompt and computing its KV cache. Fast and parallel. The phase prompt caching short-circuits.
- Prompt cache
- Provider-native reuse of the KV cache for shared prompt prefixes. Highest-ROI optimization in the playbook. See prompt caching.
- Prompt compression
- Shrinking a prompt without changing its meaning — trimming few-shot examples, summarizing retrieved context, replacing verbose instructions. Real savings, but eval before shipping; small wording changes can move quality.
- Provider arbitrage
- Buying the same (or quality-equivalent) model from a cheaper host. See provider arbitrage.
R
- RAG
- Retrieval-Augmented Generation. Fetch relevant passages from a corpus, stuff them in the prompt, generate an answer grounded in the retrieved context. The dominant pattern for knowledge-base assistants.
- Reasoning tokens
- Internal chain-of-thought tokens that some "reasoning" models (o-series, thinking-mode Claude) produce before the visible answer. Billed at the output rate. Easy to triple your bill if reasoning effort is set too high.
- Retrieval-augmented generation
- See RAG.
S
- Semantic cache
- A cache keyed on embedding similarity rather than byte-equality. Returns a prior answer when a new request is "close enough" to a stored one. Real savings on stable RAG / classification workloads, real risk on generative ones. See semantic caching.
- Speculative decoding
- An inference-time speedup where a small draft model proposes tokens that a larger model verifies in parallel. Reduces latency more than cost on managed APIs; mostly relevant if you self-host.
- System prompt
- The instruction block that frames the model's role and rules. The largest cacheable static chunk in most production prompts — keep it stable to keep cache hits.
T
- Token
- The unit of billing. Roughly 0.75 words for English, but varies by tokenizer. Input tokens, output tokens, cache-read tokens, cache-write tokens, and reasoning tokens are all priced differently.
- Token leak
- Spend on tokens that aren't producing user value: oversized retrieved context, runaway reasoning, repeated system prompts that should be cached, debug prints in production prompts. The unglamorous half of every audit.
- Tool calls
- Structured function-invocation messages a model emits to call external systems. Each round-trip is billed; agent loops with poor termination logic are a classic source of token leak.
V
- Vertex AI
- Google Cloud's managed inference platform. Hosts Gemini, Claude (via partnership), Llama, Mistral, and others. Different pricing tier and quota model than AI Studio. See provider arbitrage.
- vLLM
- An open-source high-throughput inference server. Relevant if you're self-hosting open-weights models on your own GPUs; not relevant for pure managed-API users.