REFERENCE

Glossary.

25 April 2026

By the LLM CFO team

Plain-English definitions of the terms that show up on AI bills and in optimization conversations. Where a term has a deeper write-up, the link goes there.


A

AI FinOps
The practice of governing AI / LLM spend the way cloud FinOps governs cloud spend: visibility, attribution, baselines, optimization, and accountability. See the cost optimization guide.
Anthropic
Maker of the Claude family of models (Haiku, Sonnet, Opus). First-party API and also resold via AWS Bedrock and Google Vertex AI. Bills cache writes, cache reads, input, and output as separate line items — see the baseline trap.
Azure OpenAI
Microsoft's enterprise reseller of OpenAI models. Same weights as the OpenAI API, but with EA pricing, regional residency, and Microsoft commercial terms. Sometimes a meaningful price win versus OpenAI direct depending on your Microsoft EA.

B

Batch API
An async inference mode (OpenAI, Anthropic, Bedrock, Vertex) that discounts input and output by ~50% in exchange for a 24-hour completion SLA. The right home for evals, enrichment, and nightly jobs. See batch API routing.
Bedrock
AWS's managed inference service. Hosts Anthropic, Meta Llama, Mistral, Cohere, Amazon Nova, and others. Same-VPC inference, IAM auth, and reserved capacity make it attractive for AWS-resident workloads. See provider arbitrage.

C

Cache-read tokens
Input tokens served from a prompt cache hit. Discounted ~50% (OpenAI) or ~90% (Anthropic) versus full-price input. Reported as a separate invoice line — easy to miscount when setting a baseline. See the baseline trap.
Cache-write tokens
Input tokens that populated a cache entry on the first request. On Anthropic, billed at 1.25× (5-min ephemeral) or 2× (1-hour extended) the input rate. The premium pays back fast on any reused prefix. See prompt caching.
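The payback works out in one reuse. A minimal sketch of the arithmetic, using the multipliers above; the dollar rate is an illustrative assumption, not a quoted price:

```python
# Break-even for an Anthropic-style ephemeral cache write.
INPUT_RATE = 3.00   # $ per million input tokens (assumed, illustrative)
WRITE_MULT = 1.25   # 5-minute ephemeral write premium
READ_MULT = 0.10    # cache reads at ~10% of the input rate

def cost_with_cache(prefix_mtok: float, reuses: int) -> float:
    """First call pays the write premium; each reuse pays the read rate."""
    write = prefix_mtok * INPUT_RATE * WRITE_MULT
    reads = reuses * prefix_mtok * INPUT_RATE * READ_MULT
    return write + reads

def cost_without_cache(prefix_mtok: float, reuses: int) -> float:
    """Every call pays full price for the same prefix."""
    return (1 + reuses) * prefix_mtok * INPUT_RATE

# One reuse already wins: 1.25x + 0.10x = 1.35x versus 2.00x full price.
saved = cost_without_cache(1.0, 1) - cost_with_cache(1.0, 1)
```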
Cascade routing
A model-routing pattern: try a cheap model first; only escalate to an expensive model if a cheap classifier or self-confidence check says the cheap answer is unreliable. See model routing.
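The pattern in one sketch. `call_model` is a stand-in for real provider calls that return an answer plus a confidence signal; the model names and floor are illustrative:

```python
CHEAP, EXPENSIVE = "small-model", "large-model"
CONFIDENCE_FLOOR = 0.85  # illustrative; tune per workload

def call_model(model: str, prompt: str) -> tuple[str, float]:
    # Placeholder: a real version would call the provider and derive
    # confidence from logprobs or a cheap classifier.
    if model == CHEAP:
        return ("cheap answer", 0.40 if "hard" in prompt else 0.95)
    return ("expensive answer", 0.99)

def answer(prompt: str) -> str:
    text, confidence = call_model(CHEAP, prompt)
    if confidence >= CONFIDENCE_FLOOR:
        return text                          # cheap answer stands
    text, _ = call_model(EXPENSIVE, prompt)  # escalate
    return text
```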
Context caching (Vertex)
Google's prompt-cache equivalent on Vertex Gemini. Charges a per-token-per-hour storage fee plus a discounted read rate. Slightly different mental model than OpenAI/Anthropic — you provision a cache resource explicitly.
Cosine similarity
A measure of how close two embedding vectors are, in the range -1 to 1. The standard threshold knob in semantic caching: hits above the threshold reuse a prior response. See semantic caching.
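The measure itself is small enough to write out; the threshold value below is illustrative:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.92  # the tuning knob in a semantic cache

query_emb = [0.10, 0.80, 0.30]
cached_emb = [0.12, 0.79, 0.28]
hit = cosine_similarity(query_emb, cached_emb) >= THRESHOLD
```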
Cost per task
Spend divided by a business-meaningful unit (per ticket resolved, per row enriched, per query answered). The only KPI worth reporting up — invariant to traffic and aligned to value.
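The arithmetic, with made-up numbers for illustration:

```python
def cost_per_task(total_spend_usd: float, tasks_completed: int) -> float:
    """Spend over a period divided by business units completed in it."""
    return total_spend_usd / tasks_completed

# $4,200 of API spend resolving 12,000 tickets -> $0.35 per ticket.
per_ticket = cost_per_task(4_200, 12_000)
```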

D

Distillation
Training a smaller "student" model to mimic a larger "teacher" model on a specific task. Done well, you get most of the quality at a fraction of the inference cost. The cost of training and eval is non-trivial — only worth it for high-volume, narrow tasks.

E

Embedding
A dense vector representation of a piece of text. The basis for retrieval, semantic search, and semantic caching. Embedding model spend is usually a small fraction of total LLM spend but underpins a lot of the optimization stack.
Ephemeral cache
Anthropic's short-TTL prompt cache (~5 minutes). Default option. Cheaper write premium (1.25×). Right choice for most workloads. See prompt caching.
Extended cache
Anthropic's long-TTL prompt cache (~1 hour). Higher write premium (2×). Right choice for bursty workloads with long idle gaps.

F

Fine-tuning
Adapting a base model's weights on your data. Lowers per-call cost on smaller fine-tuned models and can lift quality on narrow tasks; adds training cost, eval cost, and lifecycle risk. Often less attractive than prompt engineering plus caching for the first ~$100k of spend.

H

Helicone
A logging proxy for LLM API calls. Drop-in via base-URL swap; surfaces request history, cost attribution, and a built-in exact-match cache. See the comparison.

I

Inference cost
The cost of running a model to produce an output, billed in tokens (input + output, with caching variants). Distinct from training cost. The vast majority of an AI product's bill is inference.

K

KV cache
Key/Value cache — the intermediate attention state a transformer builds while reading the prompt. Provider prompt caches expose this internal optimization as a billable, reusable resource. See prompt caching.

L

LangFuse
An open-source LLM observability platform. Traces, evals, prompt management, dataset curation. Not a gateway. See the comparison.
LiteLLM
An open-source multi-provider gateway / SDK that unifies ~100 providers under the OpenAI Chat Completions schema. Routing, fallback, virtual keys, budgets. See the comparison.
LLM-as-judge
Using one LLM to score the outputs of another, typically against a rubric. The default eval methodology when there's no exact-match ground truth. Cheap, noisy, and reasonable when the judge model is stronger than the candidate.
Logprobs
Log-probabilities the model assigned to each generated token. Useful for confidence-based routing, classification thresholds, and debugging structured-output failures.
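One common way to collapse per-token logprobs into a single confidence score for routing or thresholding; the values here are illustrative, not from a real response:

```python
import math

def answer_confidence(token_logprobs: list[float]) -> float:
    """Geometric mean of per-token probabilities: exp(mean(logprobs))."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

confident = answer_confidence([-0.01, -0.02, -0.05])  # close to 1.0
shaky = answer_confidence([-1.20, -0.90, -2.00])      # well below 0.5
```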

M

Model router
A component that decides which model handles each request, usually based on task type, expected difficulty, or a confidence signal. The biggest single cost lever after caching. See model routing.

O

OpenAI
Maker of the GPT family. First-party API; also resold as Azure OpenAI. Cache-read tokens reported under prompt_tokens_details.cached_tokens; ~50% discount on cached input.
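Pricing a response from that usage field looks like this; the dollar rates are illustrative placeholders, not quoted prices:

```python
INPUT_RATE = 2.50                 # $ per million uncached input tokens (assumed)
CACHED_RATE = INPUT_RATE * 0.5    # ~50% discount on cache reads

def input_cost(usage: dict) -> float:
    """Split prompt tokens into cached and uncached, price each."""
    total = usage["prompt_tokens"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    uncached = total - cached
    return (uncached * INPUT_RATE + cached * CACHED_RATE) / 1_000_000

usage = {"prompt_tokens": 10_000,
         "prompt_tokens_details": {"cached_tokens": 8_000}}
# 2,000 full-price tokens + 8,000 discounted tokens.
cost = input_cost(usage)
```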
OpenRouter
A meta-router and price-discovery layer in front of dozens of model hosts. Useful for benchmarking the going rate on any given open-weights model. See provider arbitrage.

P

Prefill
The first phase of inference: reading the prompt and computing its KV cache. Fast and parallel. The phase that prompt caching short-circuits.
Prompt cache
Provider-native reuse of the KV cache for shared prompt prefixes. Highest-ROI optimization in the playbook. See prompt caching.
Prompt compression
Shrinking a prompt without changing its meaning — trimming few-shot examples, summarizing retrieved context, replacing verbose instructions. Real savings, but eval before shipping; small wording changes can move quality.
Provider arbitrage
Buying the same (or quality-equivalent) model from a cheaper host. See provider arbitrage.

R

RAG
Retrieval-Augmented Generation. Fetch relevant passages from a corpus, stuff them in the prompt, generate an answer grounded in the retrieved context. The dominant pattern for knowledge-base assistants.
Reasoning tokens
Internal chain-of-thought tokens that some "reasoning" models (o-series, thinking-mode Claude) produce before the visible answer. Billed at the output rate. Easy to triple your bill if reasoning effort is set too high.
Retrieval-augmented generation
See RAG.

S

Semantic cache
A cache keyed on embedding similarity rather than byte-equality. Returns a prior answer when a new request is "close enough" to a stored one. Real savings on stable RAG / classification workloads, real risk on generative ones. See semantic caching.
Speculative decoding
An inference-time speedup where a small draft model proposes tokens that a larger model verifies in parallel. Reduces latency more than cost on managed APIs; mostly relevant if you self-host.
System prompt
The instruction block that frames the model's role and rules. The largest cacheable static chunk in most production prompts — keep it stable to keep cache hits.

T

Token
The unit of billing. Roughly 0.75 words for English, but varies by tokenizer. Input tokens, output tokens, cache-read tokens, cache-write tokens, and reasoning tokens are all priced differently.
Token leak
Spend on tokens that aren't producing user value: oversized retrieved context, runaway reasoning, repeated system prompts that should be cached, debug prints in production prompts. The unglamorous half of every audit.
Tool calls
Structured function-invocation messages a model emits to call external systems. Each round-trip is billed; agent loops with poor termination logic are a classic source of token leak.

V

Vertex AI
Google Cloud's managed inference platform. Hosts Gemini, Claude (via partnership), Llama, Mistral, and others. Different pricing tier and quota model than AI Studio. See provider arbitrage.
vLLM
An open-source high-throughput inference server. Relevant if you're self-hosting open-weights models on your own GPUs; not relevant for pure managed-API users.
