REFERENCE
Glossary.
25 April 2026
Plain-English definitions of the terms that show up on AI bills and in optimization conversations. Where a term has a deeper write-up, the link goes there.
A
- AI FinOps
- The practice of governing AI / LLM spend the way cloud FinOps governs cloud spend: visibility, attribution, baselines, optimization, and accountability. See the cost optimization guide.
- Anthropic
- Maker of the Claude family of models (Haiku, Sonnet, Opus). First-party API and also resold via AWS Bedrock and Google Vertex AI. Bills cache writes, cache reads, input, and output as separate line items — see the baseline trap.
- Azure OpenAI
- Microsoft's enterprise reseller of OpenAI models. Same weights as the OpenAI API, but with Enterprise Agreement (EA) pricing, regional residency, and Microsoft commercial terms. Sometimes a meaningful price win over OpenAI direct, depending on your EA.
B
- Batch API
- An async inference mode (OpenAI, Anthropic, Bedrock, Vertex) that discounts input and output by ~50% in exchange for a 24-hour completion SLA. The right home for evals, enrichment, and nightly jobs. See batch API routing.
- Bedrock
- AWS's managed inference service. Hosts Anthropic, Meta Llama, Mistral, Cohere, Amazon Nova, and others. Same-VPC inference, IAM auth, and reserved capacity make it attractive for AWS-resident workloads. See provider arbitrage.
C
- Cache-read tokens
- Input tokens served from a prompt cache hit. Discounted ~50% (OpenAI) or ~90% (Anthropic) versus full-price input. Reported as a separate invoice line — easy to miscount when setting a baseline. See the baseline trap.
- Cache-write tokens
- Input tokens that populated a cache entry on the first request. On Anthropic, billed at 1.25× (5-min ephemeral) or 2× (1-hour extended) the input rate. The premium pays back fast on any reused prefix. See prompt caching.
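The payback arithmetic is worth seeing once. A minimal sketch with normalized prices, assuming the Anthropic-style rates from this entry (1.25× write premium, ~90% read discount); the constants are illustrative, not quoted prices:

```python
# Break-even sketch for prompt caching. Prices are normalized so that
# one uncached input token costs 1.0; multipliers match the glossary entry.
INPUT_RATE = 1.0       # normalized cost per uncached input token
WRITE_PREMIUM = 1.25   # 5-minute ephemeral cache write multiplier
READ_RATE = 0.10       # cache-read multiplier (~90% discount)

def cached_cost(prefix_tokens: int, reuses: int) -> float:
    """One cache write plus `reuses` cache reads of the same prefix."""
    return prefix_tokens * (WRITE_PREMIUM + reuses * READ_RATE) * INPUT_RATE

def uncached_cost(prefix_tokens: int, reuses: int) -> float:
    """The same prefix paid at full input price every time."""
    return prefix_tokens * (1 + reuses) * INPUT_RATE

# A single reuse already beats full price: 1.35 vs 2.0 per prefix token.
print(cached_cost(1, 1), uncached_cost(1, 1))
```

With these rates the 25% write premium is recovered on the first cache hit, which is why the entry says the premium "pays back fast on any reused prefix."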
- Cascade routing
- A model-routing pattern: try a cheap model first; only escalate to an expensive model if a cheap classifier or self-confidence check says the cheap answer is unreliable. See model routing.
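The pattern above reduces to a few lines of control flow. A hypothetical sketch: `call_cheap`, `call_expensive`, and the `confidence` field are stand-ins for real provider SDK calls and a real confidence signal (e.g. a logprob threshold or a small classifier), not any actual API:

```python
from dataclasses import dataclass

@dataclass
class Reply:
    text: str
    confidence: float  # e.g. mean token probability mapped to [0, 1]

# Stubbed model calls -- placeholders for real provider SDK calls.
def call_cheap(prompt: str) -> Reply:
    return Reply(text=f"cheap answer to {prompt!r}", confidence=0.62)

def call_expensive(prompt: str) -> Reply:
    return Reply(text=f"expensive answer to {prompt!r}", confidence=0.95)

def cascade(prompt: str, threshold: float = 0.8) -> Reply:
    """Try the cheap model first; escalate only when its confidence is low."""
    draft = call_cheap(prompt)
    if draft.confidence >= threshold:
        return draft
    return call_expensive(prompt)
```

The threshold is the cost/quality knob: raise it and more traffic escalates; lower it and more traffic stays on the cheap model.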
- Context caching (Vertex)
- Google's prompt-cache equivalent on Vertex Gemini. Charges a per-token-per-hour storage fee plus a discounted read rate. Slightly different mental model than OpenAI/Anthropic — you provision a cache resource explicitly.
- Cosine similarity
- A measure of how close two embedding vectors are, in the range -1 to 1. The standard threshold knob in semantic caching: hits above the threshold reuse a prior response. See semantic caching.
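The formula is the dot product of the two vectors divided by the product of their magnitudes. A self-contained sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product over the product of vector magnitudes; range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

In a semantic cache, a request whose embedding scores above the threshold (often somewhere around 0.9, tuned per workload) against a stored entry is treated as a hit.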
- Cost per task
- Spend divided by a business-meaningful unit (per ticket resolved, per row enriched, per query answered). The only KPI worth reporting up — invariant to traffic and aligned to value.
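Why this KPI is "invariant to traffic" is easiest to see with numbers. The figures below are illustrative, not from any real bill:

```python
# Cost per task: spend divided by a business-meaningful unit of work.
def cost_per_task(monthly_spend_usd: float, tasks_completed: int) -> float:
    return monthly_spend_usd / tasks_completed

# Doubling traffic doubles spend but leaves the KPI flat:
print(cost_per_task(12_000, 40_000))  # $0.30 per ticket
print(cost_per_task(24_000, 80_000))  # $0.30 per ticket
```

A raw spend chart makes growth look like a cost problem; cost per task only moves when efficiency or unit economics actually change.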
D
- Distillation
- Training a smaller "student" model to mimic a larger "teacher" model on a specific task. Done well, you get most of the quality at a fraction of the inference cost. The cost of training and eval is non-trivial — only worth it for high-volume, narrow tasks.
E
- Embedding
- A dense vector representation of a piece of text. The basis for retrieval, semantic search, and semantic caching. Embedding model spend is usually a small fraction of total LLM spend but underpins a lot of the optimization stack.
- Ephemeral cache
- Anthropic's short-TTL prompt cache (~5 minutes). Default option. Cheaper write premium (1.25×). Right choice for most workloads. See prompt caching.
- Extended cache
- Anthropic's long-TTL prompt cache (~1 hour). Higher write premium (2×). Right choice for bursty workloads with long idle gaps.
F
- Fine-tuning
- Adapting a base model's weights on your data. Lowers per-call cost on smaller fine-tuned models and can lift quality on narrow tasks; adds training cost, eval cost, and lifecycle risk. Often less attractive than prompt engineering plus caching for the first ~$100k of spend.
H
- Helicone
- A logging proxy for LLM API calls. Drop-in via base-URL swap; surfaces request history, cost attribution, and a built-in exact-match cache. See the comparison.
I
- Inference cost
- The cost of running a model to produce an output, billed in tokens (input + output, with caching variants). Distinct from training cost. The vast majority of an AI product's bill is inference.
K
- KV cache
- Key/Value cache — the intermediate attention state a transformer builds while reading the prompt. Provider prompt caches expose this internal optimization as a billable, reusable resource. See prompt caching.
L
- Langfuse
- An open-source LLM observability platform. Traces, evals, prompt management, dataset curation. Not a gateway. See the comparison.
- LiteLLM
- An open-source multi-provider gateway / SDK that unifies ~100 providers under the OpenAI Chat Completions schema. Routing, fallback, virtual keys, budgets. See the comparison.
- LLM-as-judge
- Using one LLM to score the outputs of another, typically against a rubric. The default eval methodology when there's no exact-match ground truth. Cheap, noisy, and reasonable when the judge model is stronger than the candidate.
- Logprobs
- Log-probabilities the model assigned to each generated token. Useful for confidence-based routing, classification thresholds, and debugging structured-output failures.
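One common way to turn per-token logprobs into a single confidence number is the geometric mean of the token probabilities. A sketch, assuming the logprobs arrive as a plain list of floats (one per generated token), as in OpenAI-style responses:

```python
import math

def mean_token_confidence(logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp of the mean logprob.
    1.0 means every token was certain; values near 0 mean low confidence."""
    return math.exp(sum(logprobs) / len(logprobs))

print(mean_token_confidence([0.0, 0.0]))              # fully certain: 1.0
print(mean_token_confidence([math.log(0.5)] * 3))     # uniformly 50%: 0.5
```

This is the kind of signal a cascade router or classification threshold can test against; it is crude (it ignores which tokens matter) but cheap and already paid for.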
M
- Model router
- A component that decides which model handles each request, usually based on task type, expected difficulty, or a confidence signal. The biggest single cost lever after caching. See model routing.
O
- OpenAI
- Maker of the GPT family. First-party API; also resold as Azure OpenAI. Cache-read tokens reported under prompt_tokens_details.cached_tokens; ~50% discount on cached input.
- OpenRouter
- A meta-router and price-discovery layer in front of dozens of model hosts. Useful for benchmarking the going rate on any given open-weights model. See provider arbitrage.
P
- Prefill
- The first phase of inference: reading the prompt and computing its KV cache. Fast and parallel. The phase prompt caching short-circuits.
- Prompt cache
- Provider-native reuse of the KV cache for shared prompt prefixes. Highest-ROI optimization in the playbook. See prompt caching.
- Prompt compression
- Shrinking a prompt without changing its meaning — trimming few-shot examples, summarizing retrieved context, replacing verbose instructions. Real savings, but eval before shipping; small wording changes can move quality.
- Provider arbitrage
- Buying the same (or quality-equivalent) model from a cheaper host. See provider arbitrage.
R
- RAG
- Retrieval-Augmented Generation. Fetch relevant passages from a corpus, stuff them in the prompt, generate an answer grounded in the retrieved context. The dominant pattern for knowledge-base assistants.
- Reasoning tokens
- Internal chain-of-thought tokens that some "reasoning" models (o-series, thinking-mode Claude) produce before the visible answer. Billed at the output rate. Easy to triple your bill if reasoning effort is set too high.
- Retrieval-augmented generation
- See RAG.
S
- Semantic cache
- A cache keyed on embedding similarity rather than byte-equality. Returns a prior answer when a new request is "close enough" to a stored one. Real savings on stable RAG / classification workloads, real risk on generative ones. See semantic caching.
- Speculative decoding
- An inference-time speedup where a small draft model proposes tokens that a larger model verifies in parallel. Reduces latency more than cost on managed APIs; mostly relevant if you self-host.
- System prompt
- The instruction block that frames the model's role and rules. The largest cacheable static chunk in most production prompts — keep it stable to keep cache hits.
T
- Token
- The unit of billing. Roughly 0.75 words for English, but varies by tokenizer. Input tokens, output tokens, cache-read tokens, cache-write tokens, and reasoning tokens are all priced differently.
- Token leak
- Spend on tokens that aren't producing user value: oversized retrieved context, runaway reasoning, repeated system prompts that should be cached, debug prints in production prompts. The unglamorous half of every audit.
- Tool calls
- Structured function-invocation messages a model emits to call external systems. Each round-trip is billed; agent loops with poor termination logic are a classic source of token leak.
V
- Vertex AI
- Google Cloud's managed inference platform. Hosts Gemini, Claude (via partnership), Llama, Mistral, and others. Different pricing tier and quota model than AI Studio. See provider arbitrage.
- vLLM
- An open-source high-throughput inference server. Relevant if you're self-hosting open-weights models on your own GPUs; not relevant for pure managed-API users.