LLM cost per request.
Operations guide · 11 June 2026
LLM cost per request is the unit economics metric that measures the total cost to serve one API call. It is the atomic unit of LLM profitability. If you only know total spend and total requests, you cannot distinguish between a cheap mass-market feature and an expensive specialist workflow—yet they require opposite optimization strategies.
The formula
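Cost per request is simple to state and hard to compute correctly. Providers quote prices per million tokens, so each term is divided by 1,000,000:
cost_per_request =
(input_tokens × input_price_per_1m / 1,000,000) +
(output_tokens × output_price_per_1m / 1,000,000) +
(cache_read_tokens × discounted_price_per_1m / 1,000,000)
Here input_tokens counts only the uncached input tokens; tokens served from cache are billed on the cache-read line.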
The hard part is the pricing table. It must be keyed by provider, model, and token type, because OpenAI's GPT-4o input and output prices differ, Anthropic's cache-read discount is ~90% (not 50%), and the same model served through Azure or Bedrock can be priced differently from the provider's native API.
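As a minimal sketch of the formula with a keyed pricing table, the Python below looks up a per-million price for each token type and divides by 1,000,000. The provider and model names and every price are hypothetical placeholders, not real list prices, and `input_tokens` is assumed to exclude tokens read from cache.

```python
from typing import Dict, Tuple

# Hypothetical USD prices per 1M tokens, keyed by (provider, model, token_type).
# These are placeholders for illustration, not real list prices.
PRICE_PER_1M: Dict[Tuple[str, str, str], float] = {
    ("example-provider", "example-large", "input"): 3.00,
    ("example-provider", "example-large", "output"): 15.00,
    ("example-provider", "example-large", "cache_read"): 0.30,
}


def cost_per_request(
    provider: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: int = 0,
) -> float:
    """Return the USD cost of one request.

    Assumes input_tokens counts only uncached input tokens.
    """

    def line(token_type: str, tokens: int) -> float:
        if tokens == 0:
            return 0.0  # no lookup needed, so models without this price still work
        price = PRICE_PER_1M[(provider, model, token_type)]
        return tokens * price / 1_000_000

    return (
        line("input", input_tokens)
        + line("output", output_tokens)
        + line("cache_read", cache_read_tokens)
    )


example_cost = cost_per_request(
    "example-provider", "example-large",
    input_tokens=2_000, output_tokens=500, cache_read_tokens=8_000,
)
print(f"{example_cost:.4f}")
```

A missing key raises `KeyError` on purpose in this sketch: a request priced with the wrong table entry is harder to notice than one that fails loudly.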
Why averages are dangerous
- Mean hides the long tail. If one customer submits 100k-token prompts and the rest submit 1k-token prompts, the mean cost per request describes neither group. The p95 cost per request matters more for margin planning.
- Model mix shift is invisible at the aggregate. Switching one endpoint to a model priced at twice the per-token rate doubles cost per request for that endpoint. If that endpoint is 5% of your traffic and previously cost about the average, the overall cost per request rises by only about 5%, which is easy to miss until the bill arrives.
- Feature profitability inverts with scale. A feature that costs $0.001 per request and generates $0.002 in revenue is profitable at 100k users. At 1M users it can turn into a loss leader if the added users send longer prompts or trigger more retries while the price per request stays fixed, and you will only see that if you segment by feature.
- Caching returns are invisible to the aggregate. Implementing prompt caching on one endpoint might reduce its cost per request by 60%, but if that endpoint is only 3% of your traffic, you will not see it in the company-wide metric.
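The first two points can be checked with a few lines of arithmetic. The sketch below uses made-up request costs and traffic shares (all numbers are assumptions chosen for illustration) to compare the mean with the median and p95, and to show how little the blended figure moves when a 5% endpoint doubles in cost.

```python
import statistics

# Hypothetical per-request costs in USD: 90 short prompts and 10 very long ones.
costs = [0.001] * 90 + [0.100] * 10

mean_cost = statistics.mean(costs)
median_cost = statistics.median(costs)
# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
p95_cost = statistics.quantiles(costs, n=100, method="inclusive")[94]

print(f"mean={mean_cost:.4f} median={median_cost:.4f} p95={p95_cost:.4f}")


def blended_cost(segments):
    """Traffic-weighted cost per request from (traffic_share, cost_usd) pairs."""
    return sum(share * cost for share, cost in segments)


# One endpoint with 5% of traffic doubles from $0.002 to $0.004 per request.
before = blended_cost([(0.95, 0.002), (0.05, 0.002)])
after = blended_cost([(0.95, 0.002), (0.05, 0.004)])
relative_rise = after / before - 1

print(f"aggregate rise: {relative_rise:.1%}")
```

With this toy distribution the mean sits roughly ten times above the typical request and roughly ten times below the tail, so it is a poor planning number for either.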
Segment by feature, model, customer, and request type
The operational power of cost per request comes from segmentation:
- By feature or endpoint. Which workflows are expensive? Which are cheap? Are your high-margin features using cheap models?
- By model. What is the actual cost per request for GPT-4, Sonnet, and Haiku in your actual prompt distribution? Real data beats benchmark sheets.
- By customer or workspace. Do heavy users trigger longer contexts or more retries? Does one customer pay more per request than another because they use different models?
- By request type. Is your search feature cheaper than your summarization feature? Which retry loops are the most expensive?
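A minimal sketch of this segmentation, assuming each request has already been logged with a cost and a few business dimensions. The record fields (`feature`, `model`, `customer`, `cost_usd`) and all values are hypothetical; the same helper groups by whichever dimension is passed in.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request log; field names and values are illustrative only.
requests = [
    {"feature": "search", "model": "small", "customer": "acme", "cost_usd": 0.0004},
    {"feature": "search", "model": "small", "customer": "globex", "cost_usd": 0.0006},
    {"feature": "summarize", "model": "large", "customer": "acme", "cost_usd": 0.0120},
    {"feature": "summarize", "model": "large", "customer": "acme", "cost_usd": 0.0180},
]


def segment(records, dimension):
    """Group request costs by one dimension and summarize each group."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[dimension]].append(record["cost_usd"])
    return {
        key: {
            "requests": len(values),
            "mean_cost_usd": mean(values),
            "total_cost_usd": sum(values),
        }
        for key, values in buckets.items()
    }


by_feature = segment(requests, "feature")
by_customer = segment(requests, "customer")

for name, stats in by_feature.items():
    print(name, stats["requests"], f"{stats['mean_cost_usd']:.4f}")
```

In a real pipeline the same grouping would more likely run as a SQL `GROUP BY` or a dataframe aggregation; the point of the sketch is only that the cost column must exist per request before any of these questions can be answered.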
Use it for margin analysis
Cost per request is one half of unit economics. The other half is revenue per request, which you compute at the same granularity:
margin_per_request = revenue_per_request − cost_per_request
For a feature billed at $0.01 per request, a cost per request of $0.004 leaves a $0.006 contribution. For the same billed price, if cost per request rises to $0.008, the contribution drops to $0.002. At scale, that difference is material. Tools like Langfuse and Helicone expose this data directly; internal deployments can compute it from provider billing exports plus application logs.
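The worked numbers above can be reproduced in a short sketch. `Decimal` is used so that sub-cent amounts subtract exactly, and the monthly volume of one million requests is an assumption added only to show what the difference means at scale.

```python
from decimal import Decimal


def margin_per_request(revenue_per_request: Decimal, cost_per_request: Decimal) -> Decimal:
    """Contribution of a single request: revenue minus serving cost."""
    return revenue_per_request - cost_per_request


price = Decimal("0.01")  # billed price per request, from the example above

healthy = margin_per_request(price, Decimal("0.004"))
squeezed = margin_per_request(price, Decimal("0.008"))

# Hypothetical volume, used only to scale the per-request difference.
monthly_requests = 1_000_000
contribution_gap = (healthy - squeezed) * monthly_requests

print(healthy, squeezed, contribution_gap)
```

At that assumed volume, the same feature at the same price contributes $4,000 less per month once cost per request doubles.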
What moves cost per request
- Model choice. Switching to a smaller model is the fastest lever. GPT-3.5's input price is roughly 12% of GPT-4o's; check current price sheets, because these ratios shift with every pricing update.
- Prompt length. Longer system prompts, few-shot examples, and retrieved context all add input tokens, and input cost scales linearly with input tokens.
- Output caps. Limiting max completion tokens is a hard ceiling on output cost per request.
- Prompt caching. OpenAI and Anthropic both discount cached input tokens heavily (Anthropic ~90%, OpenAI ~50%). For stable context (system prompts, retrieved documents, code samples), that makes the cached portion of the prompt roughly 2–10× cheaper to read.
- Retries and fallbacks. Each retry or model fallback adds cost per request. Minimize failed requests and timeouts.
- Batch processing. For latency-insensitive work, batch API endpoints usually have lower per-token pricing than synchronous calls.
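To see how the caching lever plays out, the sketch below computes input cost for a prompt where part of the context is read from cache. The $3 per million input price, the 10,000-token prompt, and the 80% cached share are assumptions for illustration, and cache-write surcharges (which some providers bill) are deliberately left out.

```python
def input_cost_usd(
    total_input_tokens: int,
    cached_fraction: float,
    input_price_per_1m: float,
    cache_read_discount: float,
) -> float:
    """Input-side cost when cached_fraction of the prompt is read from cache.

    cache_read_discount is the fraction taken off the normal input price
    (0.9 means cached tokens cost 10% of the normal price).
    Cache-write surcharges are ignored in this sketch.
    """
    cached_tokens = total_input_tokens * cached_fraction
    uncached_tokens = total_input_tokens - cached_tokens
    cached_price = input_price_per_1m * (1 - cache_read_discount)
    return (uncached_tokens * input_price_per_1m + cached_tokens * cached_price) / 1_000_000


# Hypothetical workload: 10k-token prompt, 80% stable context, $3 per 1M input tokens.
no_cache = input_cost_usd(10_000, 0.0, 3.00, 0.9)
with_90_discount = input_cost_usd(10_000, 0.8, 3.00, 0.9)
with_50_discount = input_cost_usd(10_000, 0.8, 3.00, 0.5)

saving_90 = 1 - with_90_discount / no_cache
saving_50 = 1 - with_50_discount / no_cache

print(f"{saving_90:.0%} saved at a 90% discount, {saving_50:.0%} at a 50% discount")
```

The saving is capped by the cached share of the prompt: with 80% of tokens cached, even a 100% discount could not cut input cost by more than 80%.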
How to measure it in practice
Start with your observability layer. If you use the OpenTelemetry GenAI semantic conventions, each span carries the provider, model, input tokens, and output tokens, plus cache tokens where your instrumentation records them. Multiply each token count by the matching entry in the price table. If you use LiteLLM, it tracks cost per request natively. If you build the pipeline yourself, export provider usage CSVs (OpenAI, Anthropic, Bedrock, Azure, Vertex) and join them to application logs by timestamp and model. The result is a cost column you can segment by any business dimension.
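As a rough sketch of the span-based route, the code below treats each span as a plain dictionary of attributes and attaches a cost column. The `gen_ai.*` keys are modeled on the OpenTelemetry GenAI semantic conventions, but attribute names have changed between versions, so verify them against your instrumentation. `app.feature` and `app.cache_read_tokens` are hypothetical application-level attributes, and all prices are placeholders.

```python
# Hypothetical USD prices per 1M tokens, keyed by (provider, model).
PRICES = {
    ("example-provider", "example-large"): {
        "input": 3.00,
        "output": 15.00,
        "cache_read": 0.30,
    },
}


def span_cost_usd(attrs: dict) -> float:
    """Price one span from its token-usage attributes.

    Assumes the input token count excludes tokens read from cache.
    """
    prices = PRICES[(attrs["gen_ai.system"], attrs["gen_ai.request.model"])]
    tokens = {
        "input": attrs.get("gen_ai.usage.input_tokens", 0),
        "output": attrs.get("gen_ai.usage.output_tokens", 0),
        "cache_read": attrs.get("app.cache_read_tokens", 0),  # hypothetical attribute
    }
    return sum(tokens[kind] * prices[kind] for kind in tokens) / 1_000_000


# Hypothetical exported spans.
spans = [
    {
        "gen_ai.system": "example-provider",
        "gen_ai.request.model": "example-large",
        "gen_ai.usage.input_tokens": 1_000,
        "gen_ai.usage.output_tokens": 200,
        "app.cache_read_tokens": 4_000,
        "app.feature": "summarize",
    },
    {
        "gen_ai.system": "example-provider",
        "gen_ai.request.model": "example-large",
        "gen_ai.usage.input_tokens": 500,
        "gen_ai.usage.output_tokens": 100,
        "app.feature": "search",
    },
]

# Attach the cost column so later steps can group by any attribute.
rows = [{**span, "cost_usd": span_cost_usd(span)} for span in spans]

for row in rows:
    print(row["app.feature"], f"{row['cost_usd']:.4f}")
```

Whether a provider's reported input token count already includes cached tokens varies, so that assumption is worth checking before trusting the totals.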
Directional guidance only: compute your own baseline
Every team's cost per request is shaped by their own models, context length, output requirements, and caching strategy. Publishing a benchmark number like "Sonnet costs $0.002 per request" is tempting but useless: it depends on whether you cache, how long your prompts are, whether you retry, and which token ratios apply to your workload. Instead, use the formula above, gather 30 days of real telemetry, compute the p50 and p95 cost per request for each feature, and use those as your baseline. Then measure whether prompt caching, smaller models, or shorter contexts actually reduce it. Real data beats industry folklore every time.