RESEARCH · ROUTING
Provider arbitrage.
25 April 2026
A surprising amount of LLM spend is paying retail when wholesale is on the next shelf. The same open-weights model — Llama, Mistral, DeepSeek, Qwen — is hosted on a dozen providers at meaningfully different prices. Even closed models (Claude on Bedrock, Gemini on Vertex) sit on different pricing curves than their first-party endpoints when you factor in committed-use discounts and reserved capacity.
Where the price gaps actually live
- Open-weights on managed clouds vs. specialty hosts. Llama 3.1 70B on Together or Fireworks is typically 30–60% cheaper per million tokens than Bedrock's on-demand list. Bedrock claws some of that back through reserved-capacity pricing and same-VPC inference.
- OpenRouter as a price-discovery layer. Same model, multiple backends, real-time routing to the cheapest healthy one. Useful for benchmarking what a fair price actually is.
- Claude on Bedrock vs. Anthropic direct. List prices are usually identical, but EDP / committed-use discounts on AWS materially change the effective rate. Anthropic also offers committed pricing — get both quotes.
- Gemini on Vertex vs. AI Studio. Same model name, different pricing tiers, different rate limits, and different region/quota governance.
- DeepSeek, Qwen, Mistral hosted variants. Often a 5–10× spread between the cheapest and most expensive host of the exact same checkpoint.
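Comparing hosts means comparing blended input/output token pricing against your actual traffic mix, not headline rates. A minimal sketch of that arithmetic — all prices and host names below are illustrative placeholders, not current list prices:

```python
# Blended monthly cost for the same checkpoint across hypothetical hosts.
# Prices are (input $/M tokens, output $/M tokens) — placeholders only.
PRICES = {
    "host_a": (0.90, 0.90),
    "host_b": (0.54, 0.88),
    "host_c": (0.20, 0.60),
}

def monthly_cost(host, input_mtok, output_mtok):
    """Monthly spend given millions of input/output tokens per month."""
    p_in, p_out = PRICES[host]
    return input_mtok * p_in + output_mtok * p_out

# A prompt-heavy workload: 400M input / 100M output tokens per month.
for host in sorted(PRICES):
    print(host, round(monthly_cost(host, 400, 100), 2))
```

The ranking can flip with the traffic mix: a host with cheap input but expensive output tokens wins on prompt-heavy workloads and loses on generation-heavy ones, so plug in your own ratios.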
The catches that erase the win
- Version drift. "Llama 3.1 70B Instruct" on Provider A and Provider B may be different quantizations (FP16 vs FP8 vs AWQ) and produce subtly different outputs. The cheaper provider sometimes serves a more aggressive quant. Always pin the exact version string and re-run your eval.
- Egress and routing cost. If your app runs in AWS us-east-1 and you call a provider in GCP europe-west, NAT and egress can swallow 5–15% of the savings. Co-locate or accept the tax explicitly.
- Regional availability. Bedrock and Vertex roll out models region-by-region. The model you want at the price you want may not be in the region your data residency policy allows.
- Rate limits and tail latency. Specialty hosts can have lower base rate limits and noisier p99 latency than the hyperscalers. Run a real load test, not a single-shot curl.
- Tool-calling and structured-output fidelity. Open-weights hosts implement function calling differently. The schema-conformance rate often degrades versus the source provider — your downstream JSON parser will tell you.
- Compliance. SOC 2 / HIPAA / data-processing terms vary by host. The legal review on a new provider can take longer than the savings are worth for a small endpoint.
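The structured-output catch is cheap to measure: let the JSON parser do the judging. A minimal sketch, assuming a hypothetical tool-call schema with two required keys — the sample outputs and key names are invented for illustration:

```python
import json

REQUIRED_KEYS = {"name", "quantity"}  # hypothetical tool-call schema

def conformance_rate(raw_outputs):
    """Fraction of raw model outputs that parse as JSON objects and
    carry every required key."""
    ok = 0
    for raw in raw_outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # chatty wrappers and truncation land here
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            ok += 1
    return ok / len(raw_outputs)

samples = [
    '{"name": "widget", "quantity": 3}',
    '{"name": "widget"}',                  # valid JSON, missing a key
    'Sure! Here is the JSON: {"name"...',  # chatty wrapper, fails to parse
]
print(conformance_rate(samples))
```

Run the same prompts through each candidate host and compare the rates; a few points of conformance loss can wipe out the per-token savings once you add retries.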
How to A/B for true parity
- Build a representative eval set from your own logs (anonymized). 200–500 examples across the distribution of real traffic. Include the long tail.
- Run both providers head-to-head on the same prompts, same temperature, same seed where supported. Capture full output, latency, error rate, and cost per request.
- Score with your existing quality metric — LLM-as-judge against a stronger model, exact-match on structured tasks, or a rubric the product owner trusts.
- Decide on a quality floor, not a target. "We accept any provider whose pass rate is within 1 percentage point of the incumbent's." Then take the cheapest qualifying option.
- Re-run the eval monthly. Hosts silently swap quants. Your pass rate is not a one-time measurement.
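The floor-then-cheapest decision rule above is a few lines of code once the eval has produced per-host pass rates. A sketch with invented numbers — host names, pass rates, and costs are all illustrative:

```python
# Quality floor, then cheapest qualifying host. Stats are illustrative:
# pass_rate from your eval set, cost in $ per 1k requests.
candidates = {
    "incumbent": {"pass_rate": 0.942, "cost": 12.00},
    "host_b":    {"pass_rate": 0.938, "cost": 7.50},
    "host_c":    {"pass_rate": 0.901, "cost": 4.20},
}

def pick(candidates, incumbent="incumbent", floor_pp=1.0):
    """Keep hosts within floor_pp percentage points of the incumbent's
    pass rate, then return the cheapest qualifier."""
    floor = candidates[incumbent]["pass_rate"] - floor_pp / 100
    qualifying = {k: v for k, v in candidates.items() if v["pass_rate"] >= floor}
    return min(qualifying, key=lambda k: qualifying[k]["cost"])

print(pick(candidates))  # host_c misses the floor; host_b is cheapest qualifier
```

Note the asymmetry: host_c is the cheapest absolute option but fails the floor, which is exactly the trap a "pick the cheapest" rule without a floor walks into.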
Cross-vendor swaps where quality genuinely matches
Some swaps are durable because the underlying weights are identical and the host is just renting hardware:
- Claude Sonnet on Bedrock vs. Anthropic — same weights, often a wash on price, sometimes a meaningful win on AWS commits.
- GPT-4o on Azure OpenAI vs. OpenAI — same weights, Azure adds enterprise terms; pricing is parity-ish, EA discounts can swing it.
- Llama 3.1 across Bedrock, Vertex, Together, Fireworks, Groq — same checkpoint family, but quantization and serving stack vary; eval before swapping.
Closed-model swaps across vendors (GPT to Claude to Gemini) are not arbitrage — that's model routing, and it requires a real eval because the answers will differ.
When arbitrage is and isn't worth pursuing
- Worth it: a single high-volume endpoint >$5k/mo, stable prompts, eval-able outputs.
- Not worth it: long tail of small endpoints with bespoke prompts and no shared quality metric. The eval cost dominates.
- Worth it: moving from on-demand to a committed/reserved tier with the incumbent — same provider, lower price, no eval needed.
- Maybe worth it: using a router (LiteLLM, OpenRouter) as a price-discovery and failover layer even if you don't switch primary provider.
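The worth-it question above is a break-even calculation: one-time eval and migration cost against recurring savings. A sketch with hypothetical numbers:

```python
def payback_months(monthly_spend, savings_frac, one_time_cost):
    """Months until a one-time eval/migration cost is recouped by the
    monthly savings from the cheaper host."""
    monthly_savings = monthly_spend * savings_frac
    return one_time_cost / monthly_savings

# Hypothetical: a $5k/mo endpoint, a host 40% cheaper, and $6k of
# eval-building, legal review, and migration work.
print(round(payback_months(5000, 0.40, 6000), 1))
```

This is why the long tail loses: a $300/mo endpoint with the same $6k of fixed cost pays back in years, not months, and the host will have swapped quants twice before you break even.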