RESEARCH · ROUTING
Provider arbitrage.
25 April 2026
A surprising amount of LLM spend is paying retail when wholesale is on the next shelf. The same open-weights model — Llama, Mistral, DeepSeek, Qwen — is hosted on a dozen providers at meaningfully different prices. Even closed models (Claude on Bedrock, Gemini on Vertex) sit on different pricing curves than their first-party endpoints when you factor in committed-use discounts and reserved capacity.
Where the price gaps actually live
- Open-weights on managed clouds vs. specialty hosts. Llama 3.1 70B on Together or Fireworks is typically 30–60% cheaper per million tokens than Bedrock's on-demand list. Bedrock claws some of that back through reserved-capacity pricing and same-VPC inference.
- OpenRouter as a price-discovery layer. Same model, multiple backends, real-time routing to the cheapest healthy one. Useful for benchmarking what a fair price actually is.
- Claude on Bedrock vs. Anthropic direct. List prices are usually identical, but EDP / committed-use discounts on AWS materially change the effective rate. Anthropic also offers committed pricing — get both quotes.
- Gemini on Vertex vs. AI Studio. Same model name, different pricing tiers, different rate limits, and different region/quota governance.
- DeepSeek, Qwen, Mistral hosted variants. Often a 5–10× spread between the cheapest and most expensive host of the exact same checkpoint.
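Comparing hosts means comparing blended input/output token pricing against your actual traffic mix, not headline rates. A minimal sketch of that arithmetic — all prices and host names below are illustrative placeholders, not current list prices:

```python
# Blended monthly cost for the same checkpoint across hypothetical hosts.
# Prices are (input $/M tokens, output $/M tokens) — placeholders only.
PRICES = {
    "host_a": (0.90, 0.90),
    "host_b": (0.54, 0.88),
    "host_c": (0.20, 0.60),
}

def monthly_cost(host, input_mtok, output_mtok):
    """Monthly spend given millions of input/output tokens per month."""
    p_in, p_out = PRICES[host]
    return input_mtok * p_in + output_mtok * p_out

# A prompt-heavy workload: 400M input / 100M output tokens per month.
for host in sorted(PRICES):
    print(host, round(monthly_cost(host, 400, 100), 2))
```

The ranking can flip with the traffic mix: a host with cheap input but expensive output tokens wins on prompt-heavy workloads and loses on generation-heavy ones, so plug in your own ratios.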
The catches that erase the win
- Version drift. "Llama 3.1 70B Instruct" on Provider A and Provider B may be different quantizations (FP16 vs FP8 vs AWQ) and produce subtly different outputs. The cheaper provider sometimes serves a more aggressive quant. Always pin the exact version string and re-run your eval.
- Egress and routing cost. If your app runs in AWS us-east-1 and you call a provider in GCP europe-west, NAT and egress can swallow 5–15% of the savings. Co-locate or accept the tax explicitly.
- Regional availability. Bedrock and Vertex roll out models region-by-region. The model you want at the price you want may not be in the region your data residency policy allows.
- Rate limits and tail latency. Specialty hosts can have lower base rate limits and noisier p99 latency than the hyperscalers. Run a real load test, not a single-shot curl.
- Tool-calling and structured-output fidelity. Open-weights hosts implement function calling differently. The schema-conformance rate often degrades versus the source provider — your downstream JSON parser will tell you.
- Compliance. SOC 2 / HIPAA / data-processing terms vary by host. The legal review on a new provider can take longer than the savings are worth for a small endpoint.
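The structured-output catch is cheap to measure: let the JSON parser do the judging. A minimal sketch, assuming a hypothetical tool-call schema with two required keys — the sample outputs and key names are invented for illustration:

```python
import json

REQUIRED_KEYS = {"name", "quantity"}  # hypothetical tool-call schema

def conformance_rate(raw_outputs):
    """Fraction of raw model outputs that parse as JSON objects and
    carry every required key."""
    ok = 0
    for raw in raw_outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # chatty wrappers and truncation land here
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            ok += 1
    return ok / len(raw_outputs)

samples = [
    '{"name": "widget", "quantity": 3}',
    '{"name": "widget"}',                  # valid JSON, missing a key
    'Sure! Here is the JSON: {"name"...',  # chatty wrapper, fails to parse
]
print(conformance_rate(samples))
```

Run the same prompts through each candidate host and compare the rates; a few points of conformance loss can wipe out the per-token savings once you add retries.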
How to A/B for true parity
- Build a representative eval set from your own logs (anonymized). 200–500 examples across the distribution of real traffic. Include the long tail.
- Run both providers head-to-head on the same prompts, same temperature, same seed where supported. Capture full output, latency, error rate, and cost per request.
- Score with your existing quality metric — LLM-as-judge against a stronger model, exact-match on structured tasks, or a rubric the product owner trusts.
- Decide on a quality floor, not a target. "We accept any provider whose pass rate is within 1 percentage point of the incumbent's." Then take the cheapest qualifying option.
- Re-run the eval monthly. Hosts silently swap quants. Your pass rate is not a one-time measurement.
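The floor-then-cheapest decision rule above is a few lines of code once the eval has produced per-host pass rates. A sketch with invented numbers — host names, pass rates, and costs are all illustrative:

```python
# Quality floor, then cheapest qualifying host. Stats are illustrative:
# pass_rate from your eval set, cost in $ per 1k requests.
candidates = {
    "incumbent": {"pass_rate": 0.942, "cost": 12.00},
    "host_b":    {"pass_rate": 0.938, "cost": 7.50},
    "host_c":    {"pass_rate": 0.901, "cost": 4.20},
}

def pick(candidates, incumbent="incumbent", floor_pp=1.0):
    """Keep hosts within floor_pp percentage points of the incumbent's
    pass rate, then return the cheapest qualifier."""
    floor = candidates[incumbent]["pass_rate"] - floor_pp / 100
    qualifying = {k: v for k, v in candidates.items() if v["pass_rate"] >= floor}
    return min(qualifying, key=lambda k: qualifying[k]["cost"])

print(pick(candidates))  # host_c misses the floor; host_b is cheapest qualifier
```

Note the asymmetry: host_c is the cheapest absolute option but fails the floor, which is exactly the trap a "pick the cheapest" rule without a floor walks into.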
Cross-vendor swaps where quality genuinely matches
Some swaps are durable because the underlying weights are identical and the host is just renting hardware:
- Claude Sonnet on Bedrock vs. Anthropic — same weights, often a wash on price, sometimes a meaningful win on AWS commits.
- GPT-4o on Azure OpenAI vs. OpenAI — same weights, Azure adds enterprise terms; pricing is parity-ish, EA discounts can swing it.
- Llama 3.1 across Bedrock, Vertex, Together, Fireworks, Groq — same checkpoint family, but quantization and serving stack vary; eval before swapping.
Closed-model swaps across vendors (GPT to Claude to Gemini) are not arbitrage — that's model routing, and it requires a real eval because the answers will differ.
When arbitrage is and isn't worth pursuing
- Worth it: a single high-volume endpoint >$5k/mo, stable prompts, eval-able outputs.
- Not worth it: long tail of small endpoints with bespoke prompts and no shared quality metric. The eval cost dominates.
- Worth it: moving from on-demand to a committed/reserved tier with the incumbent — same provider, lower price, no eval needed.
- Maybe worth it: using a router (LiteLLM, OpenRouter) as a price-discovery and failover layer even if you don't switch primary provider.
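The worth-it question above is a break-even calculation: one-time eval and migration cost against recurring savings. A sketch with hypothetical numbers:

```python
def payback_months(monthly_spend, savings_frac, one_time_cost):
    """Months until a one-time eval/migration cost is recouped by the
    monthly savings from the cheaper host."""
    monthly_savings = monthly_spend * savings_frac
    return one_time_cost / monthly_savings

# Hypothetical: a $5k/mo endpoint, a host 40% cheaper, and $6k of
# eval-building, legal review, and migration work.
print(round(payback_months(5000, 0.40, 6000), 1))
```

This is why the long tail loses: a $300/mo endpoint with the same $6k of fixed cost pays back in years, not months, and the host will have swapped quants twice before you break even.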