RESEARCH · ROUTING

Provider arbitrage.

25 April 2026

By the LLM CFO team

A surprising amount of LLM spend is paying retail when wholesale is on the next shelf. The same open-weights model — Llama, Mistral, DeepSeek, Qwen — is hosted on a dozen providers at meaningfully different prices. Even closed models (Claude on Bedrock, Gemini on Vertex) sit on different pricing curves than their first-party endpoints when you factor in committed-use discounts and reserved capacity.
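To make the gap concrete, here is a minimal sketch that blends a provider's input and output token prices into a per-request cost. All rate cards and the traffic profile are hypothetical placeholders; substitute your own providers' published prices and your measured token counts.

```python
# Compare effective per-request cost of the same open-weights model on
# different hosts. All prices below are hypothetical placeholders.

def cost_per_request(in_price_per_m, out_price_per_m, in_tokens, out_tokens):
    """Blended cost of one request, given $/1M-token input and output prices."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

# Hypothetical rate cards for the same model weights on three hosts.
providers = {
    "host_a": (0.90, 0.90),  # ($/1M input tokens, $/1M output tokens)
    "host_b": (0.54, 0.88),
    "host_c": (0.35, 0.40),
}

# Your traffic profile: average tokens per request, measured from logs.
AVG_IN, AVG_OUT = 1800, 350

for name, (p_in, p_out) in sorted(
    providers.items(),
    key=lambda kv: cost_per_request(*kv[1], AVG_IN, AVG_OUT),
):
    c = cost_per_request(p_in, p_out, AVG_IN, AVG_OUT)
    print(f"{name}: ${c:.6f}/request  (${c * 1e6:,.0f} per 1M requests)")
```

Note that the ranking depends on your input/output mix: a host that is cheap on input but expensive on output can flip positions for generation-heavy workloads.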

Where the price gaps actually live

The catches that erase the win

  1. Version drift. "Llama 3.1 70B Instruct" on Provider A and Provider B may be different quantizations (FP16 vs FP8 vs AWQ) and produce subtly different outputs. The cheaper provider sometimes serves a more aggressive quant. Always pin the exact version string and re-run your eval.
  2. Egress and routing cost. If your app runs in AWS us-east-1 and you call a provider in GCP europe-west, NAT and egress can swallow 5–15% of the savings. Co-locate or accept the tax explicitly.
  3. Regional availability. Bedrock and Vertex roll out models region-by-region. The model you want at the price you want may not be in the region your data residency policy allows.
  4. Rate limits and tail latency. Specialty hosts can have lower base rate limits and noisier p99 latency than the hyperscalers. Run a real load test, not a single-shot curl.
  5. Tool-calling and structured-output fidelity. Open-weights hosts implement function calling differently. The schema-conformance rate often degrades versus the source provider — your downstream JSON parser will tell you.
  6. Compliance. SOC 2 / HIPAA / data-processing terms vary by host. For a small endpoint, the legal review of a new provider can cost more than the savings justify.
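Catch #5 is cheap to measure directly. A minimal sketch of a schema-conformance spot check: parse each raw output as JSON and count how many contain the keys downstream code expects. The required keys and the sample outputs here are hypothetical; use your real schema and real captured outputs.

```python
import json

# Schema-conformance spot check: what fraction of a host's raw outputs
# parse as JSON and contain the keys downstream code expects?
# REQUIRED_KEYS and the sample outputs are hypothetical.

REQUIRED_KEYS = {"intent", "confidence"}

def conformance_rate(raw_outputs):
    """Fraction of outputs that are valid JSON objects with all required keys."""
    ok = 0
    for raw in raw_outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            ok += 1
    return ok / len(raw_outputs) if raw_outputs else 0.0

samples = [
    '{"intent": "refund", "confidence": 0.92}',
    '{"intent": "refund"}',                     # missing a required key
    'Sure! Here is the JSON you asked for.',    # chatty wrapper, won't parse
]
print(f"conformance: {conformance_rate(samples):.0%}")
```

Run this on the same prompts against both hosts; a gap of even a few percentage points compounds into a visible error-rate bump in production.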

How to A/B for true parity

  1. Build a representative eval set from your own logs (anonymized). 200–500 examples across the distribution of real traffic. Include the long tail.
  2. Run both providers head-to-head on the same prompts, same temperature, same seed where supported. Capture full output, latency, error rate, and cost per request.
  3. Score with your existing quality metric — LLM-as-judge against a stronger model, exact-match on structured tasks, or a rubric the product owner trusts.
  4. Decide on a quality floor, not a target. "We accept any provider whose pass rate is within 1 percentage point of incumbent." Then take the cheapest qualifying option.
  5. Re-run the eval monthly. Hosts silently swap quants. Your pass rate is not a one-time measurement.
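The floor-not-target decision in step 4 reduces to a few lines. A sketch, where the pass rates and prices are hypothetical outputs of the eval in steps 1–3:

```python
# Pick the cheapest provider within the quality floor (step 4).
# Pass rates and per-1k-request costs below are hypothetical eval outputs.

FLOOR_PP = 1.0  # accept any provider within 1 percentage point of incumbent

candidates = {
    # name: (pass_rate_pct, cost_per_1k_requests_usd)
    "incumbent": (94.0, 4.10),
    "host_b":    (93.6, 2.75),
    "host_c":    (91.8, 1.90),  # cheapest, but below the floor
}

incumbent_rate = candidates["incumbent"][0]
qualifying = {
    name: (rate, cost)
    for name, (rate, cost) in candidates.items()
    if rate >= incumbent_rate - FLOOR_PP
}

winner = min(qualifying, key=lambda n: qualifying[n][1])
print(f"winner: {winner} at ${qualifying[winner][1]:.2f}/1k requests")
```

The point of the floor is that it makes the decision mechanical: the cheapest qualifying host wins, and the monthly re-run in step 5 can demote it automatically if its pass rate drifts below the floor.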

Cross-vendor swaps where quality genuinely matches

Some swaps are durable because the underlying weights are identical and the host is just renting hardware: the same checkpoint at the same quantization, sold at a different margin. Verify the quantization claim yourself, since that is the main way "identical" quietly stops being true.

Closed-model swaps across vendors (GPT to Claude to Gemini) are not arbitrage — that's model routing, and it requires a real eval because the answers will differ.

When arbitrage is and isn't worth pursuing
