Model routing.
25 April 2026
Most production traffic doesn't need the frontier model. Routing easy queries to a smaller/cheaper model and hard queries to the frontier model is the single highest-leverage optimization in the playbook — typically a 30–50% cost reduction with no quality regression, when you do it right. Here's how we do it.
The two routing patterns that actually work
1. Classifier-up-front
A small classifier (often a fine-tuned `gpt-5-mini`, `claude-haiku`, or a local distilled BERT) reads the incoming request and predicts difficulty. Easy → cheap model. Hard → frontier model. Edge cases → frontier model.
Works well when your traffic has visibly distinct difficulty buckets (e.g. customer support: simple FAQ vs. multi-turn troubleshooting). Doesn't work when difficulty is uncorrelated with surface features.
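A minimal sketch of the classifier-up-front pattern. The model names and the `classify_difficulty()` stub are placeholders — in production the classifier is a fine-tuned mini-model or distilled BERT, not a keyword heuristic like this one.

```python
CHEAP_MODEL = "cheap-model"        # placeholder model name
FRONTIER_MODEL = "frontier-model"  # placeholder model name

def classify_difficulty(request: str) -> str:
    """Stub difficulty classifier: returns 'easy', 'hard', or 'edge'.

    Stands in for a trained classifier; this heuristic is illustrative only.
    """
    if len(request.split()) > 100 or "error" in request.lower():
        return "hard"
    return "easy"

def route(request: str) -> str:
    label = classify_difficulty(request)
    # Easy -> cheap model; hard and edge cases both go to the frontier model,
    # so uncertainty fails toward quality rather than cost.
    if label == "easy":
        return CHEAP_MODEL
    return FRONTIER_MODEL
```

The key design choice is that the ambiguous bucket escalates: misrouting an easy query to the frontier model wastes money once, while misrouting a hard query to the cheap model burns quality.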
2. Cascade with confidence
Cheap model attempts the answer. If its self-reported confidence (or a verifier check) is below threshold, escalate to the frontier model. You eat a wasted cheap-model call on the ~10–20% of traffic that escalates, and pay the frontier-model cost on only that fraction of overall traffic.
Works well when the cheap model is right most of the time and confidence is well-calibrated. Don't trust raw `logprobs` as confidence on aligned chat models — train or distill a verifier.
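The cascade can be sketched as below. `call_model` and `verifier_score` are assumptions standing in for your gateway's completion call and your trained verifier; the threshold value is illustrative and would be tuned per endpoint.

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune per endpoint against evals

def cascade(request, call_model, verifier_score):
    """Run the cheap model first; escalate to the frontier model on low confidence.

    Returns (answer, model_used) so the A/B framework can log which arm served.
    """
    draft = call_model("cheap-model", request)
    # Score with a trained verifier, not raw logprobs: logprobs on aligned
    # chat models are poorly calibrated as confidence.
    if verifier_score(request, draft) >= CONFIDENCE_THRESHOLD:
        return draft, "cheap-model"
    # Escalation path: this request pays for both calls.
    return call_model("frontier-model", request), "frontier-model"
```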
Patterns that don't work
- Round-robin or weighted random across models of different quality. You'll meet your savings target and miss your quality SLO.
- "Pick the cheapest model that passed eval" as a static decision. Real traffic distribution drifts away from your eval set within weeks.
- Routing on token count alone. Long doesn't mean hard. Short doesn't mean easy.
Quality measurement
Every router ships with a 7-day A/B against the production baseline. Quality SLOs are agreed per endpoint:
- LLM-as-judge on a held-out eval set, scored against the frontier-model output as reference.
- Production proxies — thumbs-up rate, conversation length, escalation rate, downstream task completion.
- Human spot-checks on 50–200 sampled responses per week.
If any SLO degrades meaningfully, the router auto-rolls back. There's no shipping a router without a rollback path.
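The rollback gate can be as simple as a per-metric delta check against the baseline arm. The metric names and allowed drops here are hypothetical; what counts as "meaningful" degradation is a per-endpoint agreement, not a universal constant.

```python
SLO_MAX_DROP = {              # max allowed drop vs. production baseline
    "judge_score": 0.02,      # LLM-as-judge on held-out evals
    "thumbs_up_rate": 0.01,   # production proxy
    "task_completion": 0.02,  # downstream completion rate
}

def should_rollback(baseline: dict, candidate: dict) -> bool:
    """True if any SLO metric dropped past its agreed tolerance."""
    for metric, max_drop in SLO_MAX_DROP.items():
        if baseline[metric] - candidate[metric] > max_drop:
            return True
    return False
```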
Stack
- Gateway with routing logic: LiteLLM, Helicone, or a custom proxy. We tend toward custom when the routing rules are non-trivial.
- Classifier: fine-tuned mini-model on the platform you're already on (avoids adding a new vendor).
- Verifier for cascades: distilled BERT or a fine-tuned mini-model trained on your eval traces.
- A/B framework: anything that can split by user/session and persist assignment for at least 7 days.
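Persistent assignment doesn't require an assignment store if you hash a stable key. A sketch, assuming user ID is stable across the experiment window; the bucket count and 1% default are illustrative.

```python
import hashlib

def assign_arm(user_id: str, rollout_pct: float = 1.0) -> str:
    """Deterministic, sticky A/B assignment by hashing the user ID.

    The same user always lands in the same arm, for as long as the
    rollout percentage doesn't shrink — no lookup table needed.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    # rollout_pct is in percent: 1.0 -> 100 of 10,000 buckets get routed.
    return "routed" if bucket < rollout_pct * 100 else "baseline"
```

Ramping from 1% to 10% to 50% only grows the routed bucket range, so users already in the experiment stay in it.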
Order of operations
- Identify the top 3 spend endpoints.
- For each: build a 200-row eval set with frontier-model answers as reference.
- Test 2–3 candidate cheaper models against the eval set. Keep the ones that pass quality.
- Build the classifier or cascade on the candidates that passed.
- Ship at 1% traffic. Then 10%. Then 50%. Then 100%. A/B at every step.