Model routing.
25 April 2026
Most production traffic doesn't need the frontier model. Routing easy queries to a smaller/cheaper model and hard queries to the frontier model is the single highest-leverage optimization in the playbook — typically a 30–50% cost reduction with no quality regression, when you do it right. Here's how we do it.
The two routing patterns that actually work
1. Classifier-up-front
A small classifier (often a fine-tuned `gpt-5-mini`, `claude-haiku`, or a local distilled BERT) reads the incoming request and predicts difficulty. Easy → cheap model. Hard → frontier model. Edge cases → frontier model.
Works well when your traffic has visibly distinct difficulty buckets (e.g. customer support: simple FAQ vs. multi-turn troubleshooting). Doesn't work when difficulty is uncorrelated with surface features.
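A minimal sketch of the classifier-up-front pattern. The model names and the `classify_difficulty()` stub are placeholders — in production the classifier is a fine-tuned mini-model or distilled BERT, not a keyword heuristic like this one.

```python
CHEAP_MODEL = "cheap-model"        # placeholder model name
FRONTIER_MODEL = "frontier-model"  # placeholder model name

def classify_difficulty(request: str) -> str:
    """Stub difficulty classifier: returns 'easy', 'hard', or 'edge'.

    Stands in for a trained classifier; this heuristic is illustrative only.
    """
    if len(request.split()) > 100 or "error" in request.lower():
        return "hard"
    return "easy"

def route(request: str) -> str:
    label = classify_difficulty(request)
    # Easy -> cheap model; hard and edge cases both go to the frontier model,
    # so uncertainty fails toward quality rather than cost.
    if label == "easy":
        return CHEAP_MODEL
    return FRONTIER_MODEL
```

The key design choice is that the ambiguous bucket escalates: misrouting an easy query to the frontier model wastes money once, while misrouting a hard query to the cheap model burns quality.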
2. Cascade with confidence
Cheap model attempts the answer. If its self-reported confidence (or a verifier check) is below threshold, escalate to the frontier model. You eat a wasted cheap-model call on the ~10–20% of traffic that escalates, and pay the frontier-model cost on only that fraction of overall traffic.
Works well when the cheap model is right most of the time and confidence is well-calibrated. Don't trust raw `logprobs` as confidence on aligned chat models — train or distill a verifier.
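The cascade can be sketched as below. `call_model` and `verifier_score` are assumptions standing in for your gateway's completion call and your trained verifier; the threshold value is illustrative and would be tuned per endpoint.

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune per endpoint against evals

def cascade(request, call_model, verifier_score):
    """Run the cheap model first; escalate to the frontier model on low confidence.

    Returns (answer, model_used) so the A/B framework can log which arm served.
    """
    draft = call_model("cheap-model", request)
    # Score with a trained verifier, not raw logprobs: logprobs on aligned
    # chat models are poorly calibrated as confidence.
    if verifier_score(request, draft) >= CONFIDENCE_THRESHOLD:
        return draft, "cheap-model"
    # Escalation path: this request pays for both calls.
    return call_model("frontier-model", request), "frontier-model"
```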
Patterns that don't work
- Round-robin or weighted random across models of different quality. You'll meet your savings target and miss your quality SLO.
- "Pick the cheapest model that passed eval" as a static decision. Real traffic distribution drifts away from your eval set within weeks.
- Routing on token count alone. Long doesn't mean hard. Short doesn't mean easy.
Quality measurement
Every router ships with a 7-day A/B against the production baseline. Quality SLOs are agreed per endpoint:
- LLM-as-judge on a held-out eval set, scored against the frontier-model output as reference.
- Production proxies — thumbs-up rate, conversation length, escalation rate, downstream task completion.
- Human spot-checks on 50–200 sampled responses per week.
If any SLO degrades meaningfully, the router auto-rolls back. There's no shipping a router without a rollback path.
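The rollback gate can be as simple as a per-metric delta check against the baseline arm. The metric names and allowed drops here are hypothetical; what counts as "meaningful" degradation is a per-endpoint agreement, not a universal constant.

```python
SLO_MAX_DROP = {              # max allowed drop vs. production baseline
    "judge_score": 0.02,      # LLM-as-judge on held-out evals
    "thumbs_up_rate": 0.01,   # production proxy
    "task_completion": 0.02,  # downstream completion rate
}

def should_rollback(baseline: dict, candidate: dict) -> bool:
    """True if any SLO metric dropped past its agreed tolerance."""
    for metric, max_drop in SLO_MAX_DROP.items():
        if baseline[metric] - candidate[metric] > max_drop:
            return True
    return False
```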
Stack
- Gateway with routing logic: LiteLLM, Helicone, or a custom proxy. We tend toward custom when the routing rules are non-trivial.
- Classifier: fine-tuned mini-model on the platform you're already on (avoids adding a new vendor).
- Verifier for cascades: distilled BERT or a fine-tuned mini-model trained on your eval traces.
- A/B framework: anything that can split by user/session and persist assignment for at least 7 days.
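Persistent assignment doesn't require an assignment store if you hash a stable key. A sketch, assuming user ID is stable across the experiment window; the bucket count and 1% default are illustrative.

```python
import hashlib

def assign_arm(user_id: str, rollout_pct: float = 1.0) -> str:
    """Deterministic, sticky A/B assignment by hashing the user ID.

    The same user always lands in the same arm, for as long as the
    rollout percentage doesn't shrink — no lookup table needed.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    # rollout_pct is in percent: 1.0 -> 100 of 10,000 buckets get routed.
    return "routed" if bucket < rollout_pct * 100 else "baseline"
```

Ramping from 1% to 10% to 50% only grows the routed bucket range, so users already in the experiment stay in it.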
Order of operations
- Identify the top 3 spend endpoints.
- For each: build a 200-row eval set with frontier-model answers as reference.
- Test 2–3 candidate cheaper models against the eval set. Keep the ones that pass quality.
- Build the classifier or cascade on the candidates that passed.
- Ship at 1% traffic. Then 10%. Then 50%. Then 100%. A/B at every step.