RESEARCH · TECHNIQUE
Batch API routing.
25 April 2026
The Batch API is the cheapest dollar-for-dollar lever in the playbook: a flat ~50% off both input and output tokens in exchange for a 24-hour completion window. In practice, the main reason teams don't use it is that they haven't audited which of their workloads are actually latency-sensitive.
What you get
| Batch surface | Discount mechanics |
|---|---|
| OpenAI Batch | ~50% off input + output · 24-hour SLA · JSONL upload · most chat/completions/embeddings models |
| Anthropic Message Batches | ~50% off input + output · 24-hour SLA · up to 100k requests / 256 MB per batch · all current Claude models |
| Bedrock batch inference | ~50% off · async S3 in/out · region + model coverage varies |
| Vertex batch prediction | ~50% off list price for Gemini · BigQuery or GCS in/out |
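To put the table in dollars, a minimal sketch of the before/after arithmetic. The per-token prices below are illustrative placeholders, not any provider's actual rates:

```python
def job_cost(n_requests, in_tokens, out_tokens,
             in_price_per_m, out_price_per_m, batch=False):
    """Estimate job cost in dollars; batch halves both token prices."""
    discount = 0.5 if batch else 1.0
    per_request = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6
    return n_requests * per_request * discount

# Hypothetical rates: $3/M input, $15/M output, 100k requests.
sync_cost = job_cost(100_000, 2_000, 500, 3.0, 15.0)
batch_cost = job_cost(100_000, 2_000, 500, 3.0, 15.0, batch=True)
```

Because the discount applies to input and output alike, the batch figure is exactly half the sync figure regardless of the input/output mix.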
How it actually works
You upload a file of requests; the provider runs them when capacity is available; you poll (or receive a webhook) for the result file. Prompt caching, function calling, and structured outputs are typically supported. Streaming is not: batch is asynchronous by definition. Individual requests that fail come back in an error file alongside the success file.
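The request file itself is plain JSONL. A sketch of building one in the OpenAI Batch request format, where each line carries a `custom_id` for matching results back to inputs (the model name and prompts are placeholders):

```python
import json

def build_batch_file(prompts, model="gpt-4o-mini", max_tokens=512):
    """Serialize prompts into OpenAI Batch JSONL: one request per line,
    each tagged with a custom_id (results can come back out of order)."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

From there, `client.files.create(file=..., purpose="batch")` uploads the file, `client.batches.create(input_file_id=..., endpoint="/v1/chat/completions", completion_window="24h")` submits the job, and you poll `client.batches.retrieve(batch_id)` until `status` is `completed`, then download both the `output_file_id` and the `error_file_id`.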
Workloads that should be on batch today
- Eval pipelines. Regression tests, LLM-as-judge runs, golden-set scoring. These are the canonical batch case: you don't care whether results land in 3 minutes or 3 hours; you care about the bill.
- Data enrichment. Tagging, classification, entity extraction over a backlog of records. If it's a one-time job over a million rows, it belongs in batch.
- Content generation at scale. Bulk product descriptions, alt text, translations, marketing variants. Anything where a human reviews the output later anyway.
- Nightly summarization. Daily digests, account-level recaps, weekly reports. Schedule the batch to start at 22:00; results are ready before the morning email goes out.
- Embeddings backfills. Reindexing a corpus, migrating embedding models. Long-tail volume that doesn't need to land synchronously.
- Synthetic data and fine-tune dataset prep. Generating training pairs, paraphrases, instruction variants.
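For backlog jobs like the enrichment and embeddings cases above, the main mechanical step is splitting the backlog into submissions that fit per-batch caps. A sketch using the Anthropic-style limits from the table; check your provider's actual caps before relying on these numbers:

```python
import json

def plan_batches(requests, max_requests=100_000, max_bytes=256 * 1024 * 1024):
    """Split a backlog of request dicts into submissions that respect
    per-batch request-count and file-size caps."""
    batches, current, current_bytes = [], [], 0
    for req in requests:
        size = len(json.dumps(req).encode()) + 1  # +1 for the newline
        if current and (len(current) >= max_requests
                        or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(req)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Submitting the resulting chunks as independent batches also limits the blast radius when one file fails validation.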
Anti-patterns: do not route to batch
- User-facing endpoints. Anything where a human is waiting on a screen.
- Interactive agents and chat. The 24-hour window is the only guarantee, not a target: your batch could land in 8 minutes or in 22 hours.
- Tool-calling loops where one call's output feeds the next call's input. Requests within a batch run independently, so a single batch can't express multi-step dependencies.
- Anything tied to a webhook fan-out with a tight timeout (Stripe, GitHub, etc.).
- "Nice to have" real-time UX. If a PM can't articulate why the user needs the result within 2 seconds, ask before assuming it has to stay synchronous.
Practical migration order
- Pull last 30 days of usage by endpoint or job-name. Sort by spend.
- For the top 10, ask the owner: "would 24 hours later be fine?" In our audits, the answer is yes for 30–50% of spend.
- Migrate one endpoint at a time. Keep a fallback to sync for the long tail of items where 24 hours is genuinely too long.
- Reconcile the next invoice. Batch line items appear separately; confirm the discount actually landed.
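Steps 1 and 2 are a few lines of arithmetic over a usage export. A sketch, assuming rows of `(job_name, dollars)` pulled from the last 30 days:

```python
from collections import defaultdict

def rank_by_spend(usage_rows):
    """Aggregate 30-day spend per job and sort descending.
    This is the audit order for the 'would 24 hours later be fine?' question."""
    totals = defaultdict(float)
    for job, dollars in usage_rows:
        totals[job] += dollars
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def projected_savings(ranked, batchable):
    """Batch halves spend on the jobs whose owners signed off."""
    return sum(spend * 0.5 for job, spend in ranked if job in batchable)
```

Run the savings projection before migrating anything; it tells you whether the top-10 conversation is worth having at all.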
Subtleties that bite
- Batch rate limits are separate from synchronous ones. You can typically queue far more, but per-batch request and token caps still apply.
- Cost reports don't auto-attribute batch lines to the right cost center unless you tag them in metadata.
- Cache hits don't compound with batch discounts on every provider — read the small print before modeling savings.
- Output token estimates are best-guess. Set a reasonable `max_tokens` ceiling; batch failures from one runaway request can hold up the whole file.
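The last two points can be enforced mechanically before a request ever enters a batch file. A sketch; the `metadata` field name is illustrative, so attach tags wherever your cost reporting actually reads them:

```python
def cap_and_tag(body, cost_center, max_tokens_ceiling=1024):
    """Apply defensive defaults to a batch request body: clamp max_tokens
    so one runaway request can't blow the output budget, and attach a
    cost-center tag so the batch line items reconcile to the right team."""
    body = dict(body)  # don't mutate the caller's dict
    body["max_tokens"] = min(body.get("max_tokens", max_tokens_ceiling),
                             max_tokens_ceiling)
    tags = dict(body.get("metadata", {}))
    tags["cost_center"] = cost_center
    body["metadata"] = tags
    return body
```

Running every request through a gate like this is cheap insurance: the clamp bounds worst-case output spend, and the tag is what makes the invoice reconciliation step in the migration checklist actually possible.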