RESEARCH · TECHNIQUE
Batch API routing.
25 April 2026
The Batch API is the cheapest dollar-for-dollar lever in the playbook: a flat ~50% off both input and output tokens in exchange for a 24-hour completion window. In practice, the main reason teams don't use it is that they haven't audited which of their workloads are actually latency-sensitive.
What you get
| Batch surface | Discount mechanics |
|---|---|
| OpenAI Batch | ~50% off input + output · 24-hour SLA · JSONL upload · most chat/completions/embeddings models |
| Anthropic Message Batches | ~50% off input + output · 24-hour SLA · up to 100k requests / 256 MB per batch · all current Claude models |
| Bedrock batch inference | ~50% off · async S3 in/out · region + model coverage varies |
| Vertex batch prediction | ~50% off list price for Gemini · BigQuery or GCS in/out |
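To put the table in dollars, a minimal sketch of the before/after arithmetic. The per-token prices below are illustrative placeholders, not any provider's actual rates:

```python
def job_cost(n_requests, in_tokens, out_tokens,
             in_price_per_m, out_price_per_m, batch=False):
    """Estimate job cost in dollars; batch halves both token prices."""
    discount = 0.5 if batch else 1.0
    per_request = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6
    return n_requests * per_request * discount

# Hypothetical rates: $3/M input, $15/M output, 100k requests.
sync_cost = job_cost(100_000, 2_000, 500, 3.0, 15.0)
batch_cost = job_cost(100_000, 2_000, 500, 3.0, 15.0, batch=True)
```

Because the discount applies to input and output alike, the batch figure is exactly half the sync figure regardless of the input/output mix.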
How it actually works
You upload a file of requests; the provider runs them when capacity is available; you poll (or receive a webhook) for the result file. Prompt caching, function calling, and structured outputs are typically supported. Streaming is not: batch is asynchronous by definition. Individual requests that fail come back in an error file alongside the success file.
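The request file itself is plain JSONL. A sketch of building one in the OpenAI Batch request format, where each line carries a `custom_id` for matching results back to inputs (the model name and prompts are placeholders):

```python
import json

def build_batch_file(prompts, model="gpt-4o-mini", max_tokens=512):
    """Serialize prompts into OpenAI Batch JSONL: one request per line,
    each tagged with a custom_id (results can come back out of order)."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

From there, `client.files.create(file=..., purpose="batch")` uploads the file, `client.batches.create(input_file_id=..., endpoint="/v1/chat/completions", completion_window="24h")` submits the job, and you poll `client.batches.retrieve(batch_id)` until `status` is `completed`, then download both the `output_file_id` and the `error_file_id`.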
Workloads that should be on batch today
- Eval pipelines. Regression tests, LLM-as-judge runs, golden-set scoring. These are the canonical batch case: you don't care whether results land in 3 minutes or 3 hours; you care about the bill.
- Data enrichment. Tagging, classification, entity extraction over a backlog of records. If it's a one-time job over a million rows, it belongs in batch.
- Content generation at scale. Bulk product descriptions, alt text, translations, marketing variants. Anything where a human reviews the output later anyway.
- Nightly summarization. Daily digests, account-level recaps, weekly reports. Schedule the batch to start at 22:00; results are ready before the morning email goes out.
- Embeddings backfills. Reindexing a corpus, migrating embedding models. Long-tail volume that doesn't need to land synchronously.
- Synthetic data and fine-tune dataset prep. Generating training pairs, paraphrases, instruction variants.
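For backlog jobs like the enrichment and embeddings cases above, the main mechanical step is splitting the backlog into submissions that fit per-batch caps. A sketch using the Anthropic-style limits from the table; check your provider's actual caps before relying on these numbers:

```python
import json

def plan_batches(requests, max_requests=100_000, max_bytes=256 * 1024 * 1024):
    """Split a backlog of request dicts into submissions that respect
    per-batch request-count and file-size caps."""
    batches, current, current_bytes = [], [], 0
    for req in requests:
        size = len(json.dumps(req).encode()) + 1  # +1 for the newline
        if current and (len(current) >= max_requests
                        or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(req)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Submitting the resulting chunks as independent batches also limits the blast radius when one file fails validation.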
Anti-patterns: do not route to batch
- User-facing endpoints. Anything where a human is waiting on a screen.
- Interactive agents and chat. The 24-hour window is the only guarantee, not a target: your batch could land in 8 minutes or in 22 hours.
- Tool-calling loops where one call's output feeds the next call's input. Requests within a batch run independently, so a single batch can't express multi-step dependencies.
- Anything tied to a webhook fan-out with a tight timeout (Stripe, GitHub, etc.).
- "Nice to have" real-time UX. If a PM can't articulate why the user needs the result within 2 seconds, ask before assuming it has to stay synchronous.
Practical migration order
- Pull last 30 days of usage by endpoint or job-name. Sort by spend.
- For the top 10, ask the owner: "would 24 hours later be fine?" In our audits, the answer is yes for 30–50% of spend.
- Migrate one endpoint at a time. Keep a fallback to sync for the long tail of items where 24 hours is genuinely too long.
- Reconcile the next invoice. Batch line items appear separately; confirm the discount actually landed.
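Steps 1 and 2 are a few lines of arithmetic over a usage export. A sketch, assuming rows of `(job_name, dollars)` pulled from the last 30 days:

```python
from collections import defaultdict

def rank_by_spend(usage_rows):
    """Aggregate 30-day spend per job and sort descending.
    This is the audit order for the 'would 24 hours later be fine?' question."""
    totals = defaultdict(float)
    for job, dollars in usage_rows:
        totals[job] += dollars
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def projected_savings(ranked, batchable):
    """Batch halves spend on the jobs whose owners signed off."""
    return sum(spend * 0.5 for job, spend in ranked if job in batchable)
```

Run the savings projection before migrating anything; it tells you whether the top-10 conversation is worth having at all.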
Subtleties that bite
- Batch rate limits are separate from synchronous ones. You can typically queue far more, but per-batch request and token caps still apply.
- Cost reports don't auto-attribute batch lines to the right cost center unless you tag them in metadata.
- Cache hits don't compound with batch discounts on every provider — read the small print before modeling savings.
- Output token estimates are best-guess. Set a reasonable `max_tokens` ceiling; batch failures from one runaway request can hold up the whole file.
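The last two points can be enforced mechanically before a request ever enters a batch file. A sketch; the `metadata` field name is illustrative, so attach tags wherever your cost reporting actually reads them:

```python
def cap_and_tag(body, cost_center, max_tokens_ceiling=1024):
    """Apply defensive defaults to a batch request body: clamp max_tokens
    so one runaway request can't blow the output budget, and attach a
    cost-center tag so the batch line items reconcile to the right team."""
    body = dict(body)  # don't mutate the caller's dict
    body["max_tokens"] = min(body.get("max_tokens", max_tokens_ceiling),
                             max_tokens_ceiling)
    tags = dict(body.get("metadata", {}))
    tags["cost_center"] = cost_center
    body["metadata"] = tags
    return body
```

Running every request through a gate like this is cheap insurance: the clamp bounds worst-case output spend, and the tag is what makes the invoice reconciliation step in the migration checklist actually possible.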