RESEARCH · TECHNIQUE

Batch API routing.

25 April 2026

By the LLM CFO team

The Batch API is the cheapest dollar-for-dollar lever in the playbook: a flat ~50% off both input and output tokens in exchange for a 24-hour completion window. The only reason teams don't use it is that they haven't audited which of their workloads are actually latency-sensitive.

What you get

| Batch surface | Discount mechanics |
| --- | --- |
| OpenAI Batch | ~50% off input + output · 24-hour SLA · JSONL upload · most chat/completions/embeddings models |
| Anthropic Message Batches | ~50% off input + output · 24-hour SLA · up to 100k requests / 256 MB per batch · all current Claude models |
| Bedrock batch inference | ~50% off · async S3 in/out · region + model coverage varies |
| Vertex batch prediction | ~50% off list price for Gemini · BigQuery or GCS in/out |

How it actually works

You upload a file of requests; the provider runs them when capacity is available; you poll or webhook for the result file. Caching, function calling, and structured outputs are typically supported. Streaming is not — by definition, batch is async. Failed individual requests come back in an error file alongside the success file.
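The upload-then-poll flow can be sketched against the OpenAI-style JSONL request format. This is a sketch under assumptions: the model name and `custom_id` scheme are illustrative, the upload/poll calls are shown as comments because they need an API key, and Anthropic's batch request schema differs.

```python
import json

def build_batch_lines(prompts, model="gpt-4o-mini"):
    """Build JSONL lines for a /v1/chat/completions batch file.

    custom_id is how you join results back to inputs later --
    the output file is not guaranteed to preserve input order.
    """
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return lines

# Upload-and-poll flow (requires the `openai` package and an API key):
#   client = OpenAI()
#   f = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
#   batch = client.batches.create(input_file_id=f.id,
#                                 endpoint="/v1/chat/completions",
#                                 completion_window="24h")
#   ...poll client.batches.retrieve(batch.id) until status == "completed",
#   then download batch.output_file_id (and error_file_id, if present).
```

Writing the lines to disk and uploading is the whole integration; everything else is your existing request-building code.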

Workloads that should be on batch today

  1. Eval pipelines. Regression tests, LLM-as-judge runs, golden-set scoring. These are the canonical batch case — you don't care if results land in 3 minutes or 3 hours, you care about the bill.
  2. Data enrichment. Tagging, classification, entity extraction over a backlog of records. If it's a one-time job over a million rows, it belongs in batch.
  3. Content generation at scale. Bulk product descriptions, alt text, translations, marketing variants. Anything where a human reviews the output later anyway.
  4. Nightly summarization. Daily digests, account-level recaps, weekly reports. Schedule the batch to start at 22:00; results are usually back well before the morning email goes out (the 24-hour window is a ceiling, not the typical turnaround).
  5. Embeddings backfills. Reindexing a corpus, migrating embedding models. Long-tail volume that doesn't need to land synchronously.
  6. Synthetic data and fine-tune dataset prep. Generating training pairs, paraphrases, instruction variants.
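The routing decision the list implies can be written down as a policy function. A minimal sketch: the `Job` fields and the threshold are hypothetical, not from any provider's API.

```python
from dataclasses import dataclass

# Illustrative constant: the batch completion window in seconds.
BATCH_WINDOW_S = 24 * 3600

@dataclass
class Job:
    name: str
    latency_budget_s: float  # how stale may the result be?
    user_facing: bool        # is someone blocked on the response?

def route(job: Job) -> str:
    """Route a job to 'batch' or 'sync' (hypothetical policy).

    Anything user-facing stays sync; otherwise batch is safe only
    when the job can tolerate the full completion window.
    """
    if job.user_facing:
        return "sync"
    return "batch" if job.latency_budget_s >= BATCH_WINDOW_S else "sync"
```

Under this policy, eval runs and nightly digests route to batch; interactive chat and anything with a sub-day freshness requirement stay sync.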

Anti-patterns: do not route to batch

Anything a user or downstream system is actively blocked on: interactive chat, agent steps mid-session, classification in a live request path. If something is waiting on the response, the 50% discount does not buy back the latency; keep it sync and pay list price.

Practical migration order

  1. Pull last 30 days of usage by endpoint or job-name. Sort by spend.
  2. For the top 10, ask the owner: "would 24 hours later be fine?" In our audits, the answer is yes for 30–50% of spend.
  3. Migrate one endpoint at a time. Keep a fallback to sync for the long tail of items where 24 hours is genuinely too long.
  4. Reconcile the next invoice. Batch line items appear separately; confirm the discount actually landed.
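Steps 1 and 2 amount to a ranking-and-share calculation over your usage export. A minimal sketch, assuming a hypothetical `(job_name, spend_usd)` row shape; your billing export will differ.

```python
from collections import defaultdict

def spend_by_job(usage_rows):
    """Aggregate 30-day spend per job name, highest first.

    usage_rows: iterable of (job_name, spend_usd) tuples -- a
    hypothetical schema standing in for your real usage export.
    """
    totals = defaultdict(float)
    for job, spend in usage_rows:
        totals[job] += spend
    return sorted(totals.items(), key=lambda kv: -kv[1])

def batch_eligible_share(ranked, eligible_jobs):
    """Fraction of total spend owned by jobs whose owners said
    a 24-hour turnaround is fine (step 2's survey result)."""
    total = sum(s for _, s in ranked)
    eligible = sum(s for j, s in ranked if j in eligible_jobs)
    return eligible / total if total else 0.0
```

If the share lands in the 30–50% range the audits above suggest, the expected saving is roughly half of that slice of the bill.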

Subtleties that bite

  1. Results are not guaranteed to come back in input order. Join outputs to inputs on your own request IDs, never on position.
  2. Failure is per-request. A batch can report completed and still ship an error file; parse it on every run, not just when something looks wrong.
  3. The 24-hour window is a ceiling, not a turnaround estimate. Most batches finish much sooner, but a batch that expires returns only the requests that completed.
  4. Batch queues carry their own limits (request counts, file sizes, enqueued tokens) separate from your synchronous rate limits, so a large backfill may need to be sharded across several batches.
