← home
RESEARCH · PROVIDER POSTURE

Scale Tier vs Flex vs Batch vs Standard.

Published · 21 May 2026

By the LLM CFO team

Most teams default every call to standard processing. That is rarely the right answer. OpenAI now ships four processing paths with materially different price and latency profiles, and the cost gap between "all standard" and "the right tier per workload" is usually the biggest line item we touch in a first engagement. This is the decision matrix we use.

The four paths, in one sentence each

Standard is the default synchronous path — pay list price per token, get normal latency, no capacity commitments either direction. Good for interactive traffic you cannot predict.

Scale Tier is committed throughput at a premium. You reserve units of capacity for a term; in return you get latency and capacity guarantees that standard does not promise. Cheaper than standard only at the high end of utilization, and only if the commitment matches actual traffic.

Flex is the same synchronous request shape as standard, at a lower per-token price, with the explicit trade that the provider can deprioritize, slow, or briefly refuse your traffic when overall demand is high. The request returns; it just may return slower, or with a retry-after.

Batch is offline. You submit a JSONL file of requests, the provider processes them within a documented window (commonly up to 24 hours), and you collect results. The list-price posture is roughly half of standard. Wrong tool for anything a human is waiting on; right tool for anything a job scheduler is waiting on.

The three variables that decide

Every workload picks a tier on three axes, in this order:

1. Latency tolerance. If a user is staring at a spinner, you are in the standard or Scale Tier conversation. If a queue is staring at a spinner, Flex is on the table. If a dashboard refreshes in the morning, Batch is on the table.

2. Traffic shape. Smooth and predictable favors Scale Tier — you can size the commitment honestly. Spiky and unpredictable favors standard plus Flex overflow — you pay list for the peaks you actually use, and run cheaper paths for everything that can wait. Pure offline favors Batch.

3. Committed-spend posture. Can finance commit to a 6 or 12 month throughput line and defend it? Scale Tier becomes interesting. If next quarter's volume is a guess, do not lock anything in — standard plus Flex is cheaper in expectation than an over-committed reservation.

Get those three answers before you look at price. Picking a tier on price first is how teams end up paying for capacity they do not use.

A decision matrix

The archetypes below cover most of what we see in the field. Read across; the rationale matters more than the label.

Workload Default tier Rationale
Customer-facing chat / support Standard, Scale Tier if volume is large and smooth User is waiting. Capacity SLA matters more than per-token price. Move to Scale Tier only once the daily curve is boring.
Internal copilot / IDE assistant Standard for the hot path, Flex for non-blocking calls Employees tolerate more variance than customers. Background suggestions, refactors, and explanations can run cheaper and slower without complaint.
Nightly enrichment / classification Batch Async by design. Hours of wall time are fine. The discount is the entire reason this workload exists at scale.
Eval runs / regression suites Batch, Flex if you are iterating live Most evals can wait. The exception is the engineer debugging an eval at 2pm — Flex keeps the same request shape and is cheaper than standard.
Content backfill / corpus rewrite Batch Large volume, no user attached, retry-friendly. If you are running this on standard, you are leaving the most money on the table of any line item we touch.
RAG retrieval-time generation Standard Latency-sensitive, hard to predict shape per query. Flex risks making search feel broken when capacity is tight.
Agent loop steps (interactive) Standard A slow step compounds across the loop. Save Flex for tool calls that the agent can already handle async.
Agent loop steps (background workflows) Flex, Batch for plan-then-execute If no human is watching the agent, latency variance is acceptable. Plan-and-execute agents with offline plans can batch the plan step.
Embeddings refresh / re-index Batch Almost always offline; the discount applies cleanly; failure handling is simple.
Spiky high-volume customer feature Standard with Flex overflow Pay list for the peaks you actually serve; do not commit Scale Tier units to a peak you only hit twice a week.

When Scale Tier actually pays off

Scale Tier is the tier most teams pick wrong. The case for it is narrow:

Outside those three conditions, standard plus Flex usually wins on total cost, and Batch usually wins on the offline portion. We have moved more workloads off Scale Tier than onto it.

When Flex wins over Batch

Flex and Batch are both "cheaper than standard," but they fit different shapes.

Flex wins when the workload still wants the synchronous request shape — request in, response out, same code path — but is happy to retry or wait a little longer. Internal tooling, second-pass enrichment that fires from a web request, agent steps that are not on the user's critical path, regenerations after a user clicks "try again." None of those want a batch job orchestrator. They want the same SDK call, with a cheaper bill and a tolerance for the occasional retry-after.

Batch wins when you can rephrase the workload as a file. If you are already writing rows to a queue or table, you can already write JSONL. If you are already triggering work on a schedule, you can already wait on a job ID. The orchestration overhead is the price of admission; the discount pays it back fast at volume.

When Batch is the right answer even at the cost of latency

Sometimes hours of delay is a feature, not a bug. The signal that Batch is the correct call:

Most enrichment, classification, summarization-of-corpora, eval, and content-rewrite workloads check all four boxes. Putting any of them on standard is a posture mistake, not a performance choice.

What we would never put on Scale Tier

Workloads that should not see a Scale Tier commitment, even at scale:

The pricing posture that ties it together

The pattern that holds up across engagements: most teams should be on a mix, not a single tier.

Standard carries the interactive customer path — chat, search, checkout, anything where a slow response is a product defect. Flex covers internal tools and non-blocking customer features — the workloads that benefit from the synchronous shape but do not need the capacity guarantee. Batch handles offline work — enrichment, evals, embeddings, content rewrites, anything a scheduler can consume. Scale Tier is reserved, narrowly, for the workloads where a missed capacity SLA would visibly hurt revenue.

The mistake we see most often is the inverse: everything on standard "to keep it simple," with the Batch and Flex savings left on the table for a year, and a Scale Tier commitment bolted on top because someone in finance asked for predictable pricing. That stack is the most expensive way to run any of these workloads.

How to evaluate this in a week

You do not need a quarter to get the first cut right.

Day 1. Pull a week of usage. Tag every workload as interactive, internal, or offline. Note traffic shape — smooth or spiky.

Day 2. For each workload, write the latency budget in plain language. "User waits" / "queue waits" / "scheduler waits." That sentence picks the tier.

Day 3. Move the clearest Batch candidates first — nightly enrichment, eval runs, embeddings refresh. These are the safest moves and usually the largest dollar wins.

Day 4. Move internal-tool traffic to Flex behind a gateway or feature flag. Watch retry rates. If they stay reasonable, leave it there.

Day 5. Re-examine any existing Scale Tier commitment against the last 30 days of actual hourly usage. If utilization is below the breakeven, do not renew at that size.

That sequence usually lands the bulk of the savings before anyone has to argue about a commit. Scale Tier is a conversation you have later, once the mix is honest.

Related

← Back to research