Scale Tier vs Flex vs Batch vs Standard.
Published · 21 May 2026
Most teams default every call to standard processing. That is rarely the right answer. OpenAI now ships four processing paths with materially different price and latency profiles, and the cost gap between "all standard" and "the right tier per workload" is usually the biggest line item we touch in a first engagement. This is the decision matrix we use.
The four paths, in one sentence each
Standard is the default synchronous path — pay list price per token, get normal latency, no capacity commitments either direction. Good for interactive traffic you cannot predict.
Scale Tier is committed throughput at a premium. You reserve units of capacity for a term; in return you get latency and capacity guarantees that standard does not promise. Cheaper than standard only at the high end of utilization, and only if the commitment matches actual traffic.
Flex is the same synchronous request shape as standard, at a lower per-token price, with the explicit trade that the provider can deprioritize, slow, or briefly refuse your traffic when overall demand is high. The request returns; it just may return slower, or with a retry-after.
Batch is offline. You submit a JSONL file of requests, the provider processes them within a documented window (commonly up to 24 hours), and you collect results. The list-price posture is roughly half of standard. Wrong tool for anything a human is waiting on; right tool for anything a job scheduler is waiting on.
The three variables that decide
Every workload picks a tier on three axes, in this order:
1. Latency tolerance. If a user is staring at a spinner, you are in the standard or Scale Tier conversation. If a queue is staring at a spinner, Flex is on the table. If a dashboard refreshes in the morning, Batch is on the table.
2. Traffic shape. Smooth and predictable favors Scale Tier — you can size the commitment honestly. Spiky and unpredictable favors standard plus Flex overflow — you pay list for the peaks you actually use, and run cheaper paths for everything that can wait. Pure offline favors Batch.
3. Committed-spend posture. Can finance commit to a 6 or 12 month throughput line and defend it? Scale Tier becomes interesting. If next quarter's volume is a guess, do not lock anything in — standard plus Flex is cheaper in expectation than an over-committed reservation.
Get those three answers before you look at price. Picking a tier on price first is how teams end up paying for capacity they do not use.
A decision matrix
The archetypes below cover most of what we see in the field. Read across; the rationale matters more than the label.
| Workload | Default tier | Rationale |
|---|---|---|
| Customer-facing chat / support | Standard, Scale Tier if volume is large and smooth | User is waiting. Capacity SLA matters more than per-token price. Move to Scale Tier only once the daily curve is boring. |
| Internal copilot / IDE assistant | Standard for the hot path, Flex for non-blocking calls | Employees tolerate more variance than customers. Background suggestions, refactors, and explanations can run cheaper and slower without complaint. |
| Nightly enrichment / classification | Batch | Async by design. Hours of wall time are fine. The discount is the entire reason this workload exists at scale. |
| Eval runs / regression suites | Batch, Flex if you are iterating live | Most evals can wait. The exception is the engineer debugging an eval at 2pm — Flex keeps the same request shape and is cheaper than standard. |
| Content backfill / corpus rewrite | Batch | Large volume, no user attached, retry-friendly. If you are running this on standard, you are leaving the most money on the table of any line item we touch. |
| RAG retrieval-time generation | Standard | Latency-sensitive, hard to predict shape per query. Flex risks making search feel broken when capacity is tight. |
| Agent loop steps (interactive) | Standard | A slow step compounds across the loop. Save Flex for tool calls that the agent can already handle async. |
| Agent loop steps (background workflows) | Flex, Batch for plan-then-execute | If no human is watching the agent, latency variance is acceptable. Plan-and-execute agents with offline plans can batch the plan step. |
| Embeddings refresh / re-index | Batch | Almost always offline; the discount applies cleanly; failure handling is simple. |
| Spiky high-volume customer feature | Standard with Flex overflow | Pay list for the peaks you actually serve; do not commit Scale Tier units to a peak you only hit twice a week. |
When Scale Tier actually pays off
Scale Tier is the tier most teams pick wrong. The case for it is narrow:
- Smooth, high-volume traffic. A daily curve you can draw from memory. If the variance band on your hourly request rate is wider than the commitment unit, you will over-buy.
- Capacity matters more than price. When a missed capacity SLA hurts revenue — checkout assistants, paid-tier chat, anything where degraded latency is a customer-facing incident — the premium buys insurance, not throughput per dollar.
- The business can predict 12 months out. If finance signs the commit and product cannot defend the volume assumption, you have just turned a variable cost into a fixed cost on bad terms.
Outside those three conditions, standard plus Flex usually wins on total cost, and Batch usually wins on the offline portion. We have moved more workloads off Scale Tier than onto it.
When Flex wins over Batch
Flex and Batch are both "cheaper than standard," but they fit different shapes.
Flex wins when the workload still wants the synchronous request shape — request in, response out, same code path — but is happy to retry or wait a little longer. Internal tooling, second-pass enrichment that fires from a web request, agent steps that are not on the user's critical path, regenerations after a user clicks "try again." None of those want a batch job orchestrator. They want the same SDK call, with a cheaper bill and a tolerance for the occasional retry-after.
Batch wins when you can rephrase the workload as a file. If you are already writing rows to a queue or table, you can already write JSONL. If you are already triggering work on a schedule, you can already wait on a job ID. The orchestration overhead is the price of admission; the discount pays it back fast at volume.
When Batch is the right answer even at the cost of latency
Sometimes hours of delay is a feature, not a bug. The signal that Batch is the correct call:
- The output is consumed by another system, not a person.
- The volume is large enough that the discount shows up as a real line item, not a rounding error.
- The work is naturally idempotent — re-running a chunk costs nothing the business cares about.
- The downstream system can wait until tomorrow morning for last night's data.
Most enrichment, classification, summarization-of-corpora, eval, and content-rewrite workloads check all four boxes. Putting any of them on standard is a posture mistake, not a performance choice.
What we would never put on Scale Tier
Workloads that should not see a Scale Tier commitment, even at scale:
- Anything where capacity is not the bottleneck. If your problem is per-token cost or model quality, Scale Tier does not solve it.
- Anything already comfortably served by Flex. You are paying a premium to undo the discount.
- Anything served by Batch. You are paying a premium to undo a larger discount.
- Workloads with traffic variance wider than the commitment unit. You will eat the unused capacity every quiet hour.
- Pilot or proof-of-concept volume. Commitments are for boring, proven workloads.
The pricing posture that ties it together
The pattern that holds up across engagements: most teams should be on a mix, not a single tier.
Standard carries the interactive customer path — chat, search, checkout, anything where a slow response is a product defect. Flex covers internal tools and non-blocking customer features — the workloads that benefit from the synchronous shape but do not need the capacity guarantee. Batch handles offline work — enrichment, evals, embeddings, content rewrites, anything a scheduler can consume. Scale Tier is reserved, narrowly, for the workloads where a missed capacity SLA would visibly hurt revenue.
The mistake we see most often is the inverse: everything on standard "to keep it simple," with the Batch and Flex savings left on the table for a year, and a Scale Tier commitment bolted on top because someone in finance asked for predictable pricing. That stack is the most expensive way to run any of these workloads.
How to evaluate this in a week
You do not need a quarter to get the first cut right.
Day 1. Pull a week of usage. Tag every workload as interactive, internal, or offline. Note traffic shape — smooth or spiky.
Day 2. For each workload, write the latency budget in plain language. "User waits" / "queue waits" / "scheduler waits." That sentence picks the tier.
Day 3. Move the clearest Batch candidates first — nightly enrichment, eval runs, embeddings refresh. These are the safest moves and usually the largest dollar wins.
Day 4. Move internal-tool traffic to Flex behind a gateway or feature flag. Watch retry rates. If they stay reasonable, leave it there.
Day 5. Re-examine any existing Scale Tier commitment against the last 30 days of actual hourly usage. If utilization is below the breakeven, do not renew at that size.
That sequence usually lands the bulk of the savings before anyone has to argue about a commit. Scale Tier is a conversation you have later, once the mix is honest.