Multimodal costs sneak up.
Cost note · 7 May 2026
Many teams still manage AI cost as if it were mostly a text-token problem. That is getting less true every quarter. Once image inputs, image generation, audio, or realtime sessions enter the product, cost starts to scale with image size, output quality, and session length as well as prompt length, and the old dashboards stop telling the full story.
Why this matters now
OpenAI's current pricing makes the shape of the issue obvious. Image tokens, audio tokens, realtime audio, and image generation each carry their own pricing logic, and some of them are materially more expensive than plain text input. The result is that a product can feel "chat-like" from the outside while actually being driven by image, audio, or realtime spend under the hood.
The common blind spot
Teams often aggregate everything into one AI bill and lose the modality split. Then they try to optimize text prompts while the real growth is coming from image detail settings, large input images, long audio sessions, or repeated multimodal retries.
Where the spend usually appears
- Vision inputs. Large images and high-detail inspection paths raise the cost of every request that carries them.
- Image generation. Resolution and quality tiers change the economics fast.
- Realtime audio. Long sessions can quietly accumulate high token volume.
- Multimodal retries. Re-sending images or audio after validation failures.
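To make the retry item concrete, here is a minimal sketch of how full re-sends inflate the expected cost of a request. It assumes each attempt fails independently with the same probability and that every retry re-sends the whole payload at the same price. The function name, parameters, and figures are illustrative, not taken from any provider's pricing.

```python
def expected_cost_with_retries(
    cost_per_attempt: float,
    failure_rate: float,
    max_retries: int,
) -> float:
    """Expected spend per request when each failed attempt is re-sent in full.

    Attempt k (0-based) only happens if the previous k attempts all failed,
    so it is paid for with probability failure_rate ** k.
    Assumption: failures are independent and every attempt costs the same.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return sum(
        cost_per_attempt * failure_rate ** attempt
        for attempt in range(max_retries + 1)
    )


# Hypothetical numbers: an image request costing 1.0 unit per attempt,
# failing validation half the time, with up to two retries.
baseline = expected_cost_with_retries(1.0, 0.0, 2)
with_retries = expected_cost_with_retries(1.0, 0.5, 2)
print(f"no failures: {baseline}, 50% failure rate: {with_retries}")
```

Under these assumptions the media payload is paid for again on every failed attempt, so validating or compressing inputs before the first send is usually cheaper than retrying after it.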
How teams keep it under control
- Track spend by modality, not just by model.
- Downshift quality where the user value is low. Not every image generation needs the highest tier.
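- Use lower-fidelity inspection when possible, such as a lower image detail setting or a downscaled input, when the task does not depend on fine detail.
- Limit session length and repeated uploads. Cap how long a realtime session can run and avoid re-sending the same image or audio on each turn.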
- Separate realtime from offline workflows. They should not share the same economic assumptions.
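The first item in the list, tracking spend by modality rather than by model, can be sketched as a simple group-by over usage records. The `UsageRecord` schema and the modality labels below are hypothetical. In practice they would come from whatever request logging or billing export the team already has.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class UsageRecord:
    # Hypothetical schema: one row per billed request.
    model: str
    modality: str  # e.g. "text", "vision_input", "image_generation", "realtime_audio"
    cost_usd: float


def spend_by_modality(records: Iterable[UsageRecord]) -> Dict[str, float]:
    """Sum cost per modality, ignoring which model served the request."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.modality] += record.cost_usd
    return dict(totals)


def modality_shares(totals: Dict[str, float]) -> Dict[str, float]:
    """Convert per-modality totals into fractions of the overall bill."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {modality: 0.0 for modality in totals}
    return {modality: cost / grand_total for modality, cost in totals.items()}


# Illustrative records with made-up costs and placeholder model names.
records = [
    UsageRecord(model="model-a", modality="text", cost_usd=1.0),
    UsageRecord(model="model-a", modality="vision_input", cost_usd=2.0),
    UsageRecord(model="model-b", modality="text", cost_usd=1.0),
]
totals = spend_by_modality(records)
shares = modality_shares(totals)
print(totals, shares)
```

Keying on modality instead of model matters because one model can serve both text and image requests, so a per-model view hides exactly the split this note is about.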
What to measure
- Spend by modality
- Average cost per image task
- Average audio session duration
- Retry rate on multimodal requests
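The last three metrics are ratios that can be computed from the same request log. The sketch below assumes a hypothetical `RequestLog` row with a `task_id` that groups an original request with its retries, and one row per realtime session carrying its duration. Spend by modality is a plain group-by sum and is left out here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumption: these labels mark image work in the log; adjust to local naming.
IMAGE_MODALITIES = {"vision_input", "image_generation"}


@dataclass(frozen=True)
class RequestLog:
    # Hypothetical schema, not a provider format.
    task_id: str
    modality: str
    cost_usd: float
    is_retry: bool = False
    audio_seconds: Optional[float] = None  # set only on audio session rows


def summarize(logs: List[RequestLog]) -> Dict[str, float]:
    """Compute average image task cost, average audio duration, and retry rate."""
    image_logs = [log for log in logs if log.modality in IMAGE_MODALITIES]
    image_tasks = {log.task_id for log in image_logs}
    durations = [log.audio_seconds for log in logs if log.audio_seconds is not None]
    multimodal = [log for log in logs if log.modality != "text"]
    retries = [log for log in multimodal if log.is_retry]

    return {
        # Retries count toward the cost of the task they belong to.
        "avg_cost_per_image_task": (
            sum(log.cost_usd for log in image_logs) / len(image_tasks)
            if image_tasks else 0.0
        ),
        "avg_audio_session_seconds": (
            sum(durations) / len(durations) if durations else 0.0
        ),
        "multimodal_retry_rate": (
            len(retries) / len(multimodal) if multimodal else 0.0
        ),
    }


# Illustrative log with made-up costs and durations.
sample_logs = [
    RequestLog("a", "vision_input", 0.5),
    RequestLog("a", "vision_input", 0.5, is_retry=True),
    RequestLog("b", "image_generation", 1.0),
    RequestLog("c", "realtime_audio", 2.0, audio_seconds=120.0),
    RequestLog("d", "realtime_audio", 1.0, audio_seconds=60.0),
    RequestLog("e", "text", 0.25),
]
metrics = summarize(sample_logs)
print(metrics)
```

Dividing image spend by distinct tasks rather than by requests is a deliberate choice: it folds the cost of retries into the task that caused them, which is closer to what a user-visible action actually costs.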