LLM cost monitoring.
Operations guide · 29 April 2026
LLM cost monitoring is the practice of tracking spend at request level, not just invoice level. If you cannot explain which feature, user, model, or retry pattern created the cost spike, you are not monitoring spend. You are only seeing the bill after the damage is done.
The four layers of monitoring
Good LLM cost monitoring combines four data sources. Provider billing tells you what was charged. Gateway logs tell you routing and policy decisions. Application telemetry tells you product context. Warehouse reporting lets you join spend to customers, plans, releases, and margin.
Each layer answers a different question. Provider data answers "what was actually charged?" Gateway logs answer "how was the request routed and which policies applied?" Application telemetry answers "why did it happen?" The warehouse answers "what does it cost per customer, plan, or release?"
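As a concrete illustration of the warehouse layer, here is a minimal sketch that joins request-level telemetry to account metadata so spend can be reported by plan. All table and column names (requests, accounts, est_cost_usd, and so on) are hypothetical.

```python
import pandas as pd

# Application telemetry: one row per LLM request (names are illustrative).
requests = pd.DataFrame([
    {"request_id": "r1", "account_id": "a1", "feature": "summarize", "est_cost_usd": 0.012},
    {"request_id": "r2", "account_id": "a2", "feature": "chat",      "est_cost_usd": 0.047},
])

# Business context from the warehouse: account -> plan.
accounts = pd.DataFrame([
    {"account_id": "a1", "plan": "free"},
    {"account_id": "a2", "plan": "enterprise"},
])

# Join spend to plans, then aggregate for margin and quota reporting.
spend_by_plan = (
    requests.merge(accounts, on="account_id")
            .groupby("plan")["est_cost_usd"]
            .sum()
)
print(spend_by_plan)
```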
What to track on every request
- Provider and model. Exact model identity matters because pricing changes by model tier and version.
- Feature or endpoint. This is usually the fastest route to actionable savings.
- User, account, or workspace. Necessary for abuse detection, pricing, and chargebacks.
- Input and output tokens. The basic cost driver.
- Cached tokens. Especially useful with provider-native prompt caching.
- Latency, retries, and tool calls. Hidden cost often sits in failed or repeated work.
- Estimated cost. So you can alert before invoice close.
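A minimal sketch of what such a per-request record could look like, including a naive cost estimate. Field names and the price table are illustrative assumptions, not real prices; real pricing varies by model tier and version, which is why exact model identity matters.

```python
from dataclasses import dataclass

# Placeholder USD prices per 1M tokens, keyed by model id (NOT real prices).
PRICE_PER_1M = {
    "example-small": {"input": 0.50, "output": 1.50, "cached_input": 0.25},
}

@dataclass
class LLMRequestRecord:
    provider: str          # e.g. "openai", "anthropic"
    model: str             # exact model identity, including version
    feature: str           # endpoint or product feature
    account_id: str        # user, account, or workspace
    input_tokens: int
    output_tokens: int
    cached_tokens: int     # tokens served from provider-native prompt cache
    latency_ms: float
    retries: int
    tool_calls: int

    def estimated_cost_usd(self) -> float:
        # Naive estimate; a real implementation must handle unknown models
        # and keep the price table in sync with provider pricing.
        p = PRICE_PER_1M[self.model]
        billable_input = self.input_tokens - self.cached_tokens
        return (
            billable_input * p["input"]
            + self.cached_tokens * p["cached_input"]
            + self.output_tokens * p["output"]
        ) / 1_000_000
```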
Use a consistent telemetry schema
OpenTelemetry's GenAI conventions are a strong baseline because they standardize fields like model, input tokens, output tokens, cache-read tokens, conversation identifiers, and provider names. That makes it easier to swap vendors or observability backends without rebuilding your data model every quarter.
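A hedged sketch of attaching GenAI-style attributes to a span with the OpenTelemetry Python API. The attribute names follow the incubating GenAI semantic conventions and can change between releases; the cached-token attribute name in particular is an assumption here, so check the semantic-convention version you pin.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-cost-monitoring")

def record_llm_call(provider: str, model: str, feature: str,
                    usage: dict, conversation_id: str) -> None:
    with tracer.start_as_current_span("llm.chat") as span:
        span.set_attribute("gen_ai.system", provider)
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.input_tokens", usage["input_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", usage["output_tokens"])
        # Assumed attribute name for cache-read tokens; verify against your semconv version.
        span.set_attribute("gen_ai.usage.cached_tokens", usage.get("cached_tokens", 0))
        span.set_attribute("gen_ai.conversation.id", conversation_id)
        # Business metadata lives outside the gen_ai.* namespace.
        span.set_attribute("app.feature", feature)
```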
The dashboards that matter
Most teams overbuild their dashboard layer and underbuild their tagging layer. Start with five views:
- Spend by feature. Top ten endpoints or workflows by total cost.
- Spend by model. Where the premium models are actually being used.
- Spend by customer or workspace. Profitability and quota control.
- Input and output token trends. Detect prompt growth and verbose output drift.
- Cache hit and retry rates. Two of the easiest places to find waste.
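A minimal sketch of computing these five views from a single request-level table, assuming the hypothetical column names used earlier plus a ts timestamp column.

```python
import pandas as pd

def build_views(df: pd.DataFrame) -> dict:
    return {
        # Top ten features by total estimated spend.
        "spend_by_feature": df.groupby("feature")["est_cost_usd"].sum().nlargest(10),
        # Where premium models are actually being used.
        "spend_by_model": df.groupby("model")["est_cost_usd"].sum().sort_values(ascending=False),
        # Profitability and quota control per account or workspace.
        "spend_by_account": df.groupby("account_id")["est_cost_usd"].sum().sort_values(ascending=False),
        # Daily averages to catch prompt growth and verbose output drift.
        "token_trends": df.groupby(df["ts"].dt.date)[["input_tokens", "output_tokens"]].mean(),
        # Share of requests with any cache hit, and share that retried.
        "cache_and_retry": pd.Series({
            "cache_hit_rate": (df["cached_tokens"] > 0).mean(),
            "retry_rate": (df["retries"] > 0).mean(),
        }),
    }
```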
Alerts worth setting up
- Daily spend threshold by feature.
- Unexpected model mix shift. Example: a cheap endpoint suddenly starts defaulting to a frontier model.
- Prompt length regression.
- Retry storm or agent loop behavior.
- Reconciliation variance. Internal estimate drifts too far from provider truth.
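Two of these alerts written as plain functions, assuming daily spend by feature and the provider invoice total are already available; the thresholds are illustrative, not recommendations.

```python
def spend_threshold_alerts(daily_spend_by_feature: dict[str, float],
                           threshold_usd: float = 50.0) -> list[str]:
    """Return the features whose spend for the day exceeded the threshold."""
    return [f for f, spend in daily_spend_by_feature.items() if spend > threshold_usd]

def reconciliation_alert(internal_estimate_usd: float,
                         provider_invoice_usd: float,
                         max_variance: float = 0.05) -> bool:
    """True when the internal estimate drifts more than 5% from provider truth."""
    if provider_invoice_usd == 0:
        return internal_estimate_usd > 0
    variance = abs(internal_estimate_usd - provider_invoice_usd) / provider_invoice_usd
    return variance > max_variance
```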
Common failure modes
The most common monitoring mistake is relying on response usage fields alone and assuming they are enough for finance. They are not: they rarely reconcile exactly with the invoice. Another common mistake is logging usage without business metadata, which makes it impossible to tie spend back to customers or features. A third is ignoring tool-call costs and caching behavior, which increasingly show up as material line items.
Simple stack that works
For a startup, a practical stack is provider-native billing plus a gateway, request metadata in application logs, and an observability tool like Langfuse or Helicone. For larger teams, add a warehouse layer and monthly reconciliation against provider or cloud billing exports.
Monitoring is not optimization, but it is the prerequisite
Monitoring alone does not lower your bill. It gives you the ranked list of what to fix first. Once that list is visible, the usual wins are smaller default models, output caps, prompt cleanup, prompt caching, batching, and quota controls.