How to track AI token usage.
Operations guide · 11 June 2026
Token usage tracking is the practice of recording and categorizing every token type flowing through your LLM requests. Without it, cost baselines become meaningless, caching savings cannot be measured, and you cannot distinguish growth in reasoning workloads from growth in ordinary request volume.
Token types and their cost structure
- Input tokens. Text sent to the model. Charged at the base rate per provider.
- Output tokens. Text generated by the model. Typically priced at several times the input rate (often 3–5×, depending on provider and model).
- Cache-read tokens (cached input). Previously cached prompt context that the model reuses. The discount varies by provider and model: OpenAI has charged roughly 50% of the input rate and Anthropic roughly 10%. A discount that steep corrupts cost baselines if cache reads are conflated with normal input.
- Reasoning tokens. Tokens consumed during extended reasoning (for example OpenAI o1, or Claude with extended thinking) that the application typically does not see in full. They are billed as output tokens, not input. Reasoning token growth that goes untracked shows up as unexplained cost spikes.
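To make the cost structure above concrete, here is a minimal sketch of a per-request cost estimate that prices each token type separately. The `Rates` class, the `estimate_cost` function, and the dollar figures are illustrative assumptions, not any provider's published price list. It assumes `input_tokens` means uncached input only and `output_tokens` means visible output only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical USD prices per one million tokens."""
    input_per_m: float
    output_per_m: float
    cache_read_per_m: float


def estimate_cost(rates, input_tokens, output_tokens,
                  cache_read_tokens=0, reasoning_tokens=0):
    """Price each bucket at its own rate; reasoning is billed at the output rate."""
    micro_usd = (
        input_tokens * rates.input_per_m
        + cache_read_tokens * rates.cache_read_per_m
        + (output_tokens + reasoning_tokens) * rates.output_per_m
    )
    return micro_usd / 1_000_000


# Illustrative rates only: output at 4x input, cache reads at 10% of input.
rates = Rates(input_per_m=2.50, output_per_m=10.00, cache_read_per_m=0.25)

# Same request, costed two ways.
separate = estimate_cost(rates, input_tokens=2_000, output_tokens=500,
                         cache_read_tokens=8_000, reasoning_tokens=1_500)
# Cache reads lumped into input, reasoning tokens ignored.
conflated = estimate_cost(rates, input_tokens=10_000, output_tokens=500)
```

With these made-up numbers the two estimates disagree in both directions at once: conflating cache reads overstates the input cost, while ignoring reasoning tokens understates the output cost. The errors partially cancel, which is exactly why the baseline looks plausible until the traffic mix shifts.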
Essential fields to log on every request
- Provider and exact model version. Pricing changes by model tier (gpt-4o vs gpt-4-turbo) and can change between dated snapshots of the same model (gpt-4o-2024-05-13 vs gpt-4o-2024-08-06).
- Feature or endpoint. The product surface triggering the request—often the fastest path to finding waste.
- User, account, or customer ID. Required for abuse detection, usage quotas, and chargeback attribution.
- All four token counts. Input, output, cache-read, and reasoning (if applicable). Omitting any distorts your cost model.
- Retries and tool call count. A single user action may spawn multiple requests; logging every request, rather than one entry per user action, exposes that hidden cost.
- Estimated cost. Compute it immediately using your pricing lookup table so you can alert before invoice close rather than after.
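The fields above could be captured in a single structured record per request. The sketch below is one possible shape, not a standard: the class name `LLMRequestRecord`, the field names, and the sample values are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class LLMRequestRecord:
    """One log line per provider call (hypothetical schema)."""
    timestamp: str
    provider: str
    model: str            # exact version string, not just the family
    feature: str          # product surface that triggered the call
    customer_id: str
    input_tokens: int     # uncached input only
    output_tokens: int    # visible output only
    cache_read_tokens: int = 0
    reasoning_tokens: int = 0
    retry_count: int = 0
    tool_call_count: int = 0
    estimated_cost_usd: float = 0.0

    def to_json(self) -> str:
        # sort_keys keeps the log lines stable for diffing and parsing
        return json.dumps(asdict(self), sort_keys=True)


record = LLMRequestRecord(
    timestamp=datetime(2026, 6, 11, 12, 0, tzinfo=timezone.utc).isoformat(),
    provider="openai",
    model="gpt-4o-2024-08-06",
    feature="search-summary",   # made-up feature name
    customer_id="acct-123",     # made-up customer ID
    input_tokens=2_000,
    output_tokens=500,
    cache_read_tokens=8_000,
    estimated_cost_usd=0.012,   # placeholder value
)
line = record.to_json()
```

Keeping each token type in its own field, rather than storing a total, is what allows the record to be re-priced later when rates change.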
Aggregation levels that matter
Structure your telemetry to support rollups at each level:
- Per-request. The atomic fact: one call to the provider with timestamp, tokens, model, and tags.
- Per-feature or endpoint. Sum requests by the feature that triggered them. This is where most quick wins live.
- Per-customer or account. Necessary for invoicing, quota enforcement, and unit economics.
- Per-invoice period. Match your provider statement. Variance here usually signals missing requests or timestamp misalignment.
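If per-request records carry the right tags, each rollup level above is the same aggregation with a different grouping key. This is a minimal in-memory sketch; the `rollup` helper, the record keys, and the sample data are assumptions for illustration, and a real pipeline would more likely run the equivalent query in a warehouse or metrics store.

```python
from collections import defaultdict

TOKEN_FIELDS = ("input_tokens", "output_tokens",
                "cache_read_tokens", "reasoning_tokens")


def rollup(records, key):
    """Sum request count, each token bucket, and cost, grouped by records[key]."""
    totals = defaultdict(
        lambda: {"requests": 0, "cost_usd": 0.0, **{f: 0 for f in TOKEN_FIELDS}}
    )
    for rec in records:
        bucket = totals[rec[key]]
        bucket["requests"] += 1
        bucket["cost_usd"] += rec.get("cost_usd", 0.0)
        for field in TOKEN_FIELDS:
            bucket[field] += rec.get(field, 0)
    return dict(totals)


# Made-up per-request records.
records = [
    {"feature": "search", "customer_id": "a", "period": "2026-06",
     "input_tokens": 100, "output_tokens": 20, "cost_usd": 0.5},
    {"feature": "search", "customer_id": "b", "period": "2026-06",
     "input_tokens": 300, "output_tokens": 40, "cache_read_tokens": 900,
     "cost_usd": 0.25},
    {"feature": "chat", "customer_id": "a", "period": "2026-06",
     "input_tokens": 50, "output_tokens": 10, "reasoning_tokens": 200,
     "cost_usd": 0.25},
]

by_feature = rollup(records, "feature")
by_customer = rollup(records, "customer_id")
by_period = rollup(records, "period")
```

Because every level is derived from the same per-request facts, the per-period total can be compared against the provider statement and any variance traced back down to a feature or customer.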
Common pitfalls that corrupt your baseline
- Relying on response usage fields alone. Provider APIs return token counts in the response's usage fields. These are often correct, but they can lag, be incomplete (missing reasoning tokens), or lead to double-counting when the reported input total already includes cached reads. Keep your own record of what you send and compare it against the provider statement monthly.
- Conflating cache-read with normal input tokens. This is the most corrosive mistake. If you treat cache-read at the same rate as input, your baseline becomes useless when caching adoption grows (or shrinks). Track them separately from day one.
- Ignoring reasoning token growth. Reasoning tokens are invisible to the application but visible to your cost statement. If reasoning workloads grow and you only track input and output, your cost-per-request metrics will mysteriously deteriorate.
- Missing tool-call costs. Each tool-call round trip sends the conversation back to the model as a separate API request with its own token cost. If your instrumentation only logs the final response, you miss the cost of the intermediate calls.
- Logging without business metadata. Timestamps and token counts alone are not actionable. Tie every request to a feature, user, and model so you can drill down on cost spikes.
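One way to avoid the double-counting and conflation pitfalls above is to normalize each provider's usage payload into the same non-overlapping buckets before logging. The sketch below assumes two payload shapes: an OpenAI-style one where `prompt_tokens` includes cached tokens and `completion_tokens` includes reasoning tokens, and an Anthropic-style one where `input_tokens` excludes cache reads and thinking tokens are not broken out of `output_tokens`. Those shapes reflect the providers' APIs as understood at the time of writing and should be verified against current documentation; the function names and the `cache_write_tokens` bucket are hypothetical.

```python
def normalize_openai(usage: dict) -> dict:
    """Split totals that include sub-counts into non-overlapping buckets."""
    prompt_details = usage.get("prompt_tokens_details") or {}
    completion_details = usage.get("completion_tokens_details") or {}
    cached = prompt_details.get("cached_tokens", 0)
    reasoning = completion_details.get("reasoning_tokens", 0)
    return {
        "input_tokens": usage["prompt_tokens"] - cached,         # uncached only
        "cache_read_tokens": cached,
        "cache_write_tokens": 0,                                 # not reported in this shape
        "output_tokens": usage["completion_tokens"] - reasoning, # visible only
        "reasoning_tokens": reasoning,
    }


def normalize_anthropic(usage: dict) -> dict:
    """Assumed shape: input already excludes cache reads and writes."""
    return {
        "input_tokens": usage["input_tokens"],
        "cache_read_tokens": usage.get("cache_read_input_tokens") or 0,
        "cache_write_tokens": usage.get("cache_creation_input_tokens") or 0,
        "output_tokens": usage["output_tokens"],
        "reasoning_tokens": 0,  # assumed not reported separately; stays in output
    }


# Made-up payloads for illustration.
openai_usage = {
    "prompt_tokens": 10_000,
    "completion_tokens": 2_000,
    "prompt_tokens_details": {"cached_tokens": 8_000},
    "completion_tokens_details": {"reasoning_tokens": 1_500},
}
anthropic_usage = {
    "input_tokens": 2_000,
    "output_tokens": 500,
    "cache_read_input_tokens": 8_000,
    "cache_creation_input_tokens": 0,
}

normalized_openai = normalize_openai(openai_usage)
normalized_anthropic = normalize_anthropic(anthropic_usage)
```

After normalization the buckets sum to the billed token total without overlap, so the same pricing function and the same dashboards can be applied across providers.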
Practical tools and schema
OpenTelemetry's GenAI semantic conventions are a strong starting point because they standardize attribute names (gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) and reduce vendor lock-in. Cache-specific counters (for example llm.usage.cache_creation_input_tokens and llm.usage.cache_read_input_tokens) are named differently across instrumentation libraries and convention versions, so confirm the names your tooling actually emits. Tools such as Langfuse, Helicone, and LiteLLM offer OpenTelemetry integrations to varying degrees. For teams with custom logging, the schema should mirror these fields: separate buckets for each token type, not a sum.
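For custom logging, a small mapping function can translate internal token buckets into span or log attributes. The sketch below builds a plain dictionary and does not import the OpenTelemetry SDK. The `gen_ai.*` names follow the GenAI semantic conventions; the two `llm.usage.cache_*` names are the ones quoted in this guide and should be checked against your instrumentation; `app.usage.reasoning_tokens` and the function name are hypothetical.

```python
def to_otel_attributes(model: str, usage: dict) -> dict:
    """Map normalized token buckets to attribute names, one attribute per bucket."""
    attrs = {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": usage["input_tokens"],
        "gen_ai.usage.output_tokens": usage["output_tokens"],
    }
    optional = {
        # Names quoted in this guide; verify against your tooling.
        "llm.usage.cache_read_input_tokens": usage.get("cache_read_tokens"),
        "llm.usage.cache_creation_input_tokens": usage.get("cache_write_tokens"),
        # Hypothetical custom attribute for reasoning tokens.
        "app.usage.reasoning_tokens": usage.get("reasoning_tokens"),
    }
    # Emit only the buckets that were actually measured.
    attrs.update({k: v for k, v in optional.items() if v is not None})
    return attrs


attributes = to_otel_attributes(
    "gpt-4o-2024-08-06",
    {"input_tokens": 2_000, "output_tokens": 500, "cache_read_tokens": 8_000},
)
```

Omitting unmeasured buckets, rather than writing zero, keeps "not reported" distinguishable from "reported as zero" when reconciling against the provider statement.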