What are the different token types and how do they cost differently?

Input tokens are billed at the provider's base rate. Output tokens typically cost 2–4× more than input. Cache-read tokens (cached input) cost roughly 50% of the input rate on OpenAI and roughly 10% on Anthropic, a dramatic discount. Reasoning tokens are invisible to the application but are billed as output. Conflating these types corrupts cost baselines.

What fields should I log on every request?

Log provider and exact model version (pricing changes by model tier and release), feature or endpoint, user/account/customer ID, all four token counts (input, output, cache-read, reasoning), retries and tool call count, and estimated cost so you can alert before invoice close.

What common mistakes corrupt token usage baselines?

The most corrosive mistakes are: relying on response usage fields alone (they can lag or be incomplete); conflating cache-read with normal input tokens (the baseline becomes useless once caching adoption changes); ignoring reasoning token growth (invisible to the app but visible on the bill); missing tool-call costs (each tool-call round trip adds another billed API request); and logging without business metadata.

How do I track cache-read tokens correctly?

Track cache-read tokens separately from day one in a distinct column, never aggregated with input tokens. This separation is essential because Anthropic's ~90% discount means caching adoption changes your baseline dramatically. It also simplifies monthly reconciliation, because provider statements typically list cache reads as a separate line item.

OPERATIONS · MONITORING

How to track AI token usage.

Operations guide · June 11, 2026

By the LLM CFO team

Token usage tracking is the practice of recording and categorizing every token type flowing through your LLM requests. Without it, cost baselines become meaningless, caching cannot be measured, and you cannot distinguish growth in reasoning workloads from growth in normal compute.

Token types and their cost structure

Input tokens. Text sent to the model. Charged at the base rate per provider.
Output tokens. Text generated by the model. Typically costs 2–4× more than input.
Cache-read tokens (cached input). Previously cached prompt context that the model reuses. OpenAI charges ~50% of the input rate; Anthropic charges ~10% of the input rate. The discount is large enough to corrupt cost baselines if cache reads are conflated with normal input.
Reasoning tokens. Invisible token consumption during extended reasoning (OpenAI o1, Claude Opus with thinking). Billed as output tokens, not input, even though the application never sees them. Ignoring reasoning token growth leads to phantom cost spikes.
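The cost structure above can be made concrete with a small worked calculation. This is a minimal sketch, not a pricing reference: the rates in `EXAMPLE_RATES` are invented for illustration (chosen so that output is 5× input and cache reads are 10% of input), and `Rates` and `request_cost` are hypothetical names. It prices the same request twice, once with cache reads tracked separately and once with them lumped into input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Values used below are illustrative only."""
    input_per_m: float
    output_per_m: float
    cache_read_per_m: float


# Hypothetical prices, not any provider's real price list.
EXAMPLE_RATES = Rates(input_per_m=3.00, output_per_m=15.00, cache_read_per_m=0.30)


def request_cost(rates, input_tokens, output_tokens,
                 cache_read_tokens=0, reasoning_tokens=0):
    """Cost of one request in USD.

    Assumes input_tokens excludes cache reads and output_tokens excludes
    reasoning tokens, so no bucket is counted twice.
    """
    total = (
        input_tokens * rates.input_per_m
        + cache_read_tokens * rates.cache_read_per_m
        # Reasoning tokens are billed at the output rate.
        + (output_tokens + reasoning_tokens) * rates.output_per_m
    )
    return total / 1_000_000


# Same request, priced with cache reads in their own bucket...
separate = request_cost(EXAMPLE_RATES, input_tokens=2_000, output_tokens=500,
                        cache_read_tokens=8_000, reasoning_tokens=1_500)
# ...and with the 8,000 cached tokens wrongly treated as normal input.
conflated = request_cost(EXAMPLE_RATES, input_tokens=10_000, output_tokens=500,
                         reasoning_tokens=1_500)

print(f"separate:  ${separate:.4f}")   # separate:  $0.0384
print(f"conflated: ${conflated:.4f}")  # conflated: $0.0600
```

Under these assumed rates, conflating the buckets overstates the request's cost by more than half, and most of the true cost comes from reasoning tokens the application never sees.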

Essential fields to log on every request

Provider and exact model version. Pricing changes by model tier (gpt-4o vs gpt-4-turbo) and by dated snapshot (for example, gpt-4o-2024-05-13 vs gpt-4o-2024-08-06).
Feature or endpoint. The product surface triggering the request, which is often the fastest path to finding waste.
User, account, or customer ID. Required for abuse detection, usage quotas, and chargeback attribution.
All four token counts. Input, output, cache-read, and reasoning (if applicable). Omitting any distorts your cost model.
Retries and tool call count. A single user action may spawn multiple requests; counting at request level, not user action level, exposes hidden cost.
Estimated cost. Compute it immediately using your pricing lookup table so you can alert before invoice close rather than after.
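One way to hold these fields together is a single record type that is serialized once per provider call. The sketch below is an assumption about shape, not a standard: `LLMRequestRecord`, its field names, and the sample values (including the model name) are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class LLMRequestRecord:
    """One row per provider call. All names here are illustrative."""
    timestamp: str
    provider: str
    model: str               # exact versioned model name, not just the family
    feature: str             # product surface that triggered the call
    customer_id: str
    input_tokens: int        # uncached input only
    output_tokens: int       # visible output only
    cache_read_tokens: int
    reasoning_tokens: int
    retry_count: int
    tool_call_count: int
    estimated_cost_usd: float


def to_log_line(record):
    """Serialize the record as one JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


record = LLMRequestRecord(
    timestamp=datetime.now(timezone.utc).isoformat(),
    provider="example-provider",
    model="example-model-2026-01-01",  # hypothetical model name
    feature="ticket-summary",
    customer_id="cust_123",
    input_tokens=2_000,
    output_tokens=500,
    cache_read_tokens=8_000,
    reasoning_tokens=1_500,
    retry_count=0,
    tool_call_count=1,
    estimated_cost_usd=0.0384,
)
line = to_log_line(record)
print(line)
```

Keeping the four token counts as separate integer fields, rather than a single total, is what lets later rollups price each bucket at its own rate.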

Aggregation levels that matter

Structure your telemetry to support rollups at each level:

Per-request. The atomic fact: one call to the provider with timestamp, tokens, model, and tags.
Per-feature or endpoint. Sum requests by the feature that triggered them. This is where most quick wins live.
Per-customer or account. Necessary for invoicing, quota enforcement, and unit economics.
Per-invoice period. Match your provider statement. Variance here usually signals missing requests or timestamp misalignment.
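The rollup levels above all derive from the same per-request rows. As a rough sketch, assuming each row already carries a day, a feature tag, a customer ID, and a cost stored in integer micro-dollars (to avoid floating-point drift when summing), the `rollup` helper below groups by any key. The rows and names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical per-request facts; cost is in millionths of a dollar.
requests = [
    {"day": "2026-06-01", "feature": "search", "customer_id": "acme", "cost_micro_usd": 12_000},
    {"day": "2026-06-01", "feature": "summarize", "customer_id": "acme", "cost_micro_usd": 40_000},
    {"day": "2026-06-02", "feature": "search", "customer_id": "globex", "cost_micro_usd": 9_000},
    {"day": "2026-07-01", "feature": "summarize", "customer_id": "globex", "cost_micro_usd": 35_000},
]


def rollup(rows, key_fn):
    """Sum cost over rows, grouped by whatever key_fn extracts from a row."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += row["cost_micro_usd"]
    return dict(totals)


by_feature = rollup(requests, lambda r: r["feature"])
by_customer = rollup(requests, lambda r: r["customer_id"])
# Calendar month stands in for the invoice period here; real billing
# periods and time zones may differ, which is one source of variance.
by_month = rollup(requests, lambda r: r["day"][:7])

print(by_feature)   # {'search': 21000, 'summarize': 75000}
print(by_customer)  # {'acme': 52000, 'globex': 44000}
print(by_month)     # {'2026-06': 61000, '2026-07': 35000}
```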

Common pitfalls that corrupt your baseline

Relying on response usage fields alone. Provider APIs return usage counts in the response. These are usually correct, but they can lag, be incomplete (missing reasoning tokens), or lead you to double-count cached reads when one field already includes another. Record your own account of every request you send, and reconcile it against the provider statement monthly.
Conflating cache-read with normal input tokens. This is the most corrosive mistake. If you treat cache-read at the same rate as input, your baseline becomes useless when caching adoption grows (or shrinks). Track them separately from day one.
Ignoring reasoning token growth. Reasoning tokens are invisible to the application but visible to your cost statement. If reasoning workloads grow and you only track input and output, your cost-per-request metrics will mysteriously deteriorate.
Missing tool-call costs. Each tool-call round trip adds another API request, and that request resends the conversation so far, so it carries its own token cost. If your instrumentation only logs the final response, you miss the cost of the intermediate calls.
Logging without business metadata. Timestamps and token counts alone are not actionable. Tie every request to a feature, user, and model so you can drill down on cost spikes.
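Several of these pitfalls come down to how usage payloads are read. The sketch below normalizes two payload shapes into the same four buckets and then sums every request behind one user action. The field names reflect the usage objects returned by OpenAI's Chat Completions API and Anthropic's Messages API as understood at the time of writing; treat them as assumptions and verify against current provider documentation. The function names are hypothetical, and cache-write tokens are left out for brevity.

```python
def normalize_openai_style(usage):
    """Split an OpenAI-style usage dict into non-overlapping buckets.

    Assumes prompt_tokens includes cached tokens and completion_tokens
    includes reasoning tokens, so both are subtracted out here.
    """
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    reasoning = (usage.get("completion_tokens_details") or {}).get("reasoning_tokens", 0)
    return {
        "input_tokens": usage["prompt_tokens"] - cached,
        "cache_read_tokens": cached,
        "output_tokens": usage["completion_tokens"] - reasoning,
        "reasoning_tokens": reasoning,
    }


def normalize_anthropic_style(usage):
    """Anthropic-style usage: input_tokens is assumed to exclude cache reads."""
    return {
        "input_tokens": usage["input_tokens"],
        "cache_read_tokens": usage.get("cache_read_input_tokens", 0),
        # Thinking tokens are not broken out in this payload shape;
        # they are assumed to be inside output_tokens.
        "output_tokens": usage["output_tokens"],
        "reasoning_tokens": 0,
    }


def total_usage(normalized_rows):
    """Sum the buckets across every request behind one user action."""
    totals = {"input_tokens": 0, "cache_read_tokens": 0,
              "output_tokens": 0, "reasoning_tokens": 0}
    for row in normalized_rows:
        for key in totals:
            totals[key] += row[key]
    return totals


# One user action that needed a tool call: two billed requests, not one.
first = normalize_openai_style({
    "prompt_tokens": 10_000, "completion_tokens": 2_000,
    "prompt_tokens_details": {"cached_tokens": 8_000},
    "completion_tokens_details": {"reasoning_tokens": 1_500},
})
second = normalize_openai_style({
    "prompt_tokens": 10_400, "completion_tokens": 300,
    "prompt_tokens_details": {"cached_tokens": 10_000},
})
action_total = total_usage([first, second])
print(action_total)
# {'input_tokens': 2400, 'cache_read_tokens': 18000, 'output_tokens': 800, 'reasoning_tokens': 1500}

other = normalize_anthropic_style({
    "input_tokens": 2_000, "cache_read_input_tokens": 8_000, "output_tokens": 2_000,
})
```

Logging only the second response would have recorded 10,700 total tokens for an action that actually consumed 22,700.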

Tracking rule: if you cannot disaggregate your daily cost growth into (prompt length × request volume), (model mix shift), (reasoning token growth), and (cache hit rate change), your telemetry is incomplete.
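A quick check of the tracking rule is whether each of the four drivers can be computed per day from the logged rows. The sketch below is one possible reading, with assumptions stated up front: "cache hit rate" is taken as the token-weighted share of prompt tokens served from cache, "reasoning share" as reasoning tokens over all generated tokens, and the row fields and `daily_drivers` name are hypothetical. It reports the drivers side by side; it does not attribute dollars to each one.

```python
from collections import Counter


def daily_drivers(rows):
    """Compute the four cost drivers for one day's request rows."""
    volume = len(rows)
    if volume == 0:
        raise ValueError("need at least one request")
    prompt_tokens = sum(r["input_tokens"] + r["cache_read_tokens"] for r in rows)
    cache_read = sum(r["cache_read_tokens"] for r in rows)
    reasoning = sum(r["reasoning_tokens"] for r in rows)
    generated = sum(r["output_tokens"] + r["reasoning_tokens"] for r in rows)
    model_counts = Counter(r["model"] for r in rows)
    return {
        "request_volume": volume,
        "avg_prompt_tokens": prompt_tokens / volume,
        "model_mix": {m: c / volume for m, c in model_counts.items()},
        "reasoning_share": reasoning / generated if generated else 0.0,
        "cache_hit_rate": cache_read / prompt_tokens if prompt_tokens else 0.0,
    }


# Invented rows for two consecutive days.
yesterday = [
    {"model": "small", "input_tokens": 1_000, "cache_read_tokens": 0,
     "output_tokens": 200, "reasoning_tokens": 0},
    {"model": "small", "input_tokens": 1_000, "cache_read_tokens": 0,
     "output_tokens": 200, "reasoning_tokens": 0},
]
today = [
    {"model": "small", "input_tokens": 500, "cache_read_tokens": 1_500,
     "output_tokens": 200, "reasoning_tokens": 0},
    {"model": "large", "input_tokens": 500, "cache_read_tokens": 1_500,
     "output_tokens": 100, "reasoning_tokens": 300},
]

before = daily_drivers(yesterday)
after = daily_drivers(today)
for name in before:
    print(f"{name}: {before[name]} -> {after[name]}")
```

If any of these values cannot be computed because a field was never logged, that gap is the missing telemetry the rule refers to.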

Practical tools and schema

OpenTelemetry's GenAI semantic conventions are a strong starting point because they standardize attribute names (for example gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens) and reduce vendor lock-in. Attribute names for cache-creation and cache-read tokens vary across instrumentation libraries and convention versions, so check the version you target. Tools such as Langfuse, Helicone, and LiteLLM can capture per-request token usage, and several of them integrate with OpenTelemetry. For teams with custom logging, the schema should mirror these fields, with a separate bucket for each token type rather than a single sum.
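For custom logging, a thin mapping layer can keep internal field names separate from exported attribute names. The sketch below assumes the three `gen_ai.*` attribute names from the OpenTelemetry GenAI semantic conventions; everything under the `app.llm.*` prefix is a made-up namespace for fields whose standard names should be confirmed against the convention version in use. It builds a plain dict and does not call the OpenTelemetry SDK.

```python
def to_span_attributes(record):
    """Map an internal request record to flat span-style attributes."""
    return {
        # Names taken from the OpenTelemetry GenAI semantic conventions.
        "gen_ai.request.model": record["model"],
        "gen_ai.usage.input_tokens": record["input_tokens"],
        "gen_ai.usage.output_tokens": record["output_tokens"],
        # Hypothetical custom namespace: replace with the standard names
        # once confirmed for the convention version you target.
        "app.llm.cache_read_tokens": record["cache_read_tokens"],
        "app.llm.reasoning_tokens": record["reasoning_tokens"],
        "app.llm.feature": record["feature"],
        "app.llm.customer_id": record["customer_id"],
    }


attributes = to_span_attributes({
    "model": "example-model-2026-01-01",  # hypothetical model name
    "feature": "ticket-summary",
    "customer_id": "cust_123",
    "input_tokens": 2_000,
    "output_tokens": 500,
    "cache_read_tokens": 8_000,
    "reasoning_tokens": 1_500,
})

# Each token type stays in its own attribute; nothing is pre-summed.
for name, value in attributes.items():
    print(f"{name} = {value}")
```

Whether the exported input count should include cached tokens is a convention to settle once and document, since mixing both interpretations reintroduces the double-counting problem.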
