Quick answer: What is a Token Taxonomy?

A token taxonomy covers the most common token types used in modern AI systems: input tokens, output tokens, reasoning tokens, speculative tokens, cached tokens, tool-use and retrieval tokens, multimodal tokens, and structural tokens. They are not interchangeable. Each type consumes compute differently, affects latency and context in different ways, and may be billed differently by providers. That is why tokens are no longer just a text-processing unit; they are now the core unit of AI economics and system design.

Key concepts in this article:

  • Input tokens: the tokens you send into the model during the prefill phase.

  • Output tokens: the tokens the model generates autoregressively, usually at higher cost.

  • Reasoning tokens: internal “thinking” tokens that can dominate usage on reasoning-heavy models.

  • Speculative tokens: draft tokens generated for speed, many of which are later discarded.

  • Cached tokens: reused prompt/context tokens that reduce recomputation and cost.

  • Tool-use and retrieval tokens: hidden overhead from system prompts, function schemas, agent loops, and RAG context.

  • Multimodal tokens: tokens created from images, audio, video, and code inputs.

  • Structural tokens: invisible control tokens such as role separators, BOS/EOS, and padding.

  • Why this matters: token types now shape pricing, architecture, latency, and product design.

Last time (AI 101: What is a Token?) we covered what a token is and how tokenization turns text/video/audio into something a model can process. That was the foundation. Now let's talk about what actually happens in production – because in production, there is no such thing as just "a token" anymore.

Jensen Huang recently said that the AI business is about transforming electrons into tokens. That framing is exactly right – but it is incomplete. Because "tokens" is no longer one thing. A single API call can now involve input tokens, output tokens, reasoning tokens, cached tokens, tool-use tokens, and vision tokens – each billed differently, each consuming compute in a different way. If you want to make good decisions about AI systems and how you are spending your electrons, you need a taxonomy: a clear map of what all these tokens actually are and why they behave so differently.

So thank you, Jensen, for the inspiration. Here is my taxonomy. Let me walk you through the full token zoo, species by species. I’m pretty sure you’ll find some you’ve never heard of before (and yet you are still paying for them).

Fair warning: you'll never look at an API pricing page the same way again.

In today’s episode:

  • The basic split: input vs. output tokens

  • Reasoning tokens: the thinking tax

  • Speculative tokens: the ones born to be discarded

  • Cached tokens: the reuse discount

  • Tool-use, system, and retrieval tokens: the hidden overhead

  • Multimodal tokens: when images, audio, and video enter the pipeline

  • Structural tokens: the scaffolding you never see

  • Why the token zoo matters: economics and architecture

  • Sources and further reading

The basic split: input vs. output tokens

Every API call has two sides: what you send in (input tokens) and what comes back (output tokens). And as you might guess, generating text is computationally harder than reading it.

When the model processes your input, it can do it in parallel. All the tokens in your prompt get processed more or less at once, in what is called the prefill phase. The model builds up its internal representation of everything you said in a single forward pass.

Output is different. The model generates tokens one at a time, each one conditioned on all the tokens that came before it. This is autoregressive generation, and it is inherently sequential. Each new token requires a separate forward pass through the model – or at least through the decoder layers. That is why output tokens are more expensive. They require more compute per token.
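The asymmetry is easiest to see as a toy pass counter – no real model, just the control flow (everything here is illustrative):

```python
def forward_passes(prompt_tokens, n_output_tokens):
    """Toy illustration of prefill vs. decode.

    Prefill: the entire prompt is processed in one parallel forward pass,
    no matter how many tokens it contains.
    Decode: each generated token needs its own sequential forward pass.
    Returns the total number of forward passes for the call.
    """
    passes = 1  # prefill phase: all prompt tokens at once

    # Decode phase: strictly sequential, one pass per new token.
    for _ in range(n_output_tokens):
        passes += 1

    return passes

# A 1,000-token prompt with a 50-token reply:
# 1 prefill pass + 50 decode passes.
print(forward_passes(list(range(1000)), 50))  # 51
```

Doubling the prompt barely changes the pass count; doubling the output doubles the decode work. That is the hardware reality behind the pricing gap.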

This is also why you see the price difference on every API pricing page. As of mid-2026, output tokens typically cost 2x to 6x more than input tokens, depending on the provider and model. That ratio reflects a real difference in how the hardware is used.


A practical implication: if you can restructure a task to reduce output length without losing quality – for example, by asking for structured JSON instead of verbose explanations – you are already saving.
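A back-of-the-envelope estimator makes that saving concrete. The per-million prices below are placeholders, not any provider's actual rates – only the 4x input/output ratio (within the typical 2x–6x range) matters:

```python
def call_cost(input_tokens, output_tokens,
              input_price_per_m=1.00, output_price_per_m=4.00):
    """Estimate the dollar cost of one API call.

    Prices are hypothetical: $1.00 per million input tokens,
    $4.00 per million output tokens (a 4x ratio).
    """
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Same 2,000-token prompt, verbose prose answer vs. terse structured JSON:
verbose = call_cost(2_000, 1_200)
concise = call_cost(2_000, 300)
print(f"savings: {1 - concise / verbose:.0%}")  # savings: 53%
```

Because output is the expensive side, trimming 900 output tokens cuts the bill by half even though the prompt is unchanged.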

Reasoning tokens: the thinking tax

This is the category that has emerged most dramatically since 2024. Reasoning tokens – also called thinking tokens – are tokens the model generates internally as part of a chain-of-thought process before producing its final answer.

When you use models with extended thinking, the model does not jump directly to the answer. It first "thinks through" the problem, producing an internal monologue that may be partially or fully hidden from you. Those intermediate tokens still consume compute. They still cost money. And they can massively inflate the total token count of a response.

Here is what makes reasoning tokens interesting from an economic standpoint:

  • They can dominate total token usage. A math problem that produces a 200-token answer might generate 3,000 reasoning tokens internally. Your bill reflects the 3,200, not the 200.

  • They create a new optimization problem. With standard generation, you optimize for prompt efficiency and output conciseness. With reasoning models, you also have to think about whether the task actually benefits from extended reasoning. Simple tasks routed to a reasoning model are pure waste.

Some providers are now exposing reasoning token counts as a separate line item in their API responses. Others fold them into the output token price. This lack of standardization makes cost comparison between providers much harder.
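To see how reasoning tokens distort a bill, here is a sketch of the accounting, assuming a provider that bills reasoning tokens at the output rate (common, but not universal). The field names and prices are illustrative, not any specific API's:

```python
def reasoning_call_cost(usage,
                        input_price_per_m=1.00, output_price_per_m=4.00):
    """Cost of a reasoning-model call, assuming reasoning tokens are
    billed at the output-token rate.

    `usage` mimics the shape of an API usage object; actual field
    names vary by provider.
    """
    billed_output = usage["output_tokens"] + usage["reasoning_tokens"]
    return (usage["input_tokens"] * input_price_per_m
            + billed_output * output_price_per_m) / 1_000_000

# The math-problem example from above: a 200-token visible answer
# backed by 3,000 hidden reasoning tokens.
usage = {"input_tokens": 500, "output_tokens": 200, "reasoning_tokens": 3000}
with_reasoning = reasoning_call_cost(usage)
visible_only = reasoning_call_cost({**usage, "reasoning_tokens": 0})
print(f"{with_reasoning / visible_only:.1f}x")  # 10.2x
```

The visible answer is a rounding error on the invoice, which is why routing simple tasks away from reasoning models is the first optimization worth making.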

Jensen Huang explicitly named reasoning tokens as a distinct pricing category in a recent conversation with Dwarkesh Patel. And the interesting thing here is that the token has become a segmented product, not a single commodity.

Here are some practical tips from Boris Cherny, who builds Claude Code:

Speculative tokens: the ones born to be discarded

If reasoning tokens are expensive because they add thinking, speculative tokens are strange for the opposite reason: they are generated specifically so that most of them can be thrown away.

This is one of the most counterintuitive ideas in the token zoo, and by 2026 it has become production-standard at most major inference providers.

Here is the problem it solves →
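The problem is the sequential decode bottleneck from earlier: each output token normally costs its own forward pass of the big model. Speculative decoding attacks this with a draft-and-verify loop – a cheap draft model proposes several tokens at once, the target model verifies them (in a real system, all in a single forward pass; here, token by token for clarity), and only the agreeing prefix survives. Everything below is a toy sketch; real systems use probabilistic acceptance, not exact match:

```python
def speculative_step(draft_propose, target_predict, context, k=4):
    """One toy round of speculative decoding.

    draft_propose(context, k) -> k candidate tokens from a cheap model.
    target_predict(context)   -> the token the big model would emit next.

    Accepts the longest prefix of the draft that the target agrees with,
    then appends one target token. Rejected draft tokens are discarded --
    they were generated purely to buy speed.
    """
    draft = draft_propose(context, k)
    accepted = []
    for tok in draft:
        if target_predict(context + accepted) == tok:
            accepted.append(tok)  # target agrees: keep the draft token
        else:
            break                 # first disagreement: discard the rest
    # The target always contributes the next token after the accepted prefix,
    # so every round makes at least one token of progress.
    accepted.append(target_predict(context + accepted))
    return accepted

# Toy character-level models: the target spells "tokens!", while the
# draft guesses "toke" correctly and then diverges.
target = lambda ctx: "tokens!"[len(ctx)]
draft = lambda ctx, k: list("tokeXY"[len(ctx):len(ctx) + k])
print(speculative_step(draft, target, []))  # ['t', 'o', 'k', 'e', 'n']
```

When the draft agrees often, one target pass yields several tokens; when it diverges early, you still get one token per round, and the wasted draft tokens are the price of the bet.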

