What Is a Token in AI? Tokenization, Context & Cost

What is a token in AI?

A token is the basic unit of text an AI model actually reads and predicts. Before a model can process language, raw text is split into tokens, converted into token IDs, mapped to vectors, and then processed through attention to generate the next token. Tokens matter far beyond model internals: they determine context length, latency, memory use, and API cost, which is why they have become the core technical and economic unit of modern AI systems.


Key concepts in this article:

Token: the discrete unit a model reads and writes.
Tokenization: the process of converting raw text into model-readable tokens.
Subword tokenization: the most common modern approach, balancing vocabulary size and flexibility.
BPE / WordPiece / SentencePiece: the main tokenization families used in LLMs.
Context window: the total number of tokens a model can handle in one pass.
Input / output / cached / reasoning tokens: the token categories that shape latency and pricing.
Token economics: the link between model design, inference cost, and user experience.

With so much happening in AI, going back to basics can be surprisingly grounding. It helps reduce anxiety and gives you a firmer grasp of the technology you’re dealing with. So today, we’re doing pure AI 101 and talking about The Token. That little fellow that runs AI. It sounds simple, but it isn’t. If you want to understand generative AI, you need a clear understanding of what a token actually is. Today we’re going to dive into the very, very basics of what a token is, how tokens are formed, why the type of tokenization matters, and how tokens became the new currency. Let’s go!

Surprising result: you feel much more confident after reading this article.

In today’s episode:

What is a Token?
Tokenization: How to get tokens from the text
How AI models process tokens
Why tokens are so important
The economics of tokens
What about token economy in open models?
Conclusion
Sources and further reading

What is a Token in AI?

We deal with tokens every day, but what exactly is a token? How does a model “see” information? Or does it see it at all?

Not in the way we do. A model does not see words, sentences, or meaning in any human sense. It starts with tokens: small units into which text is broken before it can be processed at all. Those tokens are converted into IDs, then into vectors, and only from there does the model begin to work with the input.

That may sound technical, but it has very practical consequences. Tokens are not only the unit a model reads. They are also the unit that shapes how much text the model can handle, how fast it responds, how much memory it uses, and how much it costs to run. That is why tokens ended up becoming the basic currency of generative AI.

And this is where things get slightly counterintuitive: a token is not the same as a word. It can be a whole word, part of a word, a punctuation mark, a space, or a character sequence the model has learned to treat as one unit. In other words, before a model can generate language, language first has to be broken into pieces the model can count.

Image Credit: OpenAI

In practice, common words are often a single token, while rarer or longer words may be split into smaller pieces, such as encod + ing. This is how models stay flexible: instead of memorizing every possible word, they learn to work with reusable parts.

There is a useful English-language rule of thumb from OpenAI: one token is roughly four characters, or about three-quarters of a word, and one to two sentences are around 30 tokens. But that is only a rough guide. The actual count depends on the tokenizer and the language. The same idea expressed in another language may take more tokens, which is one reason token costs are not experienced equally across languages.
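To make the rule of thumb concrete, here is a minimal sketch of a token estimator. The function names and default ratios are illustrative assumptions based on the figures quoted above. It is not a real tokenizer, so treat its output as a ballpark for English text only.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate: about four characters per token (English rule of thumb)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


def estimate_tokens_from_words(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough estimate: one token is about three-quarters of a word."""
    return math.ceil(word_count / words_per_token)


if __name__ == "__main__":
    sample = "Tokens are the basic unit of text a model reads."
    print(estimate_tokens(sample))
    print(estimate_tokens_from_words(75))  # 75 words -> 100 tokens
```

For anything that affects billing or context limits, count with the actual tokenizer of the model you use, because the ratio shifts with the tokenizer and the language.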

English is often tokenized into words and subword pieces, because spaces clearly separate words and longer terms can be broken into reusable chunks. Chinese works differently: words are not separated by spaces, and single characters often already carry meaning, so tokenization tends to stay closer to the character level. That is one reason the same sentence can produce a very different token count in English and Chinese.

So how does raw text become something a model can process? Through →

Tokenization: How to get tokens from the text

Before a model can process text, the text needs to be tokenized. And this is where things start to get interesting. Modern AI systems usually do subword tokenization, which sits between word-level and character-level tokenization. That compromise solves an important problem: it keeps the vocabulary relatively compact, while still allowing the model to handle rare or unseen words by breaking them into smaller meaningful pieces.

This detail is important because the way text is split changes how efficiently a model reads, how much context it can fit, and sometimes even how well it performs across languages. The three most important tokenization approaches to know are Byte Pair Encoding (BPE), WordPiece, and SentencePiece.

Let’s unfold each of them (plus a couple more).

Byte Pair Encoding (BPE)

BPE is a method that builds tokens step by step by merging the character pairs that appear most often together. It starts with individual characters, so every word is first split into letters. Then it looks across the training text, finds the most common adjacent pair, and merges it into a new unit. This process repeats again and again for a set number of steps. So if “t” and “h” often appear together, they may become “th”. Later, if “th” and “e” appear together often enough, they may become “the”. Over time, frequent pieces of words become single tokens.

Image Credit: “Neural Machine Translation of Rare Words with Subword Units” paper

In practice, a BPE tokenizer is defined by a learned list of merge rules. Frequent letter combinations become single tokens, while rare words are only merged part of the way. Very common words may end up as one token. At inference time, the model does not relearn any of this. It simply applies the stored merge rules to new text in the same order they were learned during training.
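The training loop and the inference step described above can be sketched in a few lines. This is a toy version under simplifying assumptions: the tiny corpus and the function names are made up for illustration, and it skips the end-of-word marker and pre-tokenization that production tokenizers use.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs: dict, num_merges: int) -> list:
    """Learn an ordered list of merge rules from word frequencies."""
    # Every word starts out split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def apply_bpe(word: str, merges: list) -> list:
    """Tokenize a new word by replaying the merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = {"the": 5, "then": 3, "this": 2}  # hypothetical word counts
    rules = learn_bpe(corpus, num_merges=2)
    print(rules)                     # [('t', 'h'), ('th', 'e')]
    print(apply_bpe("them", rules))  # ['the', 'm']
```

Note that "them" never appeared in the toy corpus, yet it still tokenizes cleanly: the learned piece "the" is reused and the leftover "m" stays a single character.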

BPE is widely used in models such as GPT, RoBERTa, LLaMA, and Mistral.

Because BPE is so widely used, a few advanced variants are also worth knowing.

BPE-dropout makes tokenization less rigid during training. It randomly skips some merge steps, even when those merges are allowed, so the same word can be split in slightly different ways each time. For example, instead of always becoming “unrelated,” a word might sometimes stay split as “un + relate + d.” This variability helps the model learn word structure better and become more robust, while still using standard BPE at test time.

Image Credit: BPE-Dropout original paper
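A rough way to picture BPE-dropout in code: replay the merge rules as usual, but skip each individual merge with some probability. The sketch below is a simplified reading of the idea, not the reference implementation from the paper. The function name, the default probability, and the example rules are all assumptions.

```python
import random


def apply_bpe_dropout(word: str, merges: list, p: float = 0.1, rng=None) -> list:
    """Apply BPE merges in order, dropping each possible merge with probability p."""
    if rng is None:
        rng = random.Random()
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            can_merge = (
                i + 1 < len(symbols)
                and symbols[i] == left
                and symbols[i + 1] == right
            )
            # Merge only if the rule matches AND the merge survives the dropout draw.
            if can_merge and rng.random() >= p:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


if __name__ == "__main__":
    rules = [("t", "h"), ("th", "e")]  # hypothetical learned merges
    for seed in range(3):
        print(apply_bpe_dropout("the", rules, p=0.5, rng=random.Random(seed)))
```

With p set to 0 this behaves like standard BPE, which matches how the method is used at test time. With p set to 1 every merge is skipped and the word stays split into characters.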

Byte-level BPE goes one level lower and works with bytes instead of characters or words. That means nothing is truly out of vocabulary: even unusual words, emojis, or text from different writing systems can still be broken down and processed. It is especially useful for messy text and for languages where standard tokenization can become inefficient.

Image Credit: Bit-level BPE original paper
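To see why nothing is out of vocabulary at the byte level, it helps to look at the raw bytes. The sketch below only shows the base layer, with UTF-8 assumed as the encoding. A real byte-level BPE tokenizer would then learn merges on top of these 256 possible values. The function names are illustrative.

```python
def to_byte_tokens(text: str) -> list:
    """Split text into its UTF-8 bytes; every value is in the range 0..255."""
    return list(text.encode("utf-8"))


def from_byte_tokens(byte_ids: list) -> str:
    """Rebuild the original text from its byte sequence."""
    return bytes(byte_ids).decode("utf-8")


if __name__ == "__main__":
    print(to_byte_tokens("hi"))               # [104, 105]
    print(len(to_byte_tokens("\u00e9")))      # 2 bytes for the accented letter é
    print(len(to_byte_tokens("\U0001F642")))  # 4 bytes for one emoji
```

The flip side is visible here too: a single emoji costs four base units before any merges, which is one reason text outside the tokenizer's best-covered languages can end up longer in tokens.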

WordPiece

WordPiece was introduced by Google and is best known from the BERT family of models. It is similar to BPE, but it follows a slightly different logic.

Like BPE, WordPiece starts from smaller units and builds larger ones over time. It also keeps track of whether a piece appears at the beginning of a word or inside it. So a word like “word” might be split as w ##o ##r ##d, where ## means the piece continues the same word.

What makes WordPiece interesting is that it does not simply merge the most frequent pairs. It prefers pairs that appear together often, but whose individual parts are less common on their own. In other words, it tends to favor more specific combinations. That often leads to subword units that feel a bit more meaningful, such as common stems, prefixes, or suffixes.
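One common way to write down this preference, following the Hugging Face description of WordPiece listed in the sources, is to score each candidate pair by its frequency divided by the product of the frequencies of its parts. The counts below are invented purely for illustration.

```python
def wordpiece_score(pair_freq: int, first_freq: int, second_freq: int) -> float:
    """Score a candidate merge: frequent together, but rare individually, scores high."""
    return pair_freq / (first_freq * second_freq)


if __name__ == "__main__":
    # Hypothetical counts: "t"+"h" is the more frequent pair,
    # but "q" almost never appears without "u".
    th = wordpiece_score(pair_freq=50, first_freq=500, second_freq=100)
    qu = wordpiece_score(pair_freq=20, first_freq=20, second_freq=200)
    print(th, qu)
    print("merge qu first" if qu > th else "merge th first")
```

Plain BPE would merge "th" first because it has the higher raw count. Under this score, "qu" wins because its parts are rarely seen apart.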

For example, WordPiece might split “playing” into play + ##ing, or “lowest” into low + ##est, reusing common pieces instead of treating every word as completely separate.

At inference time, WordPiece does not replay the whole training process. It uses the final vocabulary and splits a word by choosing the longest matching piece from left to right. If it cannot find a match, the word is turned into an unknown token.
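The left-to-right, longest-match-first procedure is short enough to sketch directly. The vocabulary here is a made-up toy set and "[UNK]" is assumed as the unknown token, following the BERT convention.

```python
def wordpiece_tokenize(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first split of a single word using a fixed vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


if __name__ == "__main__":
    toy_vocab = {"play", "low", "##ing", "##est", "##s"}
    print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
    print(wordpiece_tokenize("lowest", toy_vocab))   # ['low', '##est']
    print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```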

WordPiece is a good reminder that tokenization is not only about chopping text into pieces. It is also about deciding which pieces are worth keeping together.

SentencePiece

SentencePiece tokenization, in contrast, trains directly on raw text, without assuming words are already separated by spaces. It learns subword units from the sequence itself, which makes it language-independent and a good choice for multilingual systems.

One of its most important design choices is lossless tokenization. Spaces are treated as ordinary symbols, often marked explicitly, which means the original text can be reconstructed exactly. In other words, tokenization and detokenization become reversible operations.

For example, the phrase:

Hello world

might be tokenized as:

▁Hello + ▁world

Here the special symbol ▁ marks a space before the token. That means the space is preserved inside the tokenization itself.

A longer phrase like:

I love tokenization

could become:

▁I + ▁love + ▁token + ization

or, depending on the learned vocabulary:

▁I + ▁love + ▁tokenization

Like other tokenizers, SentencePiece learns a fixed-size vocabulary during training. At inference time, text is normalized, split into subword units using the learned model, and mapped to token IDs.
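The reversibility is easy to demonstrate without the library itself. The sketch below only mimics the space-marking convention from the examples above. It does not learn a vocabulary or apply normalization, and the helper names are hypothetical.

```python
SPACE_MARK = "\u2581"  # the "▁" symbol used to make spaces visible


def mark_spaces(text: str) -> str:
    """Turn spaces into visible marks, with one leading mark as in the examples above."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)


def detokenize(pieces: list) -> str:
    """Join the pieces, restore the spaces, and drop the leading one."""
    return "".join(pieces).replace(SPACE_MARK, " ")[1:]


if __name__ == "__main__":
    pieces = ["\u2581I", "\u2581love", "\u2581token", "ization"]
    print(detokenize(pieces))  # I love tokenization
    print("".join(pieces) == mark_spaces("I love tokenization"))  # True
```

Because the space lives inside the token, no separate rule is needed to decide where spaces go when decoding. Both segmentations of "tokenization" shown above decode to exactly the same string.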

Other tokenization approaches worth knowing

There are many ways to divide text into tokens, depending on what you want the system to do:

An n-gram tokenizer breaks text into small, overlapping chunks of characters of a fixed length. For example, the word “search” with 3-grams becomes “sea”, “ear”, “arc”, “rch”. These chunks slide across the word one step at a time. Smaller n-grams give more flexibility, while larger ones give more precision. This approach is often useful in search, autocomplete, and spell checking.
A unigram language model tokenizer takes a probabilistic approach. It starts with a large set of possible subwords and learns how likely each one is. Since a word or sentence can often be split in multiple ways, the tokenizer can choose the most likely segmentation or sample different ones during training. For example, for “unhappiness” it might choose un + happi + ness over u + n + happiness or unhappi + ness. That flexibility is one reason it works well with subword regularization.
There are also hybrid tokenization approaches designed for morphologically rich languages such as Turkish, where words are built from roots, suffixes, and prefixes. In those cases, the tokenizer may first split a word using linguistic structure and only then fall back to subword methods for anything unfamiliar.

Image Credit: “Tokens with Meaning: A Hybrid Tokenization Approach for Turkish” paper

So instead of treating the whole word as one unit, the tokenizer separates it into meaningful parts: root plus suffixes. That is useful in languages where a lot of meaning is packed into endings.
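The first two approaches in the list above are simple enough to sketch. The n-gram function slides a fixed-size window over the word. The unigram function picks the most likely segmentation with a small dynamic program (Viterbi search). The piece probabilities are invented for illustration, where a trained unigram tokenizer would learn them from data.

```python
import math


def char_ngrams(word: str, n: int = 3) -> list:
    """Overlapping character n-grams, sliding one step at a time."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def best_segmentation(word: str, piece_probs: dict):
    """Most likely split of `word` into known pieces, or None if no split exists."""
    # best[i] holds (log-probability, pieces) for the best split of word[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in piece_probs and best[start][1] is not None:
                score = best[start][0] + math.log(piece_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]


if __name__ == "__main__":
    print(char_ngrams("search"))  # ['sea', 'ear', 'arc', 'rch']
    probs = {  # hypothetical learned probabilities
        "un": 0.1, "happi": 0.05, "ness": 0.1,
        "u": 0.01, "n": 0.01, "happiness": 0.001, "unhappi": 0.0001,
    }
    print(best_segmentation("unhappiness", probs))  # ['un', 'happi', 'ness']
```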

Overall, tokenization is best understood as a learned compression scheme for language: it packs common patterns into single units and leaves uncommon strings as combinations of smaller units.

The next stage is what to do with these tokens →


How AI models process tokens

Since the most common architecture is the Transformer, we’ll look at how a token moves through the model using it as an example. Here is a brief end-to-end “token lifecycle” typical for modern LLMs (we’ll cover this in more detail in upcoming episodes).

Once text is tokenized, each token is turned into a numerical ID. For the model, that token is now the basic unit of computation.

From token to vector

The model does not work directly with token IDs. It first maps each one to an embedding: a learned vector of numbers. You can think of this as the model’s internal representation of that token.

But tokens also need order. Without that, the model would not know the difference between “dog bites man” and “man bites dog.” So Transformers add positional information to each token. Different models do this in different ways, including rotary positional encoding, but the core idea stays the same: the model tracks position at the token level.
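Here is a minimal sketch of those two steps: look up an embedding for each token ID, then add position information. For simplicity it uses the fixed sinusoidal encoding from “Attention Is All You Need” rather than rotary encoding. The tiny embedding table is a stand-in for learned weights.

```python
import math


def sinusoidal_position(pos: int, dim: int) -> list:
    """Fixed sinusoidal position vector: sine on even indices, cosine on odd ones."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def embed(token_ids: list, embedding_table: list) -> list:
    """Map each token ID to its vector and add the vector for its position."""
    dim = len(embedding_table[0])
    out = []
    for pos, token_id in enumerate(token_ids):
        token_vec = embedding_table[token_id]
        pos_vec = sinusoidal_position(pos, dim)
        out.append([t + p for t, p in zip(token_vec, pos_vec)])
    return out


if __name__ == "__main__":
    # Hypothetical 3-token vocabulary: 0 = "dog", 1 = "bites", 2 = "man".
    table = [[0.0] * 4, [1.0] * 4, [2.0] * 4]
    dog_first = embed([0, 1, 2], table)  # "dog bites man"
    man_first = embed([2, 1, 0], table)  # "man bites dog"
    print(dog_first != man_first)  # True: same tokens, different order
```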

How tokens interact

Then comes self-attention, one of the central ideas behind the Transformer. This is the stage where each token looks at other tokens in the same sequence to figure out what matters for its meaning.

This is where things get interesting. A token does not carry a fixed meaning on its own. Its representation changes depending on the tokens around it. That is why the word “bank” can point to finance in one sentence and to the side of a river in another. Context changes the token.

This is also why tokens are such a big deal in AI. Much of the model’s computation is built around token-by-token interactions across the sequence.
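A stripped-down version of scaled dot-product self-attention shows how context changes a token’s representation. To keep it short, this sketch uses the input vectors directly as queries, keys, and values. A real Transformer applies learned projections first, uses multiple heads, and (in generative models) masks future tokens. The toy vectors are invented.

```python
import math


def softmax(xs: list) -> list:
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors: list) -> list:
    """Each output is a weighted mix of all vectors, weighted by similarity to the query."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    bank, river, money = [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]  # toy 2-d vectors
    print(self_attention([bank, river])[0])  # "bank" next to "river"
    print(self_attention([bank, money])[0])  # "bank" next to "money"
```

The input vector for “bank” is identical in both calls, yet the output differs because the neighbouring token differs.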

How the model generates text

At inference time, the model predicts the next token, then the next one, then the next one. This is called autoregressive generation. It does not produce a full sentence in one shot. It generates a stream of token predictions, which is then decoded back into human-readable text.

That is also why response length is measured in output tokens. The model is literally building the answer one token at a time.
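The generation loop itself is simple. In the sketch below, the “model” is a hypothetical lookup table that only looks at the last token. A real LLM would compute a probability distribution over the whole vocabulary from the full sequence and then pick or sample from it.

```python
TOY_VOCAB = ["<eos>", "the", "cat", "sat"]  # hypothetical vocabulary
TOY_NEXT = {1: 2, 2: 3, 3: 0}               # the -> cat -> sat -> <eos>


def toy_next_token(ids: list) -> int:
    """Stand-in for a model: predict the next token ID from the sequence so far."""
    return TOY_NEXT.get(ids[-1], 0)


def generate(prompt_ids: list, next_token_fn, max_new_tokens: int, eos_id: int) -> list:
    """Autoregressive loop: append one predicted token at a time."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)
        ids.append(next_id)  # the new token becomes part of the next step's input
        if next_id == eos_id:
            break
    return ids


if __name__ == "__main__":
    out = generate([1], toy_next_token, max_new_tokens=10, eos_id=0)
    print(out)                                   # [1, 2, 3, 0]
    print(" ".join(TOY_VOCAB[i] for i in out))   # the cat sat <eos>
```

The `max_new_tokens` cap is the code-level version of an output token limit: generation stops either at the end-of-sequence token or when the budget runs out.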

It sounds mechanical, but it is also one of the most fascinating parts of modern AI: everything we experience as fluent language begins as a long sequence of token guesses.

It may be obvious now, but anyway →

Why tokens are so important

The token is the fundamental unit of AI, and it is ubiquitous. Every prompt consumes input tokens; every model response consumes output tokens; long conversations reuse or reprocess old tokens; and tool use and extended reasoning add even more tokens. Transformers and other AI models operate natively at the token level.

At every stage of work, tokens shape model quality, and tokenization matters here. Word-level vocabularies explode in size; character-level vocabularies create very long sequences with weak semantics. Subword tokenization is the compromise that makes LLMs more practical: small enough vocabularies to train efficiently, expressive enough pieces to handle open vocabulary.

Moreover, tokens also determine how much a model can “remember” in one pass. A model’s context window is measured in tokens, not pages or words. It is a unified measure that is convenient for everyone.

The context window is the maximum amount of tokenized information the model can consider at once (typically including conversation history and output budget).

A common context window size is 128k tokens, and there’s a trend toward making windows as large as possible.

With a larger context window you can use longer conversations, documents, or code without forgetting earlier parts. That also means a long PDF, a codebase, or a chat history is fundamentally a budgeting problem over tokens.
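Treating context as a budget can be sketched like this. It assumes that input and reserved output share one window, which is typical but not universal, and uses the simplest possible policy of dropping the oldest messages first. The function names and the 128k default are illustrative.

```python
def fits_context(input_tokens: int, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    """Check whether the prompt plus the reserved output budget fits in the window."""
    return input_tokens + max_output_tokens <= context_window


def trim_history(message_token_counts: list, max_output_tokens: int,
                 context_window: int = 128_000) -> list:
    """Drop the oldest messages until the rest plus the output budget fits."""
    budget = context_window - max_output_tokens
    kept = list(message_token_counts)
    while kept and sum(kept) > budget:
        kept.pop(0)  # the oldest message goes first
    return kept


if __name__ == "__main__":
    print(fits_context(120_000, 8_000))  # True
    print(fits_context(125_000, 8_000))  # False
    print(trim_history([50, 40, 30], max_output_tokens=20, context_window=100))  # [40, 30]
```

Real systems often use smarter policies, such as summarizing old turns or retrieving only the relevant parts of a document, but the arithmetic underneath is the same.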

So yes, tokens ARE about the money: tokens are how AI is priced. Context limits, output limits, rate limits, and API billing are all expressed in tokens. So tokens are not just how models read. They are also how the whole system is measured, constrained, and sold.

The economics of tokens

Tokens are, in every practical sense, the pricing unit of generative AI. They determine how much an AI call costs, how much context you can fit into a prompt, and often how fast the system responds. That creates a strange but important reality: in AI, language has become metered infrastructure. Every instruction, every uploaded document, every example, every generated paragraph is turned into tokens and counted.

We still tend to talk about language as if it were free. In generative AI, it is not. That shift has even changed how strategists talk about AI investment: Satya Nadella's concept of token capital treats token usage as a form of proprietary AI capability a firm builds and owns.

Most providers charge separately for input and output tokens, and output is usually more expensive because it is generated one token at a time, while input tokens can be processed in parallel. Some systems also count cached tokens at a reduced rate, and some expose reasoning or thinking tokens as a separate category.

To make that clearer:

Input tokens are tokens you send in the request.
Output tokens are tokens generated.
Cached tokens are tokens reused from prior context (often billed at reduced rates where prompt/context caching exists).
Reasoning/thinking tokens are additional internal “thinking” usage that some systems expose as a distinct count; some providers explicitly state that output pricing includes those thinking tokens.

Here are the prices for the most popular models (as of April 15, 2026):

This creates a strong incentive to be efficient. Reuse context where possible. Avoid unnecessary output. Keep prompts clear, but not bloated. A big part of working with AI is now operational: how well can you package and reuse tokens?
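A small cost calculator makes the incentive visible. The prices below are placeholders, not any provider’s actual rates. The sketch also assumes cached tokens are a subset of input tokens and that any thinking tokens are already included in the output count, which matches some providers but not all.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Prices in dollars per one million tokens (hypothetical values)."""
    input_per_million: float
    output_per_million: float
    cached_input_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Cost of one request; cached tokens are billed at the reduced rate."""
    fresh_input = input_tokens - cached_tokens
    total = (
        fresh_input * pricing.input_per_million
        + cached_tokens * pricing.cached_input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    demo = Pricing(input_per_million=2.0, output_per_million=8.0,
                   cached_input_per_million=0.5)
    print(request_cost(demo, input_tokens=10_000, output_tokens=1_000))
    print(request_cost(demo, input_tokens=10_000, output_tokens=1_000,
                       cached_tokens=8_000))
```

With these made-up rates, caching 8,000 of the 10,000 input tokens drops the request from about $0.028 to about $0.016, which is why reusing a stable prompt prefix pays off at volume.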

There is another twist here. Since providers charge per token, the “same” 1,000-character paragraph can cost more or less depending on the tokenizer, model, and language. Tokenization efficiency, meaning how many characters are packed into one token, directly changes both cost and latency for the same human-visible text.

That opens a bigger question: how much are companies optimizing tokenization and pricing for overall efficiency, and how much are they optimizing it for revenue?

What about token economy in open models?

Open models play a different role in token economics. They do not make money by charging you per token the way API providers do. Instead, tokens become your cost: the more tokens you process, the more compute, electricity, and infrastructure you need.

That changes where the optimization burden sits. With closed APIs, the provider handles inference infrastructure and sells you a managed token economy – you pay a predictable per-token price, and someone else worries about GPU utilization, batching, and serving efficiency. With open models, the main question shifts from "How much does this provider charge per million tokens?" to "How efficiently can my team run this model?"

And that question has a wide range of answers. Two companies deploying the same open model can end up with wildly different per-token costs depending on their optimization stack: quantization, batching strategy, KV cache management, speculative decoding, choice of serving framework. Open models do not eliminate token economics – they turn it into an engineering problem. The skill gap between teams is where the real cost divergence happens.
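The divergence is easy to see with back-of-the-envelope arithmetic. The sketch below derives a per-million-token cost from a GPU’s hourly price and the throughput a team actually achieves. All numbers are hypothetical, and the formula ignores engineering time, storage, and networking.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Dollars per one million tokens for a self-hosted deployment."""
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Same model, same hypothetical $2.00/hour GPU, different serving stacks.
    baseline = cost_per_million_tokens(2.0, tokens_per_second=1000, utilization=0.5)
    tuned = cost_per_million_tokens(2.0, tokens_per_second=2500, utilization=0.8)
    print(round(baseline, 3), round(tuned, 3))
```

Better batching, quantization, or cache management shows up in this formula as higher throughput or utilization, so the optimization stack translates directly into the per-token price a team effectively pays.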

It is worth noting that the line between open and closed is not as clean as "run it yourself vs. pay per token." A growing ecosystem sits between 'run it yourself' and 'use a closed API.' Inference providers like Together and FriendliAI host open models and charge per token. Hugging Face serves as both the primary distribution hub for open models and an inference provider in its own right. GPU cloud platforms like CoreWeave and Lambda make it easier to self-deploy. Because anyone can stand up a competing endpoint with the same model, competition drives pricing down across both layers.

The reasons companies choose open models go beyond price. Data privacy, fine-tuning for domain-specific tasks, latency control, and avoiding dependency on a single provider all factor in. And the quality gap is narrowing: models like Llama, Qwen, and DeepSeek are competitive with proprietary systems on many tasks. The trade-off is less "cheaper but worse" and more "comparable, but you carry the engineering investment."

In practice, many companies end up with a hybrid setup – but not simply cheap open models for bulk and expensive closed models for quality. The split often follows a different logic: open models where customization, privacy, or volume economics justify the infrastructure investment; closed models where managed convenience, frontier capability, or speed of deployment matter more.

Conclusion

Why talk about the very basics of tokens? Because this is the foundation you cannot skip. Tokens drive how models work, how efficiently they run, how much context they can handle, and how much everything costs. Just as a model cannot process text without first breaking it into tokens, you cannot make good decisions about AI systems without understanding what tokens actually are and how they move through the stack.

Today, tokens are the unit that links model design, user experience, and business cost. They show up everywhere: in context windows, in latency budgets, in API pricing, in the engineering trade-offs between open and closed models. And the economy around them is still shifting — providers competing on price, open model teams competing on optimization, inference platforms compressing margins in between.

A good way to summarize where we are: tokens are now what bandwidth was for the early web and what compute-hours were for cloud infrastructure. And the smartest applications are often not the ones that send the most tokens, but the ones that decide which tokens are worth sending at all.


Sources and further reading

What are tokens and how to count them? | OpenAI article
Neural Machine Translation of Rare Words with Subword Units | Paper
BPE-Dropout: Simple and Effective Subword Regularization | Paper
Bit-level BPE: Below the byte boundary | Paper
WordPiece tokenization | Hugging Face article
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing | Paper
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates | Paper
Tokens with Meaning: A Hybrid Tokenization Approach for Turkish | Paper
How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis | Paper
Attention Is All You Need | Paper
Context windows | Anthropic Docs