Quick answer: What is LLM inference?

LLM inference is the runtime process that turns a user prompt into a model answer. In a few seconds, the system tokenizes text, maps tokens into embeddings, computes attention, stores KV cache, retrieves extra context when needed, and generates the response one token at a time.

TL;DR: LLM inference is not just next-token prediction. A prompt moves through tokenization, embeddings, prefill, attention, KV cache, decode, batching, retrieval, and memory layers before becoming an answer. Modern inference is a latency, cost, and orchestration problem.

Over the past couple of years, inference has evolved from “the model just generates tokens” into one of the most complex engineering systems in AI. While you wait 2–3 seconds for a response, dozens of mechanisms are already working behind the scenes: tokenization, embeddings, attention, KV cache, request routing, retrieval, batching, memory management, and entire optimization pipelines.

In one of our earlier articles, we explained the core fundamentals of inference: key concepts, optimization techniques, and hardware trends. But that was a year ago, and the focus of the field is shifting extremely fast.

Inference now is more about system orchestration – a coordinated runtime system where all elements work together to produce an answer under latency and cost constraints.

Today we’re going to put all the pieces together into one pipeline. You’ll see the full path from tokens to generated answers, and we’ll answer the most interesting question: what actually happens in the 2.5 seconds between your prompt and the model’s response?

There is more going on there than most people realize.

In today’s episode:

  • LLM inference in two phases: prefill and decode

  • Prefill unpacked

    • The first layer: Tokens as the runtime currency

    • Embeddings: From token IDs to meaningful geometry

    • Attention: Where representations become context and prefill meets decode

  • What’s behind decode? The role of attention and KV cache

  • Context is not only inside the model

  • Inference optimization: batching, chunking, and parallelism

  • Why modern inference is system orchestration

  • Why attention is not the same as understanding

  • Sources and further reading

LLM inference in two phases: prefill and decode

When you write a prompt and send it to a model, a surprisingly complex pipeline starts running. But at the core, the process has two main stages: first, the model processes your request; then, it generates the response. One stage flows directly into the other:

  1. Prefill – this is the first stage, when the model reads the entire prompt and builds understanding of the context. Since all prompt tokens are already known, this step can be heavily parallelized and runs very fast on the GPU. Then prefill flows into →

  2. Decode – the model generates the response one token at a time. Each new token depends on the previous ones, so this stage is mostly sequential and slower.

The first output token usually takes the longest, because the model must process the whole prompt before it can emit anything. After that, generation becomes a steady stream of tokens.
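To make the two stages concrete, here is a minimal sketch of a single attention head with a KV cache, written in plain Python. Everything in it is a toy assumption: `project` stands in for learned projections, `next_token` stands in for the real sampling step, and the sizes are arbitrary. It only illustrates the control flow: prefill fills the cache for all prompt tokens in one pass, then each decode step adds exactly one key/value pair and attends over the cache.

```python
import math

def project(token_id, seed, dim=4):
    """Hypothetical stand-in for a learned projection: one deterministic vector per token."""
    return [math.sin(token_id * (i + 1) * 0.1 + seed) for i in range(dim)]

def attend(query, keys, values):
    """Scaled dot-product attention of a single query over all cached positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def next_token(context, vocab_size=50):
    """Toy rule (assumption) that maps a context vector to a token ID."""
    return int(abs(sum(context)) * 1000) % vocab_size

def prefill(prompt_ids):
    """All prompt tokens are known, so their keys and values can be computed in one pass."""
    keys = [project(t, 1.0) for t in prompt_ids]
    values = [project(t, 2.0) for t in prompt_ids]
    context = attend(project(prompt_ids[-1], 0.0), keys, values)
    return keys, values, next_token(context)  # the first output token comes out of prefill

def decode_step(token_id, keys, values):
    """Compute one new key/value pair, append it, and reuse everything already cached."""
    keys.append(project(token_id, 1.0))
    values.append(project(token_id, 2.0))
    return next_token(attend(project(token_id, 0.0), keys, values))

def generate(prompt_ids, max_new_tokens):
    keys, values, current = prefill(prompt_ids)
    output = [current]
    for _ in range(max_new_tokens - 1):  # sequential: each step needs the previous token
        current = decode_step(current, keys, values)
        output.append(current)
    return output, len(keys)
```

Calling `generate([3, 14, 15], 5)` returns five token IDs and a cache holding seven entries: three from prefill plus one for each decode step. Without the cache, every step would have to recompute keys and values for the whole sequence.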

When many users send requests at once, inference systems try to balance several goals:

  • low latency, meaning fast responses

  • high throughput to serve many users efficiently

  • GPU memory efficiency

  • and high GPU utilization.

Speaking of latency, we need to distinguish between two important metrics:

| Metric | What it measures | Main stage | What it affects |
| --- | --- | --- | --- |
| Time to First Token (TTFT) | The time between sending a prompt and receiving the first generated token | Mostly prefill latency | How fast the model starts responding |
| Time per Output Token (TPOT) | The average time required to generate each token after the first one | Mostly decode latency | How fast the response streams after generation begins |

So, total latency is approximately: TTFT + (TPOT × number of output tokens).
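As a quick worked example of this approximation, the sketch below plugs in illustrative numbers. The TTFT and TPOT values are assumptions chosen for the arithmetic, not measurements of any real system.

```python
def total_latency_s(ttft_s, tpot_s, output_tokens):
    """Approximate end-to-end latency: TTFT + TPOT x number of output tokens."""
    return ttft_s + tpot_s * output_tokens

def streaming_speed_tokens_per_s(tpot_s):
    """Steady-state streaming speed once decoding has started."""
    return 1.0 / tpot_s

# Assumed values: 0.5 s to first token, 20 ms per output token, 100 output tokens.
latency = total_latency_s(ttft_s=0.5, tpot_s=0.02, output_tokens=100)
speed = streaming_speed_tokens_per_s(0.02)
print(f"total latency ~ {latency:.2f} s, streaming at ~ {speed:.0f} tokens/s")
```

With these assumed numbers, decode accounts for about 2 of the roughly 2.5 seconds, which is why long answers are dominated by TPOT while short answers are dominated by TTFT.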

On the hardware side, the key detail is that prefill is compute-bound, while decode is memory-bandwidth-bound.
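A rough back-of-the-envelope check shows where this split comes from. The sketch below uses the common approximation that a dense model performs about 2 FLOPs per parameter per token, and that the weights are read from memory once per forward pass. It ignores KV cache traffic and attention FLOPs, and the hardware figures are hypothetical round numbers, not the specification of any real GPU.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs performed per byte of weights read in one forward pass.

    About 2 FLOPs per parameter per token, divided by bytes_per_param bytes
    per parameter. The parameter count cancels out.
    """
    return 2 * tokens_per_pass / bytes_per_param

def bound_regime(tokens_per_pass, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Compare arithmetic intensity with the hardware's FLOPs-per-byte ridge point."""
    ridge = peak_flops / mem_bandwidth
    if arithmetic_intensity(tokens_per_pass, bytes_per_param) >= ridge:
        return "compute-bound"
    return "memory-bandwidth-bound"

# Hypothetical accelerator: 1e15 FLOP/s peak compute, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

print("prefill, 2048 prompt tokens:", bound_regime(2048, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode, 1 token per step:  ", bound_regime(1, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions, prefill does thousands of FLOPs for every byte of weights it reads, while a single-request decode step does about one. That gap is also why batching many decode requests together raises GPU utilization.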

But why does each phase use the GPU differently? To understand that, and how systems can be optimized for efficiency and lower GPU usage, we need to look at how the components of the LLM workflow – tokenization, embeddings, attention, and others – are distributed across prefill and decode.

There’s much more interesting stuff behind this pipeline than just a sequence of steps for processing text and generating responses.

Prefill unpacked

The first layer: Tokens as the runtime currency

Let’s start from the very beginning. Before a model can process or generate anything, a tokenizer splits raw text into smaller pieces called tokens and converts them into numerical IDs. Depending on the tokenizer, a token can be a whole word, part of a word, punctuation, whitespace, or even a byte sequence. The goal is for tokens to be small enough to generalize across texts, yet meaningful enough to preserve structure and semantics. In production, tokenization is effectively a learned compression layer sitting between human language and GPU compute.
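The idea of splitting text into pieces and mapping them to IDs can be sketched with a toy greedy longest-match tokenizer. The vocabulary below is invented for illustration, and real tokenizers (for example, BPE-based ones) learn their vocabularies from data and use different matching rules, so treat this only as a picture of the text-to-IDs step.

```python
# Hypothetical vocabulary: subword piece -> token ID.
VOCAB = {"infer": 0, "ence": 1, "token": 2, "s": 3, " ": 4, "is": 5, "fast": 6}
UNKNOWN_ID = -1  # assumed fallback ID for characters not covered by the vocabulary

def tokenize(text, vocab=VOCAB, unknown_id=UNKNOWN_ID):
    """Greedy longest-match tokenization: at each position, take the longest known piece."""
    longest_piece = max(len(piece) for piece in vocab)
    ids = []
    i = 0
    while i < len(text):
        for length in range(min(longest_piece, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            # No piece matched at this position: emit the fallback ID and move on.
            ids.append(unknown_id)
            i += 1
    return ids

print(tokenize("inference is fast"))  # [0, 1, 4, 5, 4, 6]
```

Note that the 17-character string becomes 6 tokens here. A different vocabulary would yield a different count, which is exactly how the tokenizer ends up shaping sequence length, latency, and cost.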

However, this part of the workflow is not only about counting tokens. The way text gets split defines almost everything about modern AI systems: final sequence lengths, context limits, latency, memory usage, throughput, and even pricing.

Moreover, not all tokens are equal. A system needs to “understand” exactly which kinds of tokens flow through it. An inference pipeline can involve the following token types, which behave very differently:

  • Input tokens are relatively cheap because models process them mostly in parallel during the prefill stage.

  • Output tokens are more expensive because generation is sequential: the model predicts one token at a time. They belong to the decode stage.

  • Reasoning tokens can silently multiply compute usage by generating long internal chains of thought before the final answer appears.

  • Cached tokens reduce cost by reusing previously processed context.

  • Retrieval and tool-use tokens often dominate agentic systems because every loop adds more context back into the window.
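The effect of these token types on cost can be sketched with a small accounting function. The prices are made-up placeholders, not any provider's real rates, and treating reasoning tokens as if they were output tokens is an assumption made for illustration.

```python
# Hypothetical prices in dollars per million tokens (illustrative only).
PRICE_PER_MILLION = {"input": 1.00, "cached_input": 0.10, "output": 4.00}

def request_cost(input_tokens, output_tokens, cached_tokens=0, reasoning_tokens=0,
                 prices=PRICE_PER_MILLION):
    """Estimate the cost of one request from its token mix."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_input = input_tokens - cached_tokens
    # Assumption: reasoning tokens are generated sequentially, so they are priced like output.
    generated = output_tokens + reasoning_tokens
    total = (fresh_input * prices["input"]
             + cached_tokens * prices["cached_input"]
             + generated * prices["output"])
    return total / 1_000_000

baseline = request_cost(input_tokens=1000, output_tokens=100)
with_cache = request_cost(input_tokens=1000, output_tokens=100, cached_tokens=800)
with_reasoning = request_cost(input_tokens=1000, output_tokens=100, reasoning_tokens=900)
print(f"baseline ${baseline:.5f}, cached ${with_cache:.5f}, reasoning ${with_reasoning:.5f}")
```

Under these placeholder prices, caching most of the prompt roughly halves the cost of the request, while 900 hidden reasoning tokens more than triple it.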

This strongly influences how people design AI systems. A long conversation, a RAG pipeline, or an autonomous agent is now fundamentally a token-management problem. The smartest systems appear to be the ones that “decide” which tokens are actually worth processing, storing, retrieving, or generating in the first place.

Tokenization happens before inference itself starts, but efficient tokenization and processing only the necessary tokens are among the main ways to reduce compute and memory use.

Tokens are what the input consists of – now let’s look at how they start to come “alive” inside the model.

Embeddings: From token IDs to meaningful geometry

After tokenization, the system only has token IDs – integers like 14382 or 5021. On their own, these numbers carry no meaning for the model; the meaning has to be recovered, and in AI that meaning is encoded in geometry.

An embedding layer maps every token ID to a dense vector – a learned coordinate in a high-dimensional space. The model then learns relationships between these representations through distance and direction. Similar concepts end up near each other, which is key to generalization (for example, from “cat” to “dog” or from “room” to “bedroom”) without memorizing every possible sentence individually.

Technically, this happens through an embedding matrix: a trainable lookup table where each token maps to a vector. During training, those initially random vectors organize into a semantic space where patterns emerge naturally.
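A minimal sketch of the lookup itself, in plain Python: the matrix here is randomly initialized (as it would be before training), and the vocabulary size and dimension are tiny arbitrary values chosen for readability. Production models use large tensors on the GPU, but the operation is the same row lookup.

```python
import math
import random

def make_embedding_matrix(vocab_size, dim, seed=0):
    """One row per token ID; values start random and would be adjusted during training."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, matrix):
    """Embedding lookup: each token ID selects its row of the matrix."""
    return [matrix[token_id] for token_id in token_ids]

def cosine_similarity(a, b):
    """Closeness in direction, the kind of geometric relation a trained space encodes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

matrix = make_embedding_matrix(vocab_size=10, dim=8)
vectors = embed([3, 7, 3], matrix)
print(len(vectors), "vectors of dimension", len(vectors[0]))
print("same token, same vector:", vectors[0] == vectors[2])
```

Before training, the similarity between two different rows is essentially noise. The semantic structure described above only appears once training has moved related tokens toward each other.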

Since models also need to know the order of tokens in the sequence, positional encodings inject that order into the representations. Many current models use RoPE (Rotary Position Embedding), which rotates the query and key vectors inside attention layers by an angle that depends on token position, so attention scores reflect the relative distance between tokens. This geometry is what the network ultimately operates on.
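The rotation can be sketched in a few lines of plain Python. This version pairs adjacent dimensions and uses a base of 10000, following the original RoPE formulation; real implementations differ in how they pair dimensions and apply the rotation to batched tensors, and the example vectors are arbitrary.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to the position."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        # Lower pairs rotate quickly, higher pairs slowly: the frequency is base^(-i / dim).
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [0.3, -1.2, 0.8, 0.5]
key = [1.0, 0.4, -0.7, 0.9]

# The same relative distance (4) at two different absolute positions.
score_near_start = dot(rope(query, 7), rope(key, 3))
score_far_along = dot(rope(query, 107), rope(key, 103))
print(abs(score_near_start - score_far_along) < 1e-9)  # True
```

The two scores match because rotating both vectors by the same extra amount leaves their dot product unchanged. Only the offset between positions matters, which is what lets attention track relative distance.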

Only after this step does the real computation begin →

Attention: Where representations become context and prefill meets decode


FAQ

What is LLM inference?

LLM inference is the process where a trained model turns an input prompt into an output answer by processing tokens, computing attention, and generating new tokens.

Prefill vs decode: what is the difference?

Prefill processes the full prompt in parallel and mostly determines time to first token. Decode generates the answer one token at a time and mostly determines streaming speed.

Why does KV cache matter in LLM inference?

KV cache stores previously computed attention keys and values so the model does not recompute the whole context for every new generated token.

Why is decode slower than prefill?

Decode is sequential: each new token depends on previous tokens. It is also often limited by memory bandwidth because the system must repeatedly read model weights and KV cache.

Why is modern inference about orchestration?

Modern inference combines model execution with routing, batching, retrieval, memory, caching, and tool use. The final answer depends on how well the whole system manages context, latency, and cost.
