Powerful reasoning models that take significant time to generate a final response remain a major focus in the AI community. They use test-time scaling to achieve state-of-the-art performance, often at the cost of slower response speeds. Our article on test-time compute gained a lot of interest from our readers, which is why we decided to dive deeper into a closely related topic – inference.

All chatbots, image generators, voice assistants, and recommendation systems rely on inference; it’s what users actually interact with. That’s why understanding the core aspects of inference is important for everyone, not just developers. Once a model is trained, inference is the stage where it demonstrates the knowledge it has learned. Making inference faster and enabling models to process many inputs simultaneously are among the most important challenges in optimization. Today, we’ll explore the fundamentals of inference – everything you need to know to feel confident with this part of a model’s function, including key concepts, workflows, types, optimization techniques, and of course, some of the latest developments in the field. For a full explainer on what LLM inference is and how it works, see: What is LLM Inference?

This article isn’t meant to be exhaustive but aims to clarify why inference matters so much in AI today.

What’s in today’s episode?

  • What is AI inference?

  • Key concepts: Latency, throughput and a little bit more

  • How Inference Works in Transformers

  • Inference in Diffusion Models

  • Types of Inference

  • Optimization Techniques

  • Current Trends in Hardware

  • Conclusion

  • Sources and further reading

What is AI inference?

During training, a model learns essential patterns from data. Once it's properly trained, it's time for the model to demonstrate its acquired “knowledge” in real-world tasks. The process that begins with prompting and ends when the model generates an answer to a query is called inference. In other words, inference is the process of using a trained model to make predictions or generate outputs.

Inference efficiency is crucial: IBM once estimated in a survey that up to 90% of an AI model’s lifetime compute consumption happens during inference rather than training.

We have even started to talk about new scaling laws, thanks to inference. In one of his recent presentations, NVIDIA CEO Jensen Huang distinguished three scaling laws shaping modern AI:

  • Pre-training scaling (training on large data sets)

  • Post-training scaling (fine-tuning the model)

  • Test-time scaling (inference-time compute)

Jensen Huang emphasized that inference scaling is becoming increasingly critical. Models now perform complex, multi-step reasoning at inference – an approach called "test-time compute" or "inference scaling." By significantly increasing the computational resources allocated at this stage, AI systems achieve better performance, accuracy, and reliability, particularly in challenging applications requiring careful thought or nuanced reasoning.

To be fluent in AI inference, we need to understand its main aspects, such as the concepts used to measure the effectiveness of inference from different perspectives and how the inference process works within the model. Let's start with the concepts first.

LLM Inference Latency, Throughput, and TTFT: Key Concepts

Key concepts of inference include:

  • Inference time is the raw measure of how long the model takes to compute the output.

  • Total generation time = full time from prompt to complete response. It is closely related to one of the most important aspects of inference →

  • Latency – time to get one complete result. It starts when the request (prompt) is sent and ends when the first output token or the full response is received, depending on the setting. Latency is usually measured in milliseconds (ms). In general, to calculate latency we need:

    • Time to First Token (TTFT): It measures how long it takes for the model to start responding. It’s the time from when you send a prompt to the model until the first generated token is returned.

    • Time Per Output Token (TPOT): It is the average time it takes to generate each output token after the first one is generated. It shows how fast the model generates answers once it starts.

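    So, approximately: Latency = TTFT + (TPOT × number of output tokens). Strictly, TPOT covers only the tokens after the first, so the exact multiplier is the number of output tokens minus one; for long responses the difference is negligible.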

    This metric is used in streaming settings, when the model starts sending tokens as soon as it generates them, one-by-one or in small chunks.

    In non-streaming settings, when the model generates the entire response first and only then sends it back to the user, TTFT = total latency. Total generation time is usually the same as latency in a non-streaming setting.

  • Throughput – the number of successful inferences (or generated tokens) the system can complete per second. It shows how much the system can handle overall. Throughput is often measured in requests per second (RPS) or tokens per second (TPS).

However, there is often a trade-off: optimizing for latency can hurt throughput and vice versa. Therefore, an important task for developers is to balance them effectively, keeping latency as low and throughput as high as the application requires.
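The latency and throughput definitions above can be turned into a short worked calculation. This is a minimal sketch: the function names and all the numbers are hypothetical, chosen only for illustration. It multiplies TPOT by the number of tokens after the first, which follows the TPOT definition and is close to TTFT + TPOT × N for long outputs.

```python
def total_latency_ms(ttft_ms: float, tpot_ms: float, num_output_tokens: int) -> float:
    """Approximate end-to-end latency of one streamed response, in milliseconds."""
    if num_output_tokens < 1:
        raise ValueError("need at least one output token")
    # TTFT covers the first token; TPOT covers each token generated after it.
    return ttft_ms + tpot_ms * (num_output_tokens - 1)


def tokens_per_second(total_output_tokens: int, wall_clock_s: float) -> float:
    """System throughput: tokens produced across all requests per second."""
    return total_output_tokens / wall_clock_s


# Hypothetical measurements for a single request.
latency = total_latency_ms(ttft_ms=200.0, tpot_ms=20.0, num_output_tokens=101)
print(latency)  # 2200.0

# Hypothetical batch: 8 concurrent requests, 101 tokens each, finished in 4 seconds.
throughput = tokens_per_second(total_output_tokens=8 * 101, wall_clock_s=4.0)
print(throughput)  # 202.0
```

Note that the two numbers describe different things: the first is what one user waits for, the second is what the whole system delivers, which is why they can move in opposite directions.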

Other important concepts of inference, which developers also pay attention to, are:

  • Cost per inference – it’s a core concept of inference efficiency. It shows the computational cost (and often $ cost) of running a single inference. Companies running models at scale care a lot about cost per token/request. Overall, cost per inference affects total pricing, latency, and scalability.

  • Scalability is the ability to maintain performance as inference load increases.

  • Accuracy shows how correct or useful the model's output is when given an input. Even though the model is already trained, we should still care about how well it performs during inference on new data.

  • Entropy tells you how "confident" or "uncertain" the model is when generating the next token. High entropy means more randomness, low entropy – more certainty.
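To make the entropy idea concrete, here is a small sketch that converts logits into probabilities with softmax and measures the entropy of the resulting next-token distribution in bits. The helper names and the example logits are assumptions made for illustration, not values from any real model.

```python
import math


def softmax(logits):
    """Turn unnormalized scores into a probability distribution."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; higher means the model is less certain."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical logits over a 4-token vocabulary.
confident = softmax([10.0, 0.0, 0.0, 0.0])  # one token dominates
uncertain = softmax([1.0, 1.0, 1.0, 1.0])   # all tokens equally likely

print(entropy_bits(uncertain))                            # 2.0
print(entropy_bits(confident) < entropy_bits(uncertain))  # True
```

A uniform distribution over four tokens gives the maximum of 2 bits, while a distribution dominated by one token has entropy close to zero.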

This is what we can observe outside the model to estimate the inference process. However, we are here to explore inference in a bit more technical depth, so →

How Inference Works in Transformers

Here’s a simplified step-by-step flow of the full inference process in LLMs:

  1. Tokenizing input: First, raw text from the query is turned into token IDs – numerical representations that the model can understand. These numbers correspond to entries in the model’s vocabulary.

  2. Embedding the tokens: Each token ID is mapped to a dense vector (called an embedding), which encodes semantic meaning. These vectors are retrieved from the embedding matrix. Positional information is incorporated to help the model understand the order of tokens, for example via learned positional embeddings added at this stage or via RoPE applied inside the attention layers. The output of this stage is a sequence of embeddings, one for each token.

  3. Running through the Transformer layers: This is where the core computation begins. The LLM processes the sequence through multiple Transformer layers, which include self-attention and feedforward networks (MLPs), to extract context and meaning. The model applies learned parameters (weights) at each layer to produce predictions.

  4. Generating logits: The model produces unnormalized scores, called logits, for each possible next token in its vocabulary.

  5. Decoding the next token: The next token is selected based on the logits, using different generation strategies, such as sampling, greedy decoding, top-k, or beam search.

  6. Repeating until end or max tokens: The process repeats from step 2, now including the newly generated token as part of the input. It continues until a stop token is generated, a max token limit is reached, or some other stopping condition is met.

  7. Output generation: The final output tokens are converted back into human-readable format for the end-user or system.

These steps are hidden from the user, but each step is important for ensuring the inference is fast, correct, and reliable. But not all inference scenarios are the same.
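The steps above can be sketched as a toy generation loop. Everything here is hypothetical: the five-word vocabulary, the whitespace tokenizer, and the lookup table `NEXT_LOGITS`, which stands in for the embedding and Transformer layers of a real model. The sketch only shows the control flow: tokenize, compute logits, pick a token greedily, repeat until a stop token or a length limit, then detokenize.

```python
VOCAB = ["<eos>", "the", "cat", "sat", "down"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Stand-in for steps 2-4 (embedding, Transformer layers, logits):
# a table from the last token to scores over the vocabulary.
NEXT_LOGITS = {
    "<eos>": [1.0, 0.0, 0.0, 0.0, 0.0],
    "the":   [0.0, 0.1, 3.0, 0.5, 0.2],  # favors "cat"
    "cat":   [0.1, 0.0, 0.2, 2.5, 0.3],  # favors "sat"
    "sat":   [0.2, 0.1, 0.0, 0.3, 2.0],  # favors "down"
    "down":  [4.0, 0.1, 0.0, 0.2, 0.1],  # favors "<eos>"
}


def tokenize(text):
    """Step 1: text to token IDs (whitespace split, toy vocabulary only)."""
    return [TOKEN_ID[word] for word in text.split()]


def detokenize(token_ids):
    """Step 7: token IDs back to text."""
    return " ".join(VOCAB[i] for i in token_ids)


def toy_logits(token_ids):
    """Steps 2-4 collapsed into a lookup on the last token."""
    return NEXT_LOGITS[VOCAB[token_ids[-1]]]


def generate(prompt, max_new_tokens=10):
    prompt_ids = tokenize(prompt)
    generated = []
    for _ in range(max_new_tokens):              # step 6: repeat
        logits = toy_logits(prompt_ids + generated)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # step 5: greedy
        if next_id == TOKEN_ID["<eos>"]:         # stop token ends generation
            break
        generated.append(next_id)
    return detokenize(generated)


print(generate("the"))                     # cat sat down
print(generate("the", max_new_tokens=2))   # cat sat
```

Swapping the greedy `max` for sampling or top-k selection would change only the line marked as step 5; the rest of the loop stays the same.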

Inference in Diffusion Models

In diffusion models, the inference process is different. The initial input is random noise. Prompts are used to guide the generation toward a desired output, such as a specific image. The generation process involves iterative denoising steps, where the model gradually transforms the noise into coherent data. The number of denoising steps can vary, for example from 25 to 1000, depending on the sampling strategy and quality-speed tradeoff. Techniques like DDPM, DDIM, or score-based sampling define how the noise is removed over time. Because of the many forward passes required, inference in diffusion models tends to be slower compared to transformers, which generate outputs more directly.
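The iterative denoising loop can be illustrated with a deterministic DDIM-style update on a tiny vector. This is a sketch under strong assumptions: the linear noise schedule is made up, and `predict_noise` is an oracle that already knows the target, standing in for the trained, prompt-conditioned network a real model would use. It shows the shape of the loop, not a usable sampler.

```python
import math
import random


def ddim_sample(x0_target, num_steps=25, seed=0):
    """Start from Gaussian noise and denoise it in num_steps DDIM-style updates."""
    rng = random.Random(seed)
    # Assumed schedule: alpha_bar rises from almost 0 (pure noise) to 1 (clean data).
    alpha_bars = [0.001 + 0.999 * i / num_steps for i in range(num_steps)] + [1.0]

    def predict_noise(x_t, alpha_bar):
        # Oracle stand-in for the neural network: recovers the exact noise
        # because it is given the target. A real model has to learn this.
        return [
            (x - math.sqrt(alpha_bar) * x0) / math.sqrt(1.0 - alpha_bar)
            for x, x0 in zip(x_t, x0_target)
        ]

    x = [rng.gauss(0.0, 1.0) for _ in x0_target]  # the initial input is random noise
    for step in range(num_steps):                 # one forward pass per step
        a_now, a_next = alpha_bars[step], alpha_bars[step + 1]
        eps = predict_noise(x, a_now)
        # Estimate the clean sample implied by the current noisy one.
        x0_pred = [
            (xi - math.sqrt(1.0 - a_now) * e) / math.sqrt(a_now)
            for xi, e in zip(x, eps)
        ]
        # Move to the next, less noisy level.
        x = [
            math.sqrt(a_next) * x0p + math.sqrt(1.0 - a_next) * e
            for x0p, e in zip(x0_pred, eps)
        ]
    return x


sample = ddim_sample([1.0, -2.0, 0.5])
print([round(v, 6) for v in sample])
```

Each loop iteration corresponds to one call to the model, which is why the number of steps drives the quality-speed tradeoff described above.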

Types of inference

Here we’ll look at several of the most commonly used types of inference. In practice, they are often combined depending on the purpose of the system.

Real-time inference

Systems with real-time inference make predictions on-demand, in response to live data or user actions. They deliver low-latency results, essential for time-sensitive tasks, such as fraud detection or medical alerts. Real-time inference is also used in voice assistants and self-driving cars. It provides immediate feedback and dynamic updates, but requires always-on infrastructure, careful scaling to handle simultaneous users, and high system reliability.

Batch inference

The opposite of real-time inference is batch inference, where large sets of inputs are collected and processed in bulk, usually on a schedule. It is an offline inference type used for tasks like processing millions of records. It maximizes throughput and efficiency, but can’t process queries immediately.

Cloud inference

Cloud inference uses centralized servers, sending data from user devices to the cloud for processing. It allows powerful, scalable computation using high-end GPUs and TPUs, ideal for large or complex models. It provides easy updates and efficient resource sharing. However, there are some drawbacks, such as network latency, high cost and the rise of privacy concerns as data leaves the device. It is often paired with edge inference for optimal performance.

Edge inference

Edge inference runs AI models directly on local devices like phones, IoT sensors, or drones, enabling real-time processing at the source without sending data to the cloud. It is essential for fast and private AI decisions. However, edge devices have limited compute power, and it’s challenging to run large models without optimization on them. Updating models on multiple devices is also complex, and scaling needs more hardware deployment.

All these types can be combined. For example, a system can do quick initial inference on the edge and send data to the cloud for more in-depth analysis, or perform real-time inference in the cloud by deploying servers in many regions.

As you can see from the example of these inference types, inference encounters many limitations. Various optimization techniques are widely used to address them.

Optimization techniques

Here are seven main techniques for making inference faster and more efficient:

  1. Batching: It groups multiple requests into one matrix operation, so that multiple inferences run together as a batch on the hardware. The result is much higher throughput, though it can increase the latency of individual requests.

  2. Quantization: This technique reduces the numerical precision of a model’s weights and operations, for example, from 32-bit floating point (float32) to lower precision like 8-bit integers or float16. This shrinks the model’s size in memory and reduces the computation required, increasing the speed of inference, usually with only a small accuracy loss.

  3. Model distillation: A smaller model is trained to mimic a large one’s predictions and behavior, and then served in its place, which makes inference faster and cheaper.

  4. Speculative decoding: You can also use a smaller model to guess multiple tokens ahead, then verify with the big model. By validating several tokens in fewer steps, speculative decoding speeds up the generation process, sometimes reducing latency by 30% or more.

  5. Pruning: This involves removing parts of the model that are unnecessary or redundant in making predictions, which leads to model compression and faster running.

  6. Hardware acceleration: Using the right powerful hardware, such as Graphics processing units (GPUs), tensor processing units (TPUs), as well as various AI accelerators and NPUs, can significantly boost inference speed.

  7. KV Caching (Key/Value Caching): In transformers, attention layers compute over all previous tokens. During inference, the model caches the keys and values of past tokens so that they are not recomputed at every generation step.
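As one concrete instance from the list above, quantization can be sketched in a few lines. This is a simplified symmetric int8 scheme with a single scale for the whole tensor; the function names and weights are hypothetical, and production toolchains typically use per-channel scales, calibration data, and other refinements not shown here.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical float32 weights.
weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)  # [50, -127, 0, 100]
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(max_error <= scale / 2 + 1e-12)  # True
```

Each weight now fits in one byte instead of four, and the rounding error per weight is bounded by half the scale. The small weight 0.003 collapses to zero, which is the kind of accuracy loss the technique trades for speed and memory.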

These techniques are widely used in current AI systems and have even become traditional methods for achieving faster inference. But what are the actual trends in AI inference?

Since the release of OpenAI o1 and DeepSeek-R1, everyone has started focusing on more "thoughtful" reasoning processes that involve more test-time compute and take more time to generate a final response. It may seem that the community is not as interested in reducing inference time. On the contrary, with advanced reasoning models, the challenge of achieving fast inference has reached the next level. Most advancements in this area are aimed at building more powerful hardware.

For example, Cerebras Systems tripled the inference speed of its CS-2 platform through a software update, reaching 2,100 tokens/second on a 70B-parameter Llama 3.2 model, which is 16× faster than GPUs and 68× faster than typical cloud services. This leap came not from new hardware, but from algorithmic and compiler optimizations that improved parallelism and memory access on its wafer-scale engine. Dubbed “Instant Inference,” this breakthrough enabled real-time applications like voice assistants and chatbots to respond faster than ever before. Early adopters saw up to 60× reductions in latency, proving software innovation alone can deliver massive inference gains.

Another major player, NVIDIA, is also making strides with its hardware breakthroughs. Just recently, in March, its Blackwell GPUs with 4-bit precision, combined with software optimizations like quantization and faster kernels, achieved 36× higher inference throughput and ~32× lower cost per token. The system can serve over 250 tokens/second per user on massive models while maintaining low latency.

Image credit: NVIDIA Blackwell blog

Another important development from NVIDIA is the Dynamo distributed inference framework. It is specifically built for low latency and boosts inference throughput by up to 30× by orchestrating how the inference workload is divided and scheduled across GPUs and nodes. Dynamo separates the prefill phase (processing the prompt through the model) from the decode phase (generating output tokens) and can run them on different GPUs concurrently. It dynamically allocates GPUs to handle fluctuating request volumes and also reuses key/value caches to avoid recomputing them for similar requests.

Image Credit: NVIDIA Dynamo blog

These developments demonstrate how advanced hardware can serve the best models despite their high requirements, and this is not yet the limit of inference efficiency. As models improve, more powerful hardware will be developed.

What Else is Happening in Inference Right Now?

In addition to hardware leaps, some of the biggest breakthroughs are happening in the software layer – changing not just how fast inference can be, but how smart it can become.

Take speculative decoding, for example. Instead of generating one token at a time, this clever trick uses a smaller draft model to guess multiple tokens ahead. The large model then verifies them in one go — often cutting generation time in half. This technique powers faster responses in systems like GPT-4 and even real-time transcription in Whisper.
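The draft-then-verify idea can be sketched with two toy "models". Both are hypothetical functions over integer tokens: `target_next` plays the large model and `draft_next` plays the small one, which is deliberately wrong in one case. The sketch uses the greedy variant, where a drafted token is accepted only if it matches the target's choice; real systems that sample use a probabilistic acceptance rule instead, and verify all drafted positions in one batched forward pass.

```python
def target_next(tokens):
    """Hypothetical large model: greedy next token."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Hypothetical small model: agrees with the target except after token 5."""
    return 0 if tokens[-1] == 5 else (tokens[-1] + 1) % 10


def greedy_decode(prompt, num_new_tokens):
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(prompt, num_new_tokens, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_new_tokens:
        # 1. The draft model guesses k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks every drafted position (counted as one pass).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # replace the first wrong guess
                break
        else:
            # Every guess matched, so the same pass also yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + num_new_tokens], target_passes


output, passes = speculative_decode([0], num_new_tokens=10)
print(output == greedy_decode([0], 10), passes)  # True 3
```

The output is identical to plain greedy decoding with the target model, but it takes 3 verification passes instead of 10 target-model steps. The saving depends entirely on how often the draft model guesses correctly.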

Another important open-source development is vLLM, an inference engine built by researchers at Berkeley. It introduced “PagedAttention”, a way to manage key/value cache memory in small blocks so that less memory is wasted and more requests can be batched dynamically. As a result, it achieves up to 24× higher throughput compared to standard Transformers libraries, even on the same GPU.

What’s exciting is that as inference becomes faster, models can afford to reason more deeply at runtime. Cerebras recently showed that with extreme speedups, a 400B-parameter model could evaluate hundreds of reasoning paths in real time — improving the quality of answers instead of rushing to the first solution.

So while inference optimization has always been about speed, the new era is also about strategic deliberation. Faster systems do both: respond quicker and think better.

Conclusion

Today, we explored a crucial topic that serves as the "face" of all AI systems: AI inference. This is what everyone experiences when interacting with AI. This aspect of model functioning drives progress in hardware and various techniques that can enhance inference effectiveness. We now face a two-sided challenge: reasoning models like DeepSeek R1 are increasing inference time, while researchers continue to find more advanced ways to speed up inference despite these increased demands. This progress is unlikely to stop, because we all still need systems that are as fast as possible.

There is much, much more to cover in inference. Please suggest your topics in the comment section.

Sources and further reading

Resources for further reading:

Turing Post resources
