
Token 1.19: How to optimize LLM Inference

Tools, Techniques, and Hardware Solutions


If you work in machine learning (ML), you know that ML models, whether conventional models or Foundation Models (FMs)/Large Language Models (LLMs), present tricky deployment challenges. Training and deploying these models is an iterative process*. That iteration, combined with the distinct constraints of real-time and batch inference, makes ML inference challenging.

The addition of LLMs to the ecosystem of ML models has made things even more complicated and introduced new challenges.

*See Token 1.17: Deploying ML Model: Best practices feat. LLMs, where we discussed deployment in detail.

In today’s Token, we will cover:

  • New challenges introduced by FMs/LLMs.

  • How is model serialization different for LLMs?

  • I have powerful GPUs. Will that be sufficient to improve inference?

  • So, GPU + TensorRT is all I need?

  • AI accelerators

  • Conclusion

This article will be valuable for both seasoned ML practitioners and non-specialists. Experts will learn how inference for FMs differs from inference for conventional ML models and how to optimize it, while non-specialists will get clear explanations of how these technologies work and why they matter in the ML landscape.

New challenges introduced by FMs/LLMs

High Computational Cost

LLMs are far more complex than conventional ML models. The table below shows the number of trainable parameters in some widely used foundation models.




Model: Trainable parameters
[model name missing]: 340 M
[model name missing]: 1.5 B
[model name missing]: 1.8 T
Stable Diffusion: 983 M

Compared with traditional Convolutional Neural Network (CNN) models such as AlexNet (62M parameters), ResNet-18 (11M parameters), and Inception V3 (23M parameters), LLMs have more complex architectures and are trained on far larger volumes of data. More parameters mean more weights to adjust during training, which requires more memory and processing power. This holds for CNNs on image tasks and for LLMs on language tasks alike. Because of their enormous size, however, LLMs entail much higher computational costs for both training and inference, and often require specialized hardware such as GPUs or TPUs along with substantial electrical power (check this deep dive into AI chips). This increases the cost and complexity of deploying these models for real-time applications, which makes it a significant practical consideration.

Memory Constraints

LLMs require a large amount of memory to run. For instance, Falcon 7B, one of the smaller LLMs, requires at least 16 GB of RAM to run comfortably. In comparison, conventional ML models such as SVMs, decision trees, and some neural networks can run in less than 500 MB of RAM.

ResNet, one of the most advanced CNNs, can be loaded using less than 6 GB of RAM. In contrast, models like GPT-3 and BLOOM require 350 GB+ of memory.
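
The memory figures above are consistent with a simple rule of thumb: the memory needed just to hold the weights is roughly the number of parameters multiplied by the bytes used per parameter. The sketch below is a rough illustration rather than a precise sizing tool. It assumes 16-bit weights by default, uses approximate parameter counts (about 7B for Falcon 7B, 175B for GPT-3, 176B for BLOOM), and ignores activations, caches, and framework overhead, which is why real requirements are higher. The function name `weight_memory_gb` is our own.

```python
# Rough estimate of the memory needed to hold model weights only.
# Assumption: every parameter is stored at the same precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Return the approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Approximate parameter counts (assumed for illustration).
models = {"Falcon 7B": 7e9, "GPT-3": 175e9, "BLOOM": 176e9}

for name, num_params in models.items():
    print(f"{name}: ~{weight_memory_gb(num_params):.0f} GB of weights in fp16")
```

Under these assumptions, Falcon 7B needs about 14 GB for its weights alone, which fits the "at least 16 GB" guidance once overhead is added, while GPT-3 and BLOOM land at roughly 350 GB.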

There is a bright side though: while memory constraints present challenges for LLM inference, ongoing advancements in optimization techniques, hardware, and cloud computing are helping to mitigate these issues. Understanding and leveraging these developments is crucial for effectively deploying and utilizing LLMs and other memory-intensive models.

Immature Tooling

LLMs are comparatively new, so it is no surprise that the ecosystem of tools built around LLMs is still immature (but growing fast!). While there are tools that ease the process of deployment and inference, any issues that are peculiar to an organization or a specific model will require a custom solution.

There are already some tools and techniques that can ease the process of running inference when using LLMs. Let’s start with model serialization.

With a conventional model trained using libraries like scikit-learn, TensorFlow, or PyTorch, you could save the model as a pickle file* and load it later for inference. To streamline the process in a team where engineers use a variety of libraries, you could use tools like ONNX to serialize the model so that the resulting artifact is independent of the library that produced it.

*A pickle file is a way to save Python objects, like lists or models, so you can load and use them later.
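
As a minimal sketch of the pickle workflow described above, the example below saves a stand-in "model" to a pickle file and loads it back for inference. To stay self-contained it uses only the Python standard library: the model is a hypothetical dictionary of learned parameters, and `predict` is an illustrative helper rather than part of any ML library. A real scikit-learn or PyTorch model would be saved in the same spirit, typically through that library's own helpers.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical "trained model": learned parameters of y = weight * x + bias.
model = {"weight": 2.0, "bias": 1.0}


def predict(params: dict, x: float) -> float:
    """Run inference with the stored parameters."""
    return params["weight"] * x + params["bias"]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model.pkl"

    # Serialize: write the Python object to a pickle file.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Deserialize: load the object back, e.g. in a separate inference service.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(predict(restored, 3.0))  # 7.0
```

One caveat worth keeping in mind: unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources.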

How is model serialization different for LLMs?

