Introduction
If you work in machine learning (ML), you know that ML models, whether conventional models or Foundation Models (FMs)/Large Language Models (LLMs), present tricky deployment challenges. Training and deploying these models is an iterative process*. That iterative nature, combined with the fact that real-time and batch inference each come with their own constraints, makes ML inference challenging. New to LLM inference? Start with: What is LLM Inference?
The addition of LLMs to the ecosystem of ML models has made things even more complicated and introduced new challenges.
*check Token 1.17: Deploying ML Model: Best practices feat. LLMs, where we discussed deployment in detail.
In today’s Token, we will cover:
New challenges introduced by FMs/LLMs.
How is model serialization different for LLMs?
I have powerful GPUs. Will that be sufficient to improve inference?
So, GPU + TensorRT is all I need?
AI accelerators
Conclusion
This article will be valuable for both seasoned ML practitioners and non-specialists. Experts will appreciate learning how conventional ML inference and FM inference differ and how each can be optimized, while non-specialists will benefit from clear explanations of how these technologies work and why they matter in the ML landscape.
LLM Inference Challenges: Memory, Compute, and Latency
High Computational Cost
LLMs are extremely sophisticated models compared to conventional ML models. The table below shows the number of trainable parameters in some widely used LLMs.
Model | Parameters |
BERT (Large) | 340 M |
GPT-3 | 175 B |
GPT-4 | ~1.8 T (reported estimate; not officially disclosed) |
Stable Diffusion | 983 M |
Compared to more traditional Convolutional Neural Networks (CNNs) like AlexNet (62M parameters), ResNet-18 (11M parameters), and Inception V3 (23M parameters), LLMs are more complex in terms of their architecture and the volume of data they are trained on. More parameters mean the model has more weights to adjust during training, which requires more memory and processing power. This is true both for CNNs on image tasks and for LLMs on language tasks. But LLMs, due to their enormous size, entail higher computational costs for training and inference, and often require specialized hardware like GPUs or TPUs, along with substantial electrical power, to run efficiently (see this deep dive into AI chips). This increases the cost and complexity of deploying these models for real-time applications, making it a significant consideration for practical use.
Memory Constraints
LLMs require extensive memory to run. For instance, Falcon 7B, one of the smaller LLMs, requires at least 16 GB of RAM to run comfortably. In comparison, conventional ML models such as SVMs, decision trees, and some neural networks can run in less than 500 MB of RAM.
ResNet, a widely used CNN, can be loaded using less than 6 GB of RAM. In contrast, models like GPT-3 and BLOOM require 350 GB or more of memory.
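Where do figures like these come from? A rough lower bound is simply the parameter count multiplied by the bytes used to store each parameter. The sketch below is a back-of-the-envelope estimate, not a measurement: the function name and the precision table are our own illustration, and the result covers the weights only (activations, the attention cache, and framework overhead come on top).

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str = "fp16") -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Two illustrative sizes: a 7-billion-parameter model (such as Falcon 7B)
# and a 175-billion-parameter model.
for name, params in [("7B model", 7_000_000_000), ("175B model", 175_000_000_000)]:
    for dtype in ("fp32", "fp16", "int8"):
        print(f"{name} in {dtype}: ~{weight_memory_gb(params, dtype):.0f} GB")
```

At 16-bit precision, a 7B model needs about 14 GB for its weights, which is why 16 GB is a comfortable minimum, and a 175B model needs about 350 GB, in line with the figure quoted above for GPT-3 and BLOOM.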
There is a bright side though: while memory constraints present challenges for LLM inference, ongoing advancements in optimization techniques, hardware, and cloud computing are helping to mitigate these issues. Understanding and leveraging these developments is crucial for effectively deploying and utilizing LLMs and other memory-intensive models.
Immature Tooling
LLMs are comparatively new, so it is no surprise that the ecosystem of tools built around LLMs is still immature (but growing fast!). While there are tools that ease the process of deployment and inference, any issues that are peculiar to an organization or a specific model will require a custom solution.
There are already some tools and techniques that can ease the process of running inference when using LLMs. Let’s start with model serialization.
With a conventional model trained using libraries like scikit-learn, TensorFlow, or PyTorch, you could generate a pickle file* and use it for inference. To streamline the process in a team where engineers use a variety of libraries, you could use a format like ONNX to serialize the model so that it is independent of the library that produced the model artifact.
*A pickle file is a way to save Python objects, like lists or models, so you can load and use them later.
How is model serialization different for LLMs?
LLMs are new and hence, as discussed before, there are not many tools for LLM serialization. Python’s pickle is already problematic: it is insecure (loading a pickle can execute arbitrary code), slow, and not human-readable.
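To make the pickle workflow and its security problem concrete, here is a minimal sketch that uses only the Python standard library. The dictionary standing in for a model and the Payload class are hypothetical, made up for illustration; the point is that a pickle stream can name a callable that runs the moment the file is loaded.

```python
import pickle

# A stand-in for a small trained model: just a dict of learned parameters.
model = {"weights": [0.4, -1.2, 3.0], "bias": 0.5}

blob = pickle.dumps(model)      # serialize to bytes (what a .pkl file contains)
restored = pickle.loads(blob)   # deserialize back into a Python object
print(restored == model)


class Payload:
    """Hypothetical object whose pickle names a callable to run at load time."""

    def __reduce__(self):
        # On load, pickle calls eval("2 + 2") and returns whatever it produces.
        return (eval, ("2 + 2",))


# Loading does not give back a Payload: it executes the embedded call instead.
result = pickle.loads(pickle.dumps(Payload()))
print(result)
```

Here the embedded expression is harmless, but it could just as easily be a call that deletes files or opens a network connection, which is why pickles from untrusted sources should never be loaded.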
ONNX Runtime supports only a limited set of model architectures, such as LLaMA, GPT Neo, BLOOM, BERT, GPT-2, T5, Stable Diffusion, and Whisper.
The good news is that the top 30 most popular model architectures on Hugging Face are all supported by ONNX Runtime. These supported models can also be easily deployed to the cloud through Azure Machine Learning.
In addition, Amazon SageMaker lets users deploy popular models for text and image generation in a couple of steps.
GPU vs TensorRT for LLM Inference Optimization
When it comes to inference time, GPUs are generally faster than CPUs. That is, the same model deployed on a GPU (without any optimization such as pruning or quantization, both of which we covered in Token 1.10) will typically be faster than when it is deployed on a CPU. But as an engineer, you shouldn’t stop until you have squeezed all the performance out of your model and hardware.
This is where TensorRT comes in. It is a runtime optimization tool specifically designed for deep learning models. It focuses on running a trained network quickly and efficiently on NVIDIA hardware. TensorRT takes a trained network, which consists of a network definition (graph) and a set of trained parameters, and produces a highly optimized runtime engine that performs inference for that network. TensorRT uses techniques like reduced-precision quantization, layer and tensor fusion, and kernel auto-tuning, and can take advantage of sparsity in pruned weights, to improve resource utilization and inference speed.
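Quantization is the easiest of these techniques to show in a few lines. The sketch below is a toy, pure-Python illustration of symmetric 8-bit quantization, not TensorRT's implementation: the function names and the sample weights are made up, and real toolkits calibrate scales per layer or per channel using representative data.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Divide by the scale, round to the nearest integer, and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [v * scale for v in q]


weights = [0.02, -0.75, 0.31, 1.27, -1.10]   # made-up example values
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
max_error = max(abs(a - w) for a, w in zip(approx, weights))
print(q, scale, max_error)
```

Each weight now occupies one byte instead of four (fp32) or two (fp16), at the cost of a rounding error of at most half the scale per weight. That trade of a little accuracy for less memory traffic and faster integer arithmetic is what a quantizing runtime exploits.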
It is important to note that TensorRT has separate integrations for PyTorch (Torch-TensorRT), TensorFlow (TensorFlow-TensorRT), and Large Language Models (TensorRT-LLM). TensorRT-LLM provides a Python API to define LLMs and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. By using TensorRT-LLM, LLMs can benefit from faster inference, a reduced memory footprint, and improved overall performance. This blog demonstrates how users can optimize inference on LLMs using NVIDIA TensorRT.
Overall, TensorRT plays a crucial role in optimizing LLM inference, enabling smoother deployment and improved performance for these complex models. For a broader look at how the inference hardware landscape is evolving — from NVIDIA Vera Rubin to MatX and Taalas — read our deep dive on the Inference Chip Wars.
So, GPU + TensorRT is all I need?
Actually, you can be even more efficient by choosing the right hardware. ML inference can be done either in real time or in batch.
For real-time inference, if the model can easily run on a CPU, it is often better to use the CPU because it reduces operating cost.
For batch inference, CPUs are best avoided: they are inefficient when you need high parallelism and memory bandwidth. GPUs and TPUs, on the other hand, are built for large matrix operations and hence are ideal for batch inference and heavy models like modern-day LLMs. The graph below compares runtime (training) for CPU, GPU, and TPU for an LSTM network*.
*LSTM, or Long Short-Term Memory, is a type of Recurrent Neural Network (RNN) designed to remember information for long periods, making it ideal for processing sequences, like language or time series data.
Image Source: IEEE Xplore
Even though the experiments were performed using an LSTM (for training), you can expect a similar trend with other models, including LLMs, for inference on a batch workload.
The graph above suggests that for larger batch sizes (parallel processing), TPUs and GPUs are more efficient. In the image, you can see that for a set number of nodes, as the batch size increases, the performance of the CPU decreases while that of the GPU and TPU increases, with the TPU being the best choice for larger batch sizes.
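One way to see why larger batches favor GPUs and TPUs is to count how much arithmetic a layer performs per byte of data it moves. The sketch below does this for a single linear layer. It is an idealized model under our own assumptions (hypothetical 4096 x 4096 layer, 16-bit values, every tensor moved exactly once, caches ignored), not a benchmark.

```python
def arithmetic_intensity(batch_size, d_in=4096, d_out=4096, bytes_per_value=2):
    """FLOPs per byte moved for y = x @ W, with x of shape (batch_size, d_in)."""
    # One multiply and one add per weight, for every row in the batch.
    flops = 2 * batch_size * d_in * d_out
    # The weight matrix is read once, whatever the batch size.
    weight_bytes = d_in * d_out * bytes_per_value
    # Read the inputs and write the outputs.
    activation_bytes = batch_size * (d_in + d_out) * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


for b in (1, 8, 64, 512):
    print(f"batch={b:>3}: {arithmetic_intensity(b):.1f} FLOPs per byte")
```

With a batch of one, the layer does roughly one operation per byte, so the hardware spends most of its time waiting on memory. As the batch grows, the same weights are reused across many inputs, the ratio climbs, and the massively parallel compute units of a GPU or TPU can finally be kept busy.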
AI accelerators
Operating GPUs and TPUs yourself might be infeasible if you are an individual or a small startup. Furthermore, you may lack the staff to maintain this hardware. In such cases, you can opt for cloud providers that offer such accelerators. (Check this lecture from Cornell University about AI accelerators.) The best-known examples are AWS Trainium and AWS Inferentia.
AWS Trainium is an ML accelerator built and optimized by AWS for training models with over 100 billion parameters. This is ideal for training large language models.
Similarly, AWS Inferentia is designed by AWS to deliver the highest performance-to-cost ratio for deep learning inference applications. Alexa, Amazon's voice assistant, is deployed using Inferentia. This page provides estimated costs for deploying common language models for text and image generation.
Here you can find a list of AWS alternatives, including platforms and providers such as Run:ai, DataRobot, Together AI, NVIDIA offerings, and others.
Finally, there are a few other tools that might ease the process of ML training and inference with LLMs. They are:
Hugging Face Inference Endpoints: The service allows users to deploy Hugging Face models (Transformers, Diffusers, etc.) in just a few clicks without having to go through the hassle of setting up infrastructure and containerizing the application.
Amazon SageMaker JumpStart: It is an ML hub that helps you deploy ML models on AWS infrastructure in a few clicks. It provides access to numerous proprietary and publicly available models. Proprietary models include models from providers like AI21, Cohere, etc. Public models include ones available through Hugging Face (Stable Diffusion, Falcon, etc.), Meta's Llama 2, and open reproductions such as OpenLLaMA.
Conclusion
For LLMs, inference can be cumbersome. The high computational cost, large memory requirements, and nascent state of tooling in the LLM ecosystem make it challenging to deploy LLMs efficiently. However, companies like NVIDIA are investing in tooling such as TensorRT-LLM, which helps ML engineers run LLMs efficiently on GPUs. This is complemented by specialized libraries and frameworks designed to streamline LLM deployment and improve compatibility and performance across computing environments. When batch inference requires a high level of parallelism, TPUs can be a go-to solution. And for those who do not want to own and maintain such hardware, managed cloud offerings (AI accelerators) are a viable option.
The infrastructure for foundation models/LLMs is evolving rapidly, and soon, we will witness the emergence of more tools and techniques for optimizing Large Language Model inference, further easing the deployment and operational challenges associated with these powerful models.