Large vs Small Language Models: Differences & Inference Cost

Small vs Large Language Models: Key Differences

Today’s Token looks at a big question in AI: should we keep making larger language models (LLMs) or switch to smaller, more efficient ones? We'll explore what each choice means for AI's future.

LLMs are impressive and handle complex tasks well. The recently released Falcon 180B has 180 billion parameters and was trained on 3.5 trillion tokens! But large models need a lot of compute and memory, and they aren't easy to deploy everywhere.

Small Language Models (SLMs), with 10 billion parameters or fewer, can do similar jobs but don't require as much compute. They're easier to manage and cost less. A central theme is model compression, which is essential for converting large models into efficient, smaller versions. Techniques like pruning, quantization, and knowledge distillation are crucial, especially for resource-limited applications.

Finally, we’ll talk about starting with small models from the beginning. This means using less data and focusing on specific tasks, which can be a smarter way to build AI. This article will discuss the pros and cons of both big and small AI models, helping you understand which might be best for different uses. Let’s dive in:

Comparison
Solution – Model Compression (for LLMs):
- Pruning
- Quantization
- Knowledge distillation
- Low-rank factorization for model compression
Don’t compress – start small
List of Small Models
Conclusion

LLM vs SLM: Cost, Performance & Use Cases Compared

LLMs are truly powerful in language processing but have practical challenges due to their large size and high computational requirements. These challenges become especially significant in environments with limited resources. The main costs affecting their development and use include:

Computational resources: Training requires significant processing power, often needing GPUs or TPUs, which leads to high energy costs for prolonged training periods.
Storage: These models and their datasets need a lot of storage space, requiring high-capacity and fast-access storage systems.
Maintenance and updates: They need ongoing maintenance, including regular monitoring, updating for accuracy and relevance, and periodic retraining.
Infrastructure and support: Operational use requires robust infrastructure like servers and networks, along with support systems for deployment and monitoring.
Energy consumption and environmental impact: Their high energy use, particularly during training, leads to higher costs and potential environmental impacts. Efforts to use renewable energy sources also contribute to costs.

Smaller foundation models offer several benefits:

Affordability and accessibility: They are less expensive and require fewer resources, making them more accessible to individuals and small teams.
Adaptability and innovation: These models are easier to modify and fine-tune, allowing for the integration of specific data and adjustments to meet unique needs, promoting innovation.
Data governance and privacy: They can operate on local systems without needing high-end GPUs, giving users more control over their data and enhancing privacy.
Faster development and testing: Their smaller size and simplicity support quick prototyping and experimentation.

Model Compression for LLMs: 4 Key Techniques

As big models face more problems, model compression, the study of making them smaller and more efficient, is gaining importance. To understand how large models can be transformed, let’s examine specific characteristics of the model that can be adjusted or reduced while preserving performance.

These key metrics include:

Number of parameters: Total count of learnable weights in a model. More parameters generally mean greater expressiveness, but also higher demands for computational resources and memory during training and inference.
Model size: The disk space or memory needed to store the entire model, including weights, biases, and other components. This size is influenced by the number of parameters, the data type of parameters, and the model architecture.
Compression ratio: The size of the original model divided by the size of the compressed model, measured while performance is retained. A higher ratio means more efficient compression.
Inference time: How long the model takes to process input data and generate responses.
Floating Point Operations (FLOPs): The number of arithmetic operations involving floating-point numbers that the model performs while processing data. FLOPs help estimate a model’s computational requirements and compare the efficiency of different models or compression techniques.
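To make these metrics concrete, here is a minimal sketch of the arithmetic behind model size and compression ratio. It assumes that weights dominate storage and ignores file-format overhead. The 7-billion-parameter model and the helper names (`model_size_gb`, `compression_ratio`) are hypothetical and chosen only for illustration.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def model_size_gb(num_params: int, dtype: str) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def compression_ratio(original_gb: float, compressed_gb: float) -> float:
    """Original size divided by compressed size; higher means smaller output."""
    return original_gb / compressed_gb


num_params = 7_000_000_000  # hypothetical 7B-parameter model
size_fp32 = model_size_gb(num_params, "float32")
size_int8 = model_size_gb(num_params, "int8")
ratio = compression_ratio(size_fp32, size_int8)

print(f"float32: {size_fp32:.1f} GB, int8: {size_int8:.1f} GB, ratio: {ratio:.1f}x")
```

The same parameter count gives very different footprints depending on the data type, which is why model size is listed separately from the number of parameters.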

In the following exploration of model compression techniques, we want to focus on large language models (LLMs) as the most applicable type of foundation models (FMs).

Model compression for LLMs

A recent paper, “A Survey on Model Compression for Large Language Models,” presents an extensive, up-to-date analysis of model compression techniques.

Source: ‘A Survey on Model Compression for Large Language Models’

Each technique named in the survey is a separate area of research. What are they?

Pruning

Pruning removes the weights or neurons that contribute the least to the model's output. Removing these superfluous elements shrinks the model, which makes it cheaper to store and improves memory and computational efficiency. Additional resources:

- Optimal Brain Damage (OBD) is one of the early techniques for reducing the size of a learning network by selectively deleting weights (no actual brain damage happens!)

Pruning is categorized into two types:

Unstructured Pruning involves removing individual parameters, leading to a network with a sparse and irregular structure.

Structured Pruning, on the other hand, involves removing connections or hierarchical structures according to specific rules while maintaining the overall network structure. Additional resources:
- LLM-Pruner: On the Structural Pruning of Large Language Models
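The difference between the two pruning types is easier to see on a single weight matrix. The sketch below is an illustration under simple assumptions, not the method of OBD or LLM-Pruner: it uses weight magnitude as the importance score, and the function names (`magnitude_prune`, `prune_rows`) are hypothetical.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured pruning: zero the `sparsity` fraction of smallest-|w| entries."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value becomes the cut-off.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)


def prune_rows(weights: np.ndarray, keep: int) -> np.ndarray:
    """Structured pruning: keep only the `keep` rows with the largest L2 norm."""
    norms = np.linalg.norm(weights, axis=1)
    kept = np.sort(np.argsort(norms)[-keep:])  # preserve original row order
    return weights[kept]


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))

sparse_W = magnitude_prune(W, sparsity=0.5)  # same shape, half the entries zeroed
small_W = prune_rows(W, keep=4)              # smaller dense matrix

print(sparse_W.shape, int((sparse_W == 0).sum()))
print(small_W.shape)
```

Unstructured pruning keeps the matrix shape and leaves scattered zeros, so it only pays off with sparse storage or kernels. Structured pruning returns a smaller dense matrix that ordinary hardware can use directly.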

Quantization

Quantization is a technique used to reduce the size and increase the efficiency of deep learning models. It converts the model's parameters, which are usually stored in a high-precision format (like 32-bit floating point), into a lower-precision format (like 16-bit floating point or 8-bit integers). This change reduces the amount of memory the model needs and can speed up its computations on hardware that supports the lower precision.

There are two main approaches to quantization:

Quantization-Aware Training (QAT): In this approach, quantization is part of the model's training process, so the model learns to operate with lower precision while it is being trained. Advanced methods in QAT, like LLM-QAT, focus on significantly reducing the size of large models while maintaining their effectiveness. Other techniques like PEQA and QLoRA also aim to fine-tune models efficiently using less memory.
Post-Training Quantization (PTQ): This approach is applied after the model has been fully trained. It involves reducing the precision of the model's parameters to decrease its size. Although simpler than QAT, PTQ might slightly reduce the model's accuracy because the model wasn’t originally trained to work with lower precision. Techniques like LUT-GEMM and LLM.int8() are examples of PTQ, focusing on optimizing the model's weight parameters for better performance.

Different levels of quantization are used depending on the needs:

8-bit Quantization: This is a commonly used level that offers a good balance between reducing size and maintaining performance.
Lower-bit Quantization: This involves reducing the precision to less than 8 bits. It's more challenging but can make the models even smaller and faster. Methods like LLM-QAT and PEQA are examples of this approach.
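To show what reducing precision means in practice, here is a minimal sketch of post-training, round-to-nearest quantization with a single scale per tensor. It is a simplified illustration and not the algorithm used by LLM.int8(), LUT-GEMM, or LLM-QAT. The function names are hypothetical, the weights are assumed to be non-zero, and 4-bit values are stored in an int8 container here rather than packed two per byte.

```python
import numpy as np


def quantize(weights: np.ndarray, bits: int):
    """Symmetric round-to-nearest quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    scale = float(np.abs(weights).max()) / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the integers back to approximate floating-point weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

mean_abs_error = {}
for bits in (8, 4):
    q, scale = quantize(w, bits)
    mean_abs_error[bits] = float(np.abs(w - dequantize(q, scale)).mean())

print(mean_abs_error)
```

Fewer bits mean a coarser grid of representable values, so the reconstruction error grows as the bit width drops. That trade-off is why going below 8 bits is described as more challenging.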

Knowledge Distillation

Knowledge Distillation (KD) in ML involves transferring knowledge from a complex model (teacher) to a simpler one (student). These methods can be categorized into two subsets:

White-box KD: the student not only gets the teacher's predictions but also insights into its parameters. This deeper understanding leads to better performance. Additional resources:
- MiniLLM focuses on minimizing reverse Kullback-Leibler divergence (KLD) to refine the quality of generated samples
- Generalized Knowledge Distillation (GKD) addresses distribution mismatches and model under-specification by optimizing alternative divergences like reverse KL
- TF-LLMD uses a truncated model with a subset of layers from a larger model for initialization and training on pretraining data using a language modeling objective
Black-box KD: only the predictions of the teacher LLM are available for distillation.
- In-Context Learning (ICL) Distillation: This method leverages the ability of LLMs to learn from a few examples provided in the prompt itself. ICL distillation is about transferring this capability to SLMs. Additional resources:
  - A Survey on In-context Learning
  - Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning
  - Applying Meta In-context Tuning (Meta-ICT) and Multitask In-context Tuning (Multitask-ICT): In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models
- Chain-of-Thought (CoT) Distillation: CoT involves embedding intermediate reasoning steps into the prompts. This method is particularly important for training models that not only solve problems but also provide understandable rationales behind their solutions. Additional resources:
  - MT-COT: Explanations from Large Language Models Make Small Reasoners Better
  - CoT Prompting: Teaching Small Language Models to Reason
  - Fine-tune-CoT: Large Language Models Are Reasoning Teachers
  - SSLM: Specializing Smaller Language Models towards Multi-Step Reasoning
  - SCOTT: Self-Consistent Chain-of-Thought Distillation
  - Distilling Step-by-Step: Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
  - SOCRATIC CoT: Distilling Reasoning Capabilities into Smaller Language Models
  - PaD: Program-aided Distillation Specializes Large Models in Reasoning
  - LMTWA: Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization
- Instruction Following (IF) Distillation: IF is focused on enhancing the ability of models to comprehend and execute tasks based solely on instructions. This is crucial for creating models that are not just powerful but also user-friendly and adaptable to a variety of real-world scenarios.
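Several of the white-box methods above are described in terms of the divergence between teacher and student output distributions. The sketch below computes both directions of the KL divergence on toy logits so the distinction is visible. It is an illustration only: the logits, the temperature value, and the helper names are hypothetical, and it does not reproduce the training procedure of MiniLLM or GKD.

```python
import numpy as np


def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q), averaged over the batch dimension."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))


# Toy logits over a 3-token vocabulary for a batch of 2 positions.
teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student_logits = np.array([[2.0, 1.5, 0.5], [0.5, 1.0, 0.3]])

T = 2.0  # softening temperature (an assumed value)
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)

forward_kl = kl(p_teacher, p_student)  # KL(teacher || student)
reverse_kl = kl(p_student, p_teacher)  # KL(student || teacher), the "reverse" direction

print(f"forward KL: {forward_kl:.4f}, reverse KL: {reverse_kl:.4f}")
```

Either quantity can serve as a distillation loss for the student. They differ in which mismatches they penalize most, which is why the choice of divergence is a research topic of its own.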

Low-Rank Factorization

Low-rank factorization is a technique used to compress models by breaking down a large weight matrix into smaller, more manageable matrices whose product approximates the original. This method reduces the number of parameters and lessens computational demands. The same idea also helps fine-tune LLMs more efficiently. Additional resources:

Low-rank factorization for model compression:
- Language model compression with weighted low-rank factorization
- TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition
Low-rank factorization for efficient LLM fine-tuning:
And more. We’ll explore LoRA in more detail in upcoming editions.
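Here is a minimal sketch of the idea, using a truncated SVD to replace one weight matrix with two thin factors. This is the plain, unweighted version and not the weighted factorization or tensor-train method from the papers listed above. The matrix sizes, the rank, and the function name `low_rank_factorize` are assumptions for illustration.

```python
import numpy as np


def low_rank_factorize(weight: np.ndarray, rank: int):
    """Approximate `weight` (m x n) as A @ B with A (m x rank) and B (rank x n)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # fold the singular values into the left factor
    b = vt[:rank]
    return a, b


rng = np.random.default_rng(0)
# Synthetic weight matrix that is close to rank 8, plus a little noise.
W = rng.normal(size=(512, 8)) @ rng.normal(size=(8, 256))
W = W + 0.01 * rng.normal(size=(512, 256))

A, B = low_rank_factorize(W, rank=8)

orig_params = W.size             # 512 * 256
fact_params = A.size + B.size    # 512 * 8 + 8 * 256
rel_err = float(np.linalg.norm(W - A @ B) / np.linalg.norm(W))

print(orig_params, fact_params, rel_err)
```

The savings depend on how close the real weight matrix is to low rank. The synthetic matrix here was built to be nearly rank 8, so the approximation is unusually good; trained weights usually compress less cleanly.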

Model compression is a crucial area of research, especially for deploying LLMs in resource-constrained environments or for applications that require real-time responses. The goal is to find a balance between model size and performance, ensuring that the compressed model is still effective for its intended tasks.

Start Small: Building Efficient Models from Scratch

Another way to achieve smaller models with the capabilities of the larger ones is to start small. This approach is proposed in "Techniques to Make Large Language Models Smaller: An Explainer," by the Center for Security and Emerging Technology, as an alternative to model compression.

This strategy focuses on understanding and addressing the inefficiencies in training large models. The key points are:

Design models with fewer parameters: Create effective models with significantly fewer parameters. These models can run on just one high-end GPU and typically have under 7 billion parameters. Mistral's Ministral 3B and 8B are a practical example of this philosophy; see Inside Les Ministraux for a full breakdown.
Curate quality datasets: Instead of relying on large datasets of uncleaned internet data, make an effort to prepare high-quality data. Although it takes more effort to obtain, working with smaller models means you'll need less of this curated data.
Focus on specific tasks or domains: Large language models excel in a variety of domains and tasks. However, for real-life applications, a specialist model is often more useful than a generalist one. Specialist models focus on a specific task or domain, requiring fewer parameters since they only need to be experts in one area. This approach is akin to mastering one trade rather than being a jack of all trades.
Efficient fine-tuning: Although fine-tuning usually requires less computational power than initial training, it remains a resource-intensive process. When fine-tuning smaller models, it's beneficial to use efficient techniques, such as updating only the most crucial parts of the model while keeping the rest unchanged.
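The last point, updating only part of the model, can be sketched with a toy two-layer network in which the first layer is frozen and only a small head is trained. Everything here is a hypothetical stand-in: the "pretrained" layer is random, the data is synthetic, and which parts of a real model to update is a design decision this sketch does not settle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained layer: it is used for features but never updated.
W_frozen = rng.normal(size=(16, 32)) / 4.0
# Small trainable head: the only parameters that gradient descent touches.
w_head = np.zeros(32)

# Synthetic fine-tuning data.
X = rng.normal(size=(200, 16))
y = np.tanh(X @ W_frozen) @ rng.normal(size=32)


def loss_and_grad(head: np.ndarray):
    """Mean squared error and its gradient with respect to the head only."""
    h = np.tanh(X @ W_frozen)  # features from the frozen layer
    err = h @ head - y
    loss = float(np.mean(err ** 2))
    grad = 2.0 * h.T @ err / len(y)
    return loss, grad


frozen_before = W_frozen.copy()
initial_loss, _ = loss_and_grad(w_head)

for _ in range(500):
    _, grad = loss_and_grad(w_head)
    w_head -= 0.02 * grad  # update the head; W_frozen is left untouched

final_loss, _ = loss_and_grad(w_head)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Only 32 of the 544 parameters receive gradients in this toy setup, which is the source of the savings: less memory for gradients and optimizer state, and fewer values to store per fine-tuned variant.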

Small Language Models List: Examples & Resources

You can use the Hugging Face Open LLM Leaderboard to find the latest small open-source language models. Filter the models by number of parameters and sort them by performance. There are so many of them!

SLM vs LLM: Which Should You Choose?

As we've navigated the complexities of large language models (LLMs) and small language models (SLMs), one thing is clear: the future of AI isn't just about size, but about smart, efficient choices. Yes, annual hardware advancements could further accelerate the growth of large models, revealing yet unknown properties of LLMs. But optimization methods are making models more efficient without running into the diminishing returns of LLM scaling. Whether you're captivated by the vast capabilities of LLMs or the agility and resourcefulness of SLMs, understanding the nuances of each is crucial for harnessing their full potential.

Our deep dive into model compression techniques for LLMs and the strategic approach of starting with smaller models offers a roadmap for those looking to optimize AI for various applications. From the intricate processes of pruning and quantization to the innovative realms of knowledge distillation and low-rank factorization, we've explored paths to make large models more manageable and small models more powerful.
