LoRA (Low-Rank Adaptation) for LLMs: How It Works

What is LoRA (Low-Rank Adaptation)?

As the use of Large Language Models (LLMs) intensifies across various domains, fine-tuning has garnered significant attention. For billion-parameter models in particular, fine-tuning is often seen as a resource-intensive endeavor. This has led to a focus on optimization methodologies, one of which is Low-Rank Adaptation (LoRA).

And today we are going to discuss it in detail. However, to fully appreciate its significance, it's crucial first to understand the necessity of fine-tuning in the LLM landscape and where it stands in comparison to other adaptation techniques.

Let’s dive in:

Comparing LLM adaptation techniques: Identifying the necessity of fine-tuning
Key scenarios where fine-tuning is indispensable
Intuition behind LoRA
How LoRA works
The benefits of LoRA

Comparing LLM adaptation techniques

Adapting LLMs to your specific needs is key to making the most out of these powerful tools in your business. Here are some straightforward ways to do this:

Prompt Engineering: Designing specific input prompts that guide the model to apply its general knowledge in a way that's relevant to the task. Simply put, you need to phrase questions or requests in a way that effectively communicates what you want the AI to do or the type of information you want it to provide.
- Few-shot or Zero-shot Learning: These techniques involve providing the model with a few or no examples of the specific task, relying on its pre-trained knowledge to infer the correct approach.
- Chain-of-Thought (CoT) Prompting: This technique extends few-shot prompting by including, in each example, the intermediate reasoning steps that lead to the solution, not just the problem and its answer. (Check “How to distinguish all the CoT-inspired concepts and use them for your projects”)
- Other prompting techniques.
Retrieval-Augmented Generation (RAG): An architecture that retrieves relevant documents from an external data store at query time and passes them to the LLM as context, so custom data can be added or updated at will without retraining the model. (Check “What is Retrieval-Augmented Generation (RAG)?”)
Fine-Tuning: Involves additional training on a smaller, domain-specific dataset. This method adjusts the weights of the model to better align with the specific requirements of the task.
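
To make the difference between the prompting variants above concrete, here is a minimal sketch that assembles zero-shot, few-shot, and chain-of-thought prompts as plain strings. The function name `build_prompt`, the "Q:/A:" layout, and the example problems are all hypothetical choices made for illustration; real prompt formats depend on the model you use.

```python
def build_prompt(question, examples=(), with_reasoning=False):
    """Assemble a zero-shot, few-shot, or chain-of-thought prompt as plain text."""
    parts = []
    for ex in examples:
        parts.append(f"Q: {ex['question']}")
        if with_reasoning:
            # Chain-of-thought: show the intermediate steps before the answer.
            parts.append(f"A: {ex['reasoning']} The answer is {ex['answer']}.")
        else:
            # Plain few-shot: show only the final answer.
            parts.append(f"A: {ex['answer']}")
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)


# Hypothetical worked example used as the in-context demonstration.
examples = [{
    "question": "A shelf holds 3 boxes with 4 books each. How many books are there?",
    "reasoning": "Each box has 4 books and there are 3 boxes, so 3 * 4 = 12.",
    "answer": "12",
}]
question = "A crate holds 5 bags with 6 apples each. How many apples are there?"

zero_shot = build_prompt(question)                           # no examples
few_shot = build_prompt(question, examples)                  # examples, answers only
cot = build_prompt(question, examples, with_reasoning=True)  # examples with reasoning
```

None of these variants changes the model's weights; they only change the text the model sees, which is why they are cheaper than fine-tuning.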

We've explored prompting techniques such as chain-of-thought, as well as RAG, and found them more straightforward to implement than fine-tuning. Prompting techniques are relatively simple, requiring at most a few examples to guide the model. RAG, while also less demanding, offers the benefit of integrating domain-specific data without the need to retrain the model. However, fine-tuning, despite its higher cost in terms of computational power, memory requirements, time, and expertise, is sometimes the only viable option. This is particularly true when the level of customization and accuracy needed goes beyond what prompt engineering and RAG can provide.

Source: Turing Post

When Fine-Tuning Is the Only Option: Key Scenarios

Highly specialized domain knowledge: Tasks demanding an in-depth understanding of specific fields, such as advanced medical research, complex legal cases, or technical engineering, require fine-tuning to ensure accurate content generation.
Custom vocabulary or jargon: In areas with specialized terminology, like certain scientific fields or niche technologies, fine-tuning helps the model correctly interpret and use this unique language.
Unique styles or formats: When a specific writing style or format is needed, such as for legal documents, academic papers, or particular literary styles, fine-tuning trains the model to meet these exact requirements.
Maintaining consistency with legacy data: Fine-tuning aligns the model with historical data or legacy systems, crucial for businesses needing consistent decision-making or analysis.
Highly regulated industries: In sectors where accuracy and regulatory compliance are essential, such as finance, healthcare, or law, fine-tuning helps keep the model's outputs within strict standards.
Sensitive or confidential data: Fine-tuning in a controlled environment is vital for tasks involving secure, private data, maintaining the necessary level of data security.
Custom problem-solving or decision-making logic: For tasks requiring specific problem-solving or decision-making processes, especially in technical or scientific fields, fine-tuning incorporates this unique logic into the model.

Given these scenarios, fine-tuning stands out for its ability to closely tailor a model to specific, often intricate requirements that broader methods can't adequately address.

So, the question arises: is there a way to make fine-tuning more efficient and less costly? One of the ways to do it is Low-Rank Adaptation (LoRA). Ready? Let’s go.

The intuition behind LoRA: intrinsic dimension

In 2018, researchers (Li et al.) asked a question: how many parameters (or model weights) are actually necessary for a model to perform well on a given task? Their approach was innovative: instead of training all of the model's parameters directly, they trained the network only within a randomly oriented, low-dimensional subspace of the full parameter space and gradually increased the dimension of that subspace. They were looking for the smallest subspace dimension at which the model still worked effectively, a point they termed the 'measured intrinsic dimension' of a problem.

What they discovered was quite astonishing. In some instances, the intrinsic dimension was significantly lower than the total number of parameters in the original model. This insight opened the door to network compression, suggesting that models could be made smaller, yet remain effective. (Check ‘Large vs Small in AI: The Language Model Size Dilemma’.)

D is the total number of parameters; d is the dimension of the subspace. The figure shows the results of a series of experiments with gradually increasing d: as d grows, performance increases monotonically. d_int90 is the smallest d at which performance reaches at least 90% of the original model's performance. Source: https://arxiv.org/pdf/1804.08838.pdf
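
The subspace idea can be sketched on a toy problem. This is an illustration under stated assumptions, not the setup from the paper: the "model" has D = 1000 parameters that must satisfy m = 10 random linear constraints, and the function name `subspace_loss` and all sizes are hypothetical. The parameters are written as theta = theta0 + P z, where theta0 and the random matrix P are fixed and only the d-dimensional vector z is optimized (here with least squares instead of gradient descent, to keep the example short).

```python
import numpy as np


def subspace_loss(d, D=1000, m=10, seed=0):
    """Best loss reachable when only a random d-dimensional subspace is trained."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, D))   # m linear constraints on D parameters
    b = rng.standard_normal(m)        # constraint targets
    theta0 = rng.standard_normal(D)   # fixed random initialization
    P = rng.standard_normal((D, d))   # fixed random projection, never trained

    # theta = theta0 + P @ z, so minimizing ||A theta - b||^2 over z
    # is an ordinary least-squares problem in d unknowns.
    z, *_ = np.linalg.lstsq(A @ P, b - A @ theta0, rcond=None)
    theta = theta0 + P @ z
    return float(np.sum((A @ theta - b) ** 2))


for d in (2, 5, 10, 20):
    print(f"d = {d:2d}  loss = {subspace_loss(d):.3e}")
```

In this toy setting the loss stays clearly above zero while d is smaller than the number of constraints and drops to numerical zero once d reaches it, even though the full parameter space has 1000 dimensions. That threshold plays the role of the intrinsic dimension.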

Building on this idea, in 2020 Aghajanyan et al. took it a step further. They demonstrated that many pre-trained language models have a very low intrinsic dimension. This means there is a more compact way to reparameterize* these models, a 'low dimension reparameterization', that is just as effective for fine-tuning as using the full set of parameters.

*Reparameterization, in this context, refers to the process of representing the model's functionality with fewer, more efficient parameters.

This work paved the way for the development of Low-Rank Adaptation (LoRA). The creators of LoRA were inspired by the concept that the changes in a model's weights during adaptation also exhibit a low intrinsic dimension.

How LoRA works: low-rank matrices explained

LoRA proposes a way to adapt and fine-tune large models more resource-efficiently. The core idea is to freeze the pre-trained weights and train only a small number of added parameters, rather than updating the entire network, which makes fine-tuning more feasible and less demanding in terms of resources.

Source: The original LoRA paper
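
In the notation of the original LoRA paper, the idea can be written compactly. Here W_0 is a frozen pre-trained weight matrix of shape d × k, and r is the chosen rank; note that d and k are the dimensions of the weight matrix in this formula, not the subspace dimension d from the intrinsic-dimension figure.

```latex
W = W_0 + \Delta W = W_0 + BA,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

h = W_0 x + \Delta W x = W_0 x + B A x
```

Only A and B receive gradient updates. In the paper, A starts from a random Gaussian initialization and B starts at zero, so the update BA is zero at the beginning of training and the adapted model initially behaves exactly like the pre-trained one. The paper also scales the update by a constant factor alpha / r, which is omitted above for readability.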

Here's a more streamlined explanation of how LoRA works:

LoRA adds pairs of low-rank matrices (known as update matrices) alongside the model's existing weight matrices.
Only the newly added low-rank matrices are trained; the pre-trained model's original weights stay frozen. The new matrices contain far fewer parameters than the original model.
After training, the product of each pair is added to the corresponding pre-trained weight matrix, adapting the model without a complete overhaul.
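
The steps above can be sketched for a single linear layer in NumPy. This is a minimal illustration, not the reference implementation: the class name `LoRALinear`, the layer sizes, the learning rate, the squared-error loss, and the random "task" targets are all assumptions made for the example.

```python
import numpy as np


class LoRALinear:
    """Toy linear layer: y = x W0^T + (alpha / r) * x A^T B^T, with W0 frozen."""

    def __init__(self, W0, r, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W0.shape
        self.W0 = W0                                              # frozen
        self.A = rng.standard_normal((r, d_in)) / np.sqrt(d_in)   # trainable, random init
        self.B = np.zeros((d_out, r))                             # trainable, zero init
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.W0.T + self.scale * (x @ self.A.T) @ self.B.T

    def loss(self, x, target):
        err = self.forward(x) - target
        return 0.5 * np.mean(np.sum(err ** 2, axis=1))

    def sgd_step(self, x, target, lr=0.01):
        """One gradient step on A and B only; W0 is never updated."""
        n = x.shape[0]
        g = (self.forward(x) - target) / n          # dLoss/dOutput, shape (n, d_out)
        u = x @ self.A.T                            # shape (n, r)
        grad_B = self.scale * g.T @ u               # shape (d_out, r)
        grad_A = self.scale * (g @ self.B).T @ x    # shape (r, d_in)
        self.A -= lr * grad_A
        self.B -= lr * grad_B


rng = np.random.default_rng(1)
d_out, d_in, r = 64, 64, 4
W0 = rng.standard_normal((d_out, d_in))
layer = LoRALinear(W0.copy(), r=r, alpha=r)

x = rng.standard_normal((32, d_in))
# Hypothetical task: targets come from a slightly perturbed weight matrix.
target = x @ (W0 + 0.1 * rng.standard_normal((d_out, d_in))).T

out_before = layer.forward(x)        # equals x @ W0.T because B starts at zero
loss_before = layer.loss(x, target)
layer.sgd_step(x, target)
loss_after = layer.loss(x, target)

full_params = W0.size                         # 64 * 64 = 4096
lora_params = layer.A.size + layer.B.size     # 4 * (64 + 64) = 512
```

Even in this small layer the trainable parameter count drops from 4096 to 512. The gap widens quickly with layer size: for a 4096 × 4096 matrix and r = 8, the same formula gives 8 × (4096 + 4096) = 65,536 trainable parameters instead of 16,777,216.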

Benefits of LoRA

In one of our previous practical articles, we talked with practitioners about leveraging LLMs in your project. Using LoRA or another type of parameter-efficient fine-tuning was one of the general recommendations for cases where fine-tuning is necessary.

Let’s briefly list the benefits of LoRA:

Reduced Training Demands: Training becomes more efficient and less resource-heavy, which lowers hardware requirements. This is particularly advantageous with adaptive optimizers, because there is no need to calculate gradients or maintain optimizer states for the frozen majority of parameters.
Greater Memory Efficiency: Because far fewer parameters are trained, less GPU memory is needed, which allows training on more accessible, consumer-grade GPUs and lowers the barrier to entry for fine-tuning large models.
No Added Inference Latency: At deployment, the trained matrices can be merged into the frozen weights, so a LoRA-adapted model adds no inference latency compared to a fully fine-tuned model.
Efficient Task Switching: By freezing the shared model and replacing only the matrices for different tasks, LoRA significantly reduces storage requirements and makes task-switching more efficient.
Portability of Trained Weights: The rank-decomposition matrices in LoRA have far fewer parameters than the entire model, making the trained LoRA weights easily transferable.
Combination with Other Methods: LoRA is compatible with various prior methods, such as prefix-tuning, enhancing its versatility.
Mitigating Catastrophic Forgetting: By keeping the original weights frozen, LoRA reduces the risk of catastrophic forgetting, a common issue in neural network training where new learning can disrupt previously acquired knowledge.
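
Two of the benefits above, merging for inference and task switching, can be illustrated with a short NumPy sketch. It is a simplified illustration under stated assumptions: the adapters here are random stand-ins rather than trained matrices, and the names `merge`, `unmerge`, `task_a`, and `task_b` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8
W0 = rng.standard_normal((d, k))   # shared, frozen pre-trained weight


def random_adapter():
    """Stand-in for a trained (B, A) pair; real adapters come from fine-tuning."""
    B = rng.standard_normal((d, r)) * 0.01
    A = rng.standard_normal((r, k)) * 0.01
    return B, A


adapters = {"task_a": random_adapter(), "task_b": random_adapter()}


def merge(W, adapter):
    B, A = adapter
    return W + B @ A   # fold the low-rank update into the dense weight


def unmerge(W, adapter):
    B, A = adapter
    return W - B @ A   # recover the weight without this adapter


x = rng.standard_normal((4, k))
B_a, A_a = adapters["task_a"]

# Unmerged: the adapter costs two extra small matrix products per call.
unmerged_out = x @ W0.T + (x @ A_a.T) @ B_a.T

# Merged: a single matrix product, the same cost as a fully fine-tuned layer.
W_a = merge(W0, adapters["task_a"])
merged_out = x @ W_a.T

# Task switching: subtract one adapter and add another, no full model reload.
W_b = merge(unmerge(W_a, adapters["task_a"]), adapters["task_b"])

full_copy_params = d * k        # stored per task with full fine-tuning: 262144
adapter_params = r * (d + k)    # stored per task with LoRA: 8192
```

The merged and unmerged outputs agree up to floating-point rounding, and each additional task costs 8192 stored values instead of 262144 for this layer, which is where the storage and portability benefits come from.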

Conclusion

To sum up, LoRA is an important development in the fine-tuning of LLMs, particularly for models with billions of parameters. It addresses key challenges of traditional fine-tuning, notably its computational and memory requirements.

However, it's important to understand that fine-tuning, made more efficient by LoRA, is just one of several adaptation techniques available for LLMs. Other methods, such as prompt engineering and RAG, provide alternative approaches. Each technique offers different levels of customization and efficiency, and their effectiveness can vary depending on the specific use case.
