What Is QDoRA? DoRA, QLoRA & QDoRA Fine-Tuning Explained

When specialized AI models, such as coders or those designed for specific domains, emerge, fine-tuning is crucial for boosting their performance and tailoring them to specific tasks. Low-Rank Adaptation (LoRA), which we previously discussed, is one such method, but it doesn't always deliver the best results in certain cases. To address these limitations, advanced methods like Weight-Decomposed Low-Rank Adaptation (DoRA), QLoRA, and QDoRA have been developed, offering enhanced performance. What are these methods, and how do they improve upon LoRA? Let’s explore!

In today’s episode, we will cover:

What’s wrong with LoRA?
Here comes Weight-Decomposed Low-Rank Adaptation (DoRA)
How does DoRA work?
The benefits of DoRA
How good are DoRA’s results?
What is QLoRA, and why should you use it?
How does QLoRA work?
Results of QLoRA
How does QDoRA enhance QLoRA?
Conclusion
Bonus: Resources

Why LoRA Falls Short of Full Fine-Tuning

To tailor general models for specific tasks, we need to fine-tune them. This process usually involves retraining all the model’s parameters, but as models get bigger, this becomes more expensive and resource-intensive.

To solve this, Parameter-Efficient Fine-Tuning (PEFT) methods were developed, and one of them is LoRA (Low-Rank Adaptation). LoRA modifies fewer parameters for efficient fine-tuning while keeping the model architecture the same. However, it often doesn’t perform as well as full fine-tuning (FT) where all parameters are retrained. But why?

NVIDIA and HKUST researchers, inspired by the concept of Weight Normalization, compared LoRA and FT. Weight Normalization separates the weight matrix into two components – magnitude (how large the change is) and direction (where the change is happening in the parameter space). This separation helps better understand how each method updates the model’s weights, allowing a more detailed comparison of the flexibility and precision in adjustments made by LoRA and FT.

They discovered that these two methods update the model in different ways":

LoRA updates the model proportionally, changing both magnitude and direction consistently.
Full fine-tuning, on the other hand, shows more complex behavior. It can make subtle changes in direction while making larger changes in magnitude or vice versa. This flexibility allows FT to adapt more precisely to tasks.

As LoRA lacks this flexibility, it may not always be able to make the same precise adjustments as FT. But full fine-tuning still requires retraining all parameters, which is computationally intensive.

What Is DoRA? Weight-Decomposed Low-Rank Adaptation

To bridge the gap between efficiency and performance, NVIDIA and HKUST researchers developed a new method called DoRA (Weight-Decomposed Low-Rank Adaptation). DoRA separates the pre-trained weights into magnitude and direction, just like in their analysis, and fine-tunes both parts. It makes fine-tuning easier and more stable.

DoRA has been tested on various tasks, from NLP to tasks that involve both language and vision, and it consistently performs better than LoRA without adding extra inference time.

How DoRA Works: Step-by-Step Weight Decomposition

DoRA uses LoRA for the directional updates to keep things efficient but improves the model’s overall learning ability, making it behave more like full fine-tuning.

Here’s how it works step by step:

Image Credit: Original DoRA paper

Decomposing weights: DoRA starts by splitting the pre-trained weights into two parts:
- Magnitude: This represents how large or strong the weight is.
- Direction: This represents how the weight is changing or moving during training.

DoRA fine-tunes both the magnitude and the direction separately:
- The magnitude is directly fine-tuned and can be adjusted during training.
- The direction part is handled using LoRA, because it is a much larger part. LoRA breaks this into smaller, more manageable updates without adding too many trainable parameters.
After fine-tuning, DoRA merges updates with the original weights, keeping the model as efficient as before without extra computation for predictions.

DoRA vs LoRA: Key Differences and Performance Gains

As DoRA method was made to solve the limitations of LoRA let’s look at its advantages:

Simplifies the task: By breaking the weight updates into two parts, DoRA makes fine-tuning easier and more stable.
More efficient: It uses LoRA to handle the complex directional updates, which reduces the number of parameters it needs to train. Focus only on updating the direction simplifies LoRA’s job.
Performs closer to full fine-tuning: DoRA shows better learning capacity, performing closer to full fine-tuning methods (which retrain all parameters), without the heavy computational cost and getting overwhelmed.

Image Credit: Original DoRA paper

DoRA Benchmarks: How It Compares to LoRA and Full Fine-Tuning

To explore how good is DoRA’s performance researchers tested it on various benchmarks and different tasks and here’s what they have found out:

Tested on eight commonsense reasoning datasets using the LLaMA family of models (7B, 13B, LLaMA2-7B, LLaMA3-8B) DoRA outperformed LoRA by 14.4%.

Image Credit: Original DoRA paper

Image-Text Tasks: DoRA surpassed LoRA by ~1%.
Video-Text Tasks: DoRA performed 2% better than LoRA, achieving results close to full fine-tuning (FT) accuracy.
Tested on visual instruction tasks using the LLaVA-1.5-7B model.DoRA showed 0.7% higher accuracy than LoRA and 1.1% better accuracy compared to FT.
Memory Efficiency: The training memory required for DoRA was reduced by 24.4% for LLaMA and 12.4% for VL-BART compared to the unmodified version.

DoRA’s ability to fine-tune both the magnitude and direction of weight updates has significantly improved the flexibility of parameter-efficient fine-tuning methods, particularly when precise adaptations are required for specialized tasks. However, even though DoRA enhances performance without adding inference time, there remains another pressing challenge in fine-tuning large models: managing memory usage. This is where QLoRA comes in.

What Is QLoRA? Quantized Low-Rank Adaptation Explained

On the path to fine-tuning huge LLMs (like a 65 billion parameter one) and using less memory, researchers from University of Washington developed QLoRA: Efficient Fine-tuning of Quantized LLMs. It drastically cuts down the memory needs from 780 GB to less than 48 GB, allowing it to run on a single 48GB GPU, common in high-end consumer computers. Despite using less memory, it keeps the same high performance as regular fine-tuning methods and doesn’t slow things down. QLoRA is all about quantization.

How does QLoRA work?

Quantization reduces data by converting it to a simpler form, for example, changing 32-bit floats to 8-bit integers. This stores less detail but remains close to the original. Data is scaled to fit properly in the new range.

QLoRA implements a few memory-saving techniques which are steps of its working process:

4-bit NormalFloat (NF4) Quantization: This process simplifies the data by converting large amounts of information into smaller, easier-to-manage chunks. Think of it like taking high-resolution images and resizing them without losing too much quality and remaining close to the original. Data is scaled to fit properly in the new range.
Double Quantization: It’s a trick to shrink memory even more by simplifying how data is stored. This is a second step to further reduce memory use, which simplifies data even more, using fewer bits without losing accuracy. It quantizes the quantization constants, saving an average of about 0.37 bits per parameter (approximately 3 GB for a 65B model).

“On average, for a blocksize of 64, this quantization reduces the memory footprint per parameter from 32/64 = 0.5 bits, to 8/64 + 32/(64 · 256) = 0.127 bits, a reduction of 0.373 bits per parameter.” – From Original QLoRA paper.

Paged Optimizers: These are smart memory managers. When the GPU runs out of memory, paged optimizers automatically move data to the CPU and bring it back when needed. This prevents the system from crashing or slowing down due to memory spikes.

Image Credit: Original QLoRA paper

QLoRA Benchmarks and Memory Savings

The Guanaco model family, trained using QLoRA, demonstrated great performance:

The Guanaco 65B model achieves 99.3% of the performance level of ChatGPT on the Vicuna benchmark, closing the gap to ChatGPT with only 24 hours of fine-tuning.
The Guanaco 33B model also performs very well, achieving 97.8% of ChatGPT's performance.

Image Credit: Original QLoRA paper

Even with QLoRA, loading the model onto GPUs still requires a lot of memory. Can something be better at reducing memory use than QLoRA? The answer is QDoRA.

What Is QDoRA? How It Combines DoRA and QLoRA

QDoRA replaces LoRA with DoRA in the QLoRA framework, introducing a more advanced approach to fine-tuning by decomposing the weight matrix into two components: magnitude and direction. This approach was introduced in a recent project ”Efficient finetuning of Llama 3 with FSDP QDoRA” led by Kerem Turgutlu. It also uses FSDP (Fully Sharded Data Parallel) to split the model and train it across multiple GPUs, making it even more efficient.

This allows for more precise weight adjustments during fine-tuning, unlike QLoRA, which applies uniform changes. By combining this decomposition with quantization, QDoRA enhances the flexibility of full fine-tuning while retaining the memory efficiency of QLoRA.

This approach enables QDoRA to outperform QLoRA in various tasks by making more granular updates to model weights, thus achieving closer performance to full fine-tuning without significantly increasing memory or computational requirements.

In tests, QDoRA fine-tuned LLaMA2-7B and LLaMA3-8B on Orca-Math dataset, outperforming both QLoRA and full fine-tuning with less memory use. This makes QDoRA ideal for developers with limited GPU resources.

Image Credit: Original DoRA paper

Llama3, when enhanced with either QDoRA or full fine-tuning, significantly exceeds the performance of both QLoRA and Llama2 during math data training.

Image Credit: Answer.AI

In summary, QDoRA brings together the best of both worlds – efficient memory usage and more adaptable, performance-oriented fine-tuning.

DoRA vs QLoRA vs QDoRA: Which Method to Choose?

We discussed several methods that work better than LoRA method for fine-tuning:

DoRA is more flexible and approach than LoRA, that makes fine-tuning easier and more stable.
QLoRA, using quantisation, allows using less memory while fine-tuning huge LMs.
QDoRA combines DoRA and QLoRA for efficient fine-tuning that is really close to or outperforms full fine-tuning.

Every task needs its specific approach, and when in one case LoRA will be enough for efficient fine-tuning, in the other case (with larger models and with huge memory) only QDoRA can work efficiently.

Resources