Retrieval-Augmented Generation (RAG) has become increasingly popular in recent years as a way to enhance large language models (LLMs) with external knowledge.

We have covered RAG-related topics in the following articles:

These articles are, so far, our most-read, which shows the ongoing interest in the topic. However, traditional RAG systems, developed around 2020, were designed when LLMs were severely limited in their ability to handle long contexts. This led to a design where retrievers worked with short text units, typically 100-word Wikipedia paragraphs, and had to search through massive corpora to find relevant information.

The landscape of language models has changed dramatically since then. In 2023 and 2024, we've seen the emergence of LLMs capable of handling much longer contexts, with some models able to process up to 128,000 tokens or even 1 million tokens (as with Google's Gemini 1.5 Pro). This significant increase in context length capabilities has opened up new possibilities for RAG systems.

Enter the LongRAG framework. By revisiting the fundamental design choices of RAG systems in light of recent advances in LLMs, LongRAG offers a promising direction for improving RAG performance with long-context LLMs. Let’s dive in!

In today’s episode, we will cover:

  • Original RAG and its working process

  • Intuition behind LongRAG

  • How LongRAG works: the architecture

  • LongRAG Advantages

  • Bonus: Resources

Original RAG and its working process

RAG enables the use of LLMs on previously unseen data without requiring fine-tuning. Additionally, knowledge in natural language form can be offloaded from the parametric memory of the LLM to an external corpus, which a separate retrieval component searches at query time.

RAG working process:

  • Query encoder: It encodes a user query into a numerical representation suitable for searching through a database of text passages or documents.

  • Retriever: It searches an external database of indexed documents using the vector produced by the query encoder. The retriever identifies the top-K most relevant documents based on the selected search algorithm.

  • Generator: The large language model conditions on the documents selected by the retriever and the input query to generate the output.
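
The three steps above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the bag-of-words `encode` function stands in for a neural query encoder, cosine similarity stands in for the search algorithm, and `build_prompt` only assembles the text that a generator LLM would receive. All function names and the sample corpus are made up for this example.

```python
import math
import re
from collections import Counter

def encode(text):
    """Toy encoder (assumption): bag-of-words counts instead of a neural embedding."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=2):
    """Retriever: return the top-k passages most similar to the encoded query."""
    q = encode(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, encode(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Generator input: the LLM conditions on the retrieved passages plus the query."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "France is known for the Eiffel Tower in Paris.",
]
question = "What is the capital of France?"
top = retrieve(question, corpus, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system the encoder would be a dense embedding model, the corpus would sit in a vector index, and `prompt` would be sent to the LLM.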

Intuition behind LongRAG

In the paper “LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs”, researchers from the University of Waterloo propose modifying the retrieval process by:

  • Extending the size of each retrieval unit from roughly 100 words in the original RAG to about 4,000 tokens in LongRAG.

  • Shifting the focus from pinpointing exact information related to the user's query in the original RAG to selecting documents that contain relevant, though not necessarily exact, information in LongRAG.

Image Credit: The original paper

The rationale behind this is to transition from retrieving precise, small snippets of information to selecting larger, more contextually rich and semantically integral pieces. This adjustment eases the burden on the retriever and more evenly distributes tasks between the retriever and the generator. Consequently, LongRAG capitalizes on the extended context capabilities of the latest LLMs, which serve as the generator, benefiting from recent significant enhancements in processing long contexts.

How LongRAG works: the architecture

LongRAG introduces three architectural updates to the original RAG:

  • Long Retrieval Unit: Instead of splitting large documents into passages of roughly 100 words, LongRAG uses retrieval units as long as an entire document or a group of documents, built with the Group Documents Algorithm proposed in the paper. Increasing the unit size to about 4K tokens reduces the Wikipedia corpus from 22M to 600K retrieval units.

  • Long Retriever: Identifies the Long Retrieval Units for further processing.

  • Long Reader (Generator): Extracts answers from the Long Retrieval Units. It is an LLM prompted with a user query and the Long Retrieval Units.

Image Credit: The original paper
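
To make the idea of a Long Retrieval Unit more concrete, here is a simplified sketch of greedy document grouping. It is written in the spirit of the Group Documents Algorithm but is not the paper's exact procedure: the function name, the choice to visit sparsely linked documents first, and the use of token counts as the size measure are all assumptions of this example.

```python
def group_documents(doc_sizes, links, max_size):
    """Greedily merge related documents into groups no larger than max_size.

    doc_sizes: dict mapping doc id -> size (token counts, an assumption here)
    links:     dict mapping doc id -> set of related doc ids
    """
    groups = []    # list of sets of doc ids
    group_of = {}  # doc id -> the group (set) that currently holds it

    # Visit documents with few links first (a design choice of this sketch).
    for doc in sorted(doc_sizes, key=lambda d: (len(links.get(d, ())), d)):
        new_group = {doc}
        new_size = doc_sizes[doc]

        # Collect the distinct existing groups that hold a related document.
        related = []
        for r in sorted(links.get(doc, ())):
            g = group_of.get(r)
            if g is not None and all(g is not seen for seen in related):
                related.append(g)

        # Try the smallest related groups first, merging while the cap allows.
        related.sort(key=lambda g: sum(doc_sizes[d] for d in g))
        for g in related:
            g_size = sum(doc_sizes[d] for d in g)
            if new_size + g_size <= max_size:
                new_group |= g
                new_size += g_size
                groups = [x for x in groups if x is not g]

        for d in new_group:
            group_of[d] = new_group
        groups.append(new_group)
    return groups

# Hypothetical mini-corpus: sizes in tokens, links between related pages.
doc_sizes = {"A": 1500, "B": 1200, "C": 1000, "D": 3000, "E": 800}
links = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}, "D": {"E"}, "E": {"D"}}

units_4k = group_documents(doc_sizes, links, max_size=4000)
units_3k = group_documents(doc_sizes, links, max_size=3000)
print(sorted(sorted(g) for g in units_4k))
print(sorted(sorted(g) for g in units_3k))
```

With a 4,000-token cap the five documents collapse into two units; with a 3,000-token cap fewer merges fit, so more and smaller units remain. This is the same trade-off between unit size and corpus size described above.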

This is how LongRAG works step-by-step:

  • Retrieval (named Long Retriever in the original paper):

    • Encoding: Two encoders map the input question and the retrieval unit each to a d-dimensional vector.

    • Forming long retrieval units: The Group Documents Algorithm creates groups of related documents. Each document is merged with the documents it is connected to, as long as the group does not exceed a specified maximum size. Because related documents end up in the same unit, the retriever can return them together rather than having to find each one separately.

    • Similarity search: The vectors from the encoding step are used to compute the similarity between the question and each retrieval unit, and the highest-scoring long retrieval units are selected.

    • Results aggregation: The top-ranked groups are combined into a single long context for the reader. The number of groups included is adjusted to their size: the larger the units, the fewer of them are needed.

  • Generation (named Long Reader in the original paper): The LLM takes the user query and the aggregated result from the retrieval step and generates the final output. It’s important that the LLM used in the long reader can handle long contexts and does not exhibit excessive position bias.
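
The retrieval and generation steps above can be sketched as follows. This is a minimal illustration under stated assumptions. The vectors are hand-written placeholders rather than real encoder outputs. A long unit is scored by its best-matching chunk, which is one practical way to handle encoders with limited input length. The token budget is a stand-in for "adjusting the number of groups based on their size". All names are hypothetical.

```python
def dot(u, v):
    """Dot-product similarity between two dense vectors."""
    return sum(a * b for a, b in zip(u, v))

def score_unit(query_vec, chunk_vecs):
    """Score a long retrieval unit by its best-matching chunk (assumption of this sketch)."""
    return max(dot(query_vec, c) for c in chunk_vecs)

def select_units(query_vec, units, token_budget):
    """Rank units by score, then keep the top ones that fit the reader's token budget."""
    ranked = sorted(units, key=lambda u: score_unit(query_vec, u["chunk_vecs"]), reverse=True)
    chosen, used = [], 0
    for unit in ranked:
        if used + unit["tokens"] > token_budget:
            break  # stop at the first unit that no longer fits
        chosen.append(unit)
        used += unit["tokens"]
    return chosen

def reader_prompt(question, chosen):
    """Concatenate the selected units into one long context for the reader LLM."""
    context = "\n\n".join(u["text"] for u in chosen)
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# Placeholder data: 2-d vectors and token counts are invented for illustration.
units = [
    {"id": "u1", "text": "Unit one text.", "tokens": 4000, "chunk_vecs": [[0.9, 0.1], [0.2, 0.3]]},
    {"id": "u2", "text": "Unit two text.", "tokens": 3500, "chunk_vecs": [[0.1, 0.8]]},
    {"id": "u3", "text": "Unit three text.", "tokens": 4000, "chunk_vecs": [[0.5, 0.5], [0.7, 0.2]]},
]
query_vec = [1.0, 0.0]

chosen = select_units(query_vec, units, token_budget=8000)
prompt = reader_prompt("Example question?", chosen)
print([u["id"] for u in chosen])
```

Because each unit is already about 4K tokens, only a handful fit into the reader's context, which is why the retriever needs to return far fewer units than in a short-passage setup.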

LongRAG Advantages

LongRAG optimizes retrieval by processing Wikipedia into 4,000-token units, reducing the number of units from 22 million to 600,000, roughly a 37-fold reduction. With larger units, the retriever needs to recall far fewer of them, documents are not truncated, and more context is preserved. The longer units also help with complex questions, because the information needed to answer them is more likely to sit together in one unit.

It achieves impressive retrieval scores and comparable results to state-of-the-art models without additional training, demonstrating the efficiency and potential of combining RAG with long-context LLMs.

Here are some key results from the implementation of LongRAG:

  • Recall@1: Increased to 71% on the Natural Questions (NQ) dataset, up from the previous 52%.

  • Recall@2: Improved to 72% on the HotpotQA dataset (full-wiki), a rise from the earlier 47%.

  • Exact Match (EM): Achieved an EM score of 62.7% on NQ and 64.3% on HotpotQA (full-wiki), demonstrating performance on par with state-of-the-art models.
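
For readers less familiar with these metrics, here is a rough sketch of how answer recall at k and exact match are commonly computed in open-domain QA. The normalization (lowercasing, stripping punctuation and English articles) follows a widespread convention; the exact evaluation scripts behind the numbers above may differ, and the function names are our own.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def answer_recall_at_k(retrieved_units, gold_answers, k):
    """True if any gold answer string appears in the top-k retrieved units."""
    context = normalize(" ".join(retrieved_units[:k]))
    return any(normalize(g) in context for g in gold_answers)

retrieved = ["Paris is in France.", "The Eiffel Tower is in Paris."]
gold = ["Eiffel Tower"]

print(exact_match("The Eiffel Tower.", gold))
print(answer_recall_at_k(retrieved, gold, k=1))
print(answer_recall_at_k(retrieved, gold, k=2))
```

Averaging these per-question booleans over a dataset gives the percentage scores. With long retrieval units, a small k already covers a lot of text, which is part of why recall at k=1 or k=2 can be high.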

Bonus: Resources

Let us know if you have experimented with LongRAG and what you think of it.

Thank you for reading! Share this article with three friends and get a 1-month subscription free! 🤍
