One term that has been generating a lot of buzz recently is RAG.
What is it, and how can you use it to improve an LLM's performance? Let’s dive in!
We discuss the origins of RAG, which LLM limitations it tries to fix, RAG’s architecture, and why it is so popular. You will also get a curated collection of helpful links for your RAG experiments.
Introduction
Although the term has been circulating widely of late, it dates back to 2020, when researchers at Meta AI introduced it in their paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
The Retrieval-Augmented Generation (RAG) model is an architecture designed to harness the capabilities of large language models (LLMs) while providing the freedom to incorporate and update custom data at will. Unlike the resource-intensive process of building bespoke language models or repeatedly fine-tuning them whenever the data changes, RAG offers a more streamlined and efficient approach for developers and businesses.
As you probably know, pre-trained language models undergo training using vast amounts of unlabeled text data in a self-supervised* manner. Consequently, these models acquire a significant depth of knowledge, leveraging the statistical relationships underlying the language data they have been trained on.
*Self-supervised learning uses unlabeled data to generate its own supervisory signal for training models.
This knowledge is encapsulated within the model's parameters, which can be harnessed to execute various language-related tasks without the need for external knowledge sources. This is commonly referred to as a parameterized implicit knowledge base.
Although this parameterized implicit knowledge base is impressive and lets the model perform surprisingly well on some queries and tasks, the approach is still prone to errors and so-called hallucinations*.
*Hallucination in language models occurs when false information is generated and presented as true.
Why do errors happen in LLMs?
It is essential to recognize that LLMs do not possess a genuine understanding of language in the human sense. They rely on statistical patterns within the language they were trained on. Recent research suggests that no matter how much implicit knowledge a model has, it still has trouble with logical reasoning. While LLMs have achieved significant success in text generation, they still struggle to use the knowledge they already hold reliably, which often results in hallucinations.
How can this be dealt with? Perhaps surprisingly, by introducing more external data. That data can be used to expand or revise the model’s memory and as a basis for assessing and interpreting its predictions. This is precisely what the Meta AI researchers implemented in a new type of model called RAG.
Other limitations of current LLMs
Apart from hallucinations, contemporary language models have a significant shortcoming for companies that want to adopt them: they lack the context of a company's internal data. Addressing this through fine-tuning means ML practitioners must re-tune the model every time the data changes. RAG addresses these limitations as well.
RAG Architecture
RAG combines two main components:
1. The Retriever: This component is a pre-trained neural retriever that accesses non-parametric data external to the language model, stored as a dense vector index. The original paper uses the Dense Passage Retriever to access a dense vector index of Wikipedia.
2. The Generator: This is a pre-trained language model. In the original paper, it's a pre-trained seq2seq transformer.
The retriever and generator components are trained jointly without any direct supervision on which document should be retrieved.
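Why no document-level supervision is needed can be made concrete with the two formulations that give the RAG-Sequence and RAG-Token models their names. The retrieved passage z is treated as a latent variable and marginalized over the top-k passages, so the training signal comes only from the target output. In the sketch below, x is the input query, y is the output sequence of N tokens, p_η is the retriever's distribution over passages, and p_θ is the generator. The notation is reproduced from memory of the original paper and should be checked against it before being relied on.

```latex
% RAG-Sequence: the same retrieved passage z conditions the whole output sequence
p_{\text{RAG-Sequence}}(y \mid x) \approx
  \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% RAG-Token: a different passage may be marginalized over for each output token
p_{\text{RAG-Token}}(y \mid x) \approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x) \, p_\theta(y_i \mid x, z, y_{1:i-1})
```

Because both expressions are differentiable with respect to the retriever and generator parameters, minimizing the negative log-likelihood of the target updates both components at once.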

Image credit: The original paper
Here's how the architecture works when a user submits a query:
Retrieve Phase: First, the retriever searches for and fetches the most relevant text passages for the user's query.
Generative Phase: In this phase, the transformer generates text based on its parameterized implicit knowledge, the user's query, and the retrieved text. In other words, the final generated text results from not only the generator's built-in knowledge but also the external data fetched in the first step.
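The two phases above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the paper's implementation: the bag-of-words `embed` function stands in for a neural encoder such as DPR, `generate` is a placeholder where a real seq2seq model or LLM call would go, and all function names and passages are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (token -> count); a stand-in for a neural encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, passages, k=2):
    """Retrieve phase: return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved):
    """Concatenate the retrieved passages with the query to form the generator input."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieved))
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate(prompt):
    """Generative phase placeholder: a real system would call a language model here."""
    return f"<model output conditioned on {len(prompt)} prompt characters>"

# Hypothetical external knowledge source.
PASSAGES = [
    "RAG was introduced by Meta AI researchers in 2020.",
    "The retriever in RAG fetches passages relevant to a query.",
    "Bananas are a good source of potassium.",
]

query = "Who introduced RAG in 2020?"
top = retrieve(query, PASSAGES, k=2)
prompt = build_prompt(query, top)
answer = generate(prompt)
```

The generator sees both the query and the retrieved passages, so its output can draw on facts that were never part of its training data. The unrelated passage about bananas scores zero similarity and never reaches the prompt.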
This approach enriches parametric-memory generation models (parametric memory being the knowledge and capabilities embodied in a pre-trained seq2seq model) with non-parametric memory (an extensible knowledge source that isn’t confined to a fixed set of parameters) through a fine-tuning method.
Why is RAG so popular?
RAG's popularity arises from its unique combination of advantages and cost-effectiveness. It combines the generation flexibility of "closed-book" (parametric-only) approaches with the performance of "open-book" retrieval-based approaches.
RAG's benefits:
Dynamic Knowledge Control: RAG allows easy modification and supplementation of its internal knowledge without retraining the entire model, saving time and resources.
Current and Reliable Information: It gives the model access to up-to-date, trustworthy facts, provided the external knowledge source is kept current.
Transparent Source Verification: Users can verify the model's claims by accessing its sources, enhancing trust in its outputs.
Mitigation of Information Leakage: RAG's grounding in external, verifiable facts reduces the chances of the model leaking sensitive data or generating incorrect information.
Domain-Specific Knowledge: RAG can provide extra context and information about your internal data.
Cost-Efficient and Low Maintenance: RAG reduces the need for continuous training and parameter updates, lowering computational and financial costs for enterprise LLM-powered chatbots.
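Two of the benefits listed above, dynamic knowledge control and source verification, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `TinyKnowledgeBase` is a hypothetical in-memory stand-in for a vector database, the keyword-overlap scoring replaces real embedding search, and the document IDs and texts are made up.

```python
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for an embedding model."""
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyKnowledgeBase:
    """Hypothetical in-memory store (doc_id -> text) standing in for a vector database."""

    def __init__(self):
        self.docs = {}

    def upsert(self, doc_id, text):
        """Add a new document or overwrite an existing one; no model retraining involved."""
        self.docs[doc_id] = text

    def search(self, query, k=1):
        """Return up to k (doc_id, text) pairs ranked by token overlap with the query."""
        q = set(tokens(query))
        scored = [(len(q & set(tokens(text))), doc_id) for doc_id, text in self.docs.items()]
        scored = [(score, doc_id) for score, doc_id in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(doc_id, self.docs[doc_id]) for _, doc_id in scored[:k]]

kb = TinyKnowledgeBase()
kb.upsert("pricing", "The basic plan costs 10 dollars per month.")
kb.upsert("refund-policy", "The refund window is 14 days.")

question = "How long is the refund window?"
before = kb.search(question)

# The policy changes: update the stored document instead of fine-tuning the model.
kb.upsert("refund-policy", "The refund window is 30 days.")
after = kb.search(question)
```

The same question now retrieves the revised policy text, and each result carries its document ID, which is what allows a user to trace an answer back to its source.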
Conclusion
The Retrieval-Augmented Generation (RAG) model is a step toward addressing the limitations of LLMs. By blending pre-trained parametric and extensible non-parametric memory, it aims to tackle outdated information and hallucinations. This mix allows for more accurate and domain-specific responses, presenting a cost-effective, low-maintenance option for enterprises. However, the effectiveness of RAG is influenced by the precision of the embedding algorithm, database performance, and the context window size of the foundational model. Additionally, retrieval latency and reliance on initial training data highlight areas for further refinement. While RAG models present a notable advancement, they also underline the continued journey toward achieving more reliable and evolved language models in ML and NLP.
Tools for RAG implementation
RAG relies on vector storage for retrieval. See our curated list of open-source vector databases and libraries, then pick your implementation tools below.
RAG-Token Model, RAG-Sequence Model, a non-finetuned version of the RAG-Sequence model from the original paper
RAG on Hugging Face Transformers (open-source)
REALM library (open-source)
NVIDIA NeMo Guardrails (open-source)
LangChain (open-source)
LlamaIndex (open-source)
Weaviate Verba: The Golden RAGtriever (open-source)
Deepset Haystack (open-source)
Arize AI Phoenix (open-source)
Platforms that offer free plans with limited usage:
IBM’s watsonx, a new AI and data platform that offers RAG (free for individuals and POCs)
Oracle Database 23c (free for developers)
Cohere Chat API with RAG in public beta (free plan for prototyping)
Amazon SageMaker (free plan with limited usage)
Cloudflare Workers AI (free plan with limited usage)
Azure Machine Learning (free trial, no free plan)
ChatGPT Retrieval Plugin (under waitlist)
Links to resources
Turing Post
Sharon Zhou on why RAG is overhyped — and what actually works: Interview with Sharon Zhou from Lamini








