Great news! In only 2.5 months we reached 1,000 subscribers on YouTube. Please subscribe as well, or listen to our interviews as podcasts on Spotify or Apple.
What is actually happening with retrieval-augmented generation (RAG) in 2025? In this episode of Inference, I sit down with Amr Awadallah, founder & CEO of Vectara, founder of Cloudera, ex-Google Cloud, and the original builder of Yahoo's data platform, to unpack it all. We get into why RAG is far from dead, how context windows mislead more than they help, the difference between RAG and fine-tuning, and what it really takes to separate reasoning from memory.
Amr breaks down the case for retrieval with access control, the rise of hallucination detection models, and why DIY RAG stacks fall apart in production. We also talk about the roots of RAG, Amr's take on AGI timelines, and what science fiction taught him about the future. This interview is packed with insights.
If you care about truth in AI, or you're building with (or around) LLMs, this one will reshape how you think about trustworthy systems. Watch it now →
This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription, do it here.
The transcript is edited for clarity, brevity, and sanity. All supporting links are in the YouTube description, just below the video. Check it out:
Ksenia Se:
Hi Amr, welcome to Inference by Turing Post. Very happy to see you here.
You built large-scale data systems at Yahoo, helped launch the data platform era at Cloudera, and led developer strategy at Google Cloud. Now, at Vectara, you're tackling something deeper: the layer of trust, the foundation of truth for AI.
So my question is: as more models come out with bigger context windows, some people are starting to say RAG is dead.
Is it? What do people misunderstand about RAG?
Amr Awadallah:
Even the largest companies, Google, Amazon, Microsoft, are all still leveraging RAG, even with these larger context windows.
The larger window just means we can give the language model more information in the prompt. But we still have to pick what to give it. If we just stuff that window full of information, some of it irrelevant or noisy, the model actually struggles. It can hallucinate more. It might not find the right facts buried inside all that input.
These models are pretty good at what's called single-needle retrieval: they can find one, maybe two useful pieces of information in a haystack. But they're not great at finding multiple relevant facts all at once. They also tend to pay more attention to content at the beginning or end of the window, and less to the middle.
RAG, on the other hand, separates memory from reasoning. Memory is your knowledge; reasoning is your intelligence. It's how humans work too. Take a high school exam: if it's open book and you have an assistant who highlights exactly the right sentence to focus on, you'll perform way better. Same with language models. If we add a smart retrieval layer that surfaces the most relevant "needles," the model can reason more effectively with that curated input.
And there's another big benefit: security.
If you dump everything into the context window, you can't control what users might access. They might use prompt injection attacks to retrieve things they shouldn't. But with a retrieval system in front, you can filter and mask sensitive information. No matter how someone phrases a prompt, they won't be able to access what they're not allowed to see.
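The pattern Amr describes, permission-filtering documents before they ever reach the model, can be made concrete with a short sketch. Everything here is hypothetical (the document fields, the `acl` sets, the toy relevance score); the point is only the ordering: filter by access first, rank second, so disallowed content never enters the prompt.

```python
# Hypothetical sketch of retrieval with access control in front of the LLM.
# Documents are permission-filtered BEFORE ranking, so content the user is
# not allowed to see never reaches the context window, however the prompt
# is phrased.

def user_can_read(user_groups, doc):
    """Allow access only if the user shares a group with the document's ACL."""
    return bool(user_groups & doc["acl"])

def retrieve(query_terms, docs, user_groups, k=3):
    allowed = [d for d in docs if user_can_read(user_groups, d)]
    # Toy relevance score: term overlap with the document text.
    scored = sorted(
        allowed,
        key=lambda d: len(query_terms & set(d["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

docs = [
    {"text": "Q3 revenue grew 12 percent", "acl": {"finance"}},
    {"text": "Support handbook: refund policy", "acl": {"support", "finance"}},
]

# A support-only user cannot surface the finance document, no matter the
# query: prompt injection has nothing left to extract.
hits = retrieve({"revenue"}, docs, user_groups={"support"})
print([d["text"] for d in hits])
```

The design choice worth noticing: the filter runs inside the retrieval layer, not as a post-hoc check on the model's answer, which is exactly why a cleverly structured prompt cannot leak what was never retrieved.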
Ksenia:
Is RAG the main component in building trust in AI, or are there others?
Amr:
Yes, RAG with access control is definitely one of the key components. Retrieval with proper access control is one of the best ways to mitigate information leakage, especially from prompt injection attacks.
Compare that to fine-tuning. If you take a large language model and just stick all your information into it, whether through a large context window or actual fine-tuning, then someone could eventually extract that information with a cleverly structured prompt. With RAG, that's not possible, because the information never even reaches the model unless it's relevant. The retrieval layer filters it before it gets to the LLM. Make sense?
Ksenia:
Absolutely. So then, why do people keep saying RAG is dead?
Amr:
Honestly, nobody serious in the industry says that. Maybe some reporters or analysts, but anyone actually deploying AI over their company's data is using RAG.
All the major players, Microsoft, Google, Amazon, even OpenAI, are doing it that way. Just a few weeks ago, OpenAI launched something called Open Connectors. That lets you connect your ChatGPT session directly to your data. That's RAG. They're not shoving all your data into the model or cramming it into the context window. No, they retrieve the right nuggets on demand.
Let's say you ask a question, and the answer is in your Google Drive. RAG fetches the three or four relevant documents and feeds just that into the prompt.
So RAG is absolutely a foundational architecture, and it will be with us for many, many years. I don't see it going away. I'd challenge anyone who says otherwise.
There's also a big performance angle here. Putting a lot into the context window is expensive. The compute cost scales with what we call order N squared, meaning that if you double the number of words, you quadruple the cost.
Smart retrieval systems, on the other hand, are sublinear, around log N. So they're way more efficient, both in compute cost and in latency. They're faster, cheaper, and more targeted.
So again: if you look at any serious, production-grade system that's reasoning over data, it's using RAG.
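The scaling argument is easy to check numerically. Below is a toy cost model (the constants are invented; only the growth rates matter): self-attention over a context window grows with N², while a lookup in a balanced search index grows with log N.

```python
import math

# Toy cost model: the constants are made up, only the growth rates matter.
def attention_cost(n_tokens):
    return n_tokens ** 2          # O(N^2): every token attends to every other

def index_lookup_cost(n_docs):
    return math.log2(n_docs)      # O(log N): e.g. a balanced search index

# Doubling the input quadruples the attention cost, but adds only a
# constant amount to the index-lookup cost.
for n in (1_000, 2_000, 4_000):
    print(n, attention_cost(n), round(index_lookup_cost(n), 1))
```

This is the "quadruple the cost" arithmetic from the answer above: (2N)² = 4N², while log(2N) = log N + 1.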
Ksenia:
Okay, okay. Next time I hear an ML person saying it's dead, I'll send them straight to you!
Amr:
Yes, please.
Ksenia:
A lot of teams think that building a RAG system just means plugging into a vector database. As I understand it, you've introduced a lot more: metadata, updates, citation tracking, scoring, observability.
So where do DIY or in-house systems fall short in your opinion?
Amr:
There are actually two questions here.
First: is retrieval only about vector systems? The answer is no. If it were, we'd be calling it VAG, vector-augmented generation, not RAG. The "R" stands for retrieval, and retrieval can happen in many ways.
Vector databases are great for semantic data. If you're searching over unstructured documents and doing semantic matching, then yes, vectors and vector DBs are very effective.
But if you're working with knowledge graphs or instruction-based data, something like Neo4j would be more appropriate.
If you're dealing with semi-structured documents, MongoDB might be a better fit.
And for traditional structured or numerical data, Snowflake or Oracle would make more sense.
So RAG is really about combining all these sources and retrieving the most relevant "needles" from across them, based on the task at hand. Vector DBs are just one part of that picture.
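In code, "combining all these sources" often amounts to a small dispatcher that fans a query out to the right backends and merges the results. A hypothetical sketch; every backend name and search function below is invented (a real system would call a vector DB, a graph DB, a SQL warehouse, and so on):

```python
# Hypothetical multi-source retrieval dispatcher. The backends here are
# stand-in functions; real ones would query a vector DB, a graph DB, etc.

def search_vectors(q):
    return [f"semantic hit for '{q}'"]

def search_graph(q):
    return [f"graph hit for '{q}'"]

def search_sql(q):
    return [f"table row for '{q}'"]

BACKENDS = {
    "unstructured": search_vectors,
    "graph": search_graph,
    "structured": search_sql,
}

def retrieve_all(query, kinds):
    """Fan the query out to the relevant backends and merge the 'needles'."""
    results = []
    for kind in kinds:
        results.extend(BACKENDS[kind](query))
    return results

print(retrieve_all("Q3 revenue", ["unstructured", "structured"]))
```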
Now, on the second part of your question: building this in-house.
It's true that you can build a quick demo pretty easily. Take a few documents, plug them into a large context window or vector DB, and set up a chatbot. You could do that in a few hours. Maybe a couple of days, tops.
But scaling that to production is a completely different story.
In a real organization, you've got dozens of data types: documents, numerical tables, semi-structured records, graph-based knowledge. You need to handle all of them.
You also need to enforce access control across every data source, to make sure users only see what they're allowed to see. Then there's hallucination mitigation, accuracy, traceability, all of which we can talk more about.
And on top of that, you have to keep up with the fast pace of model evolution. One week it's better table extraction from PDFs. Next week it's better OCR from images. If you're building this system yourself, you have to stay current on all of that, constantly.
That's exactly the problem Vectara solves for our customers.
We say: you build the intelligence and business logic that's unique to your company, and let us handle the heavy lifting on the system side. We'll stay up to date with the latest models, handle accuracy and hallucination issues, and keep your data secure against prompt attacks.
Ksenia:
I've spoken with a lot of machine learning companies that started in 2020 or 2022, and you started in 2022, if I'm not mistaken. That was before the ChatGPT moment, before this generative AI boom, before everyone started talking about RAG. What did the ChatGPT moment change for your company? Did you pivot?
Amr:
It's a good question. Honestly, it didn't change much for us, and that's because, as you mentioned in the intro, I used to work at Google.
At Google, they had a system called Meena, and it was already as good as ChatGPT in many ways. They had it two years earlier. It was amazing. It could hold a coherent conversation. Maybe not reasoning at today's level, but it could definitely pass a Turing Test-like interaction. So we knew generative capabilities were coming. That was clear.
From the very beginning at Vectara, our focus was retrieval:
How do you find the most relevant needles in the haystack, and how do you do that in a way that lets businesses define their own logic around it?
For example, maybe a document from Sue should always rank higher than one from Joe, because Joe sometimes gets things wrong. Or a document from last month should take priority over a PowerPoint from last year. That kind of business context was part of how we designed retrieval from day one.
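Business rules like "trust Sue over Joe" and "prefer recent documents" amount to a reranking step on top of base relevance. Here is a hypothetical sketch; the field names, trust weights, and the recency decay are all invented for illustration, not Vectara's actual scoring:

```python
from datetime import date

# Hypothetical reranker: base relevance adjusted by business rules.
AUTHOR_TRUST = {"sue": 1.0, "joe": 0.6}   # made-up trust weights

def rerank(candidates, today=date(2025, 7, 1)):
    def score(doc):
        # Recency decays roughly per month; trust defaults to a neutral 0.8.
        recency = 1.0 / (1 + (today - doc["date"]).days / 30)
        trust = AUTHOR_TRUST.get(doc["author"], 0.8)
        return doc["relevance"] * trust * recency
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"author": "joe", "relevance": 0.9, "date": date(2024, 7, 1)},  # old, less trusted
    {"author": "sue", "relevance": 0.8, "date": date(2025, 6, 1)},  # fresh, trusted
]
print(rerank(candidates)[0]["author"])
```

Note that Sue's slightly less relevant but fresher, more trusted document wins, which is exactly the "business context" being described.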
Then ChatGPT came out, and we simply plugged our retrieval engine behind the LLM. And that's when the hallucination problem became obvious.
Even when you ground the model, meaning even when you tell it "here's the source of truth, please answer based on this and don't make anything up," it still makes stuff up.
Things are improving, thankfully. When ChatGPT first launched, grounded hallucination rates were around 10-20%. Now, advanced models, including ours, are down to about 1%, which is amazing.
But still dangerous.
If you're using this for medical diagnoses, legal contracts, supply chain analysis, anti-money laundering, that 1% can be the difference between major success and complete disaster.
So that's where we focused. Back then, we set our sights on solving that problem.
And today, we've built one of the most successful hallucination detection models in the world. It's called the Hughes Hallucination Evaluation Model (HHEM), open source on Hugging Face, and it has over 4 million downloads today.
Ksenia:
That's incredible. A couple of months ago, I spoke with Sharon Zhou (ex-Lamini), who at that time was also working on hallucinations. She told me they had to do "surgery" on the models, as in, actually correcting the weights.
But your approach is different, right? You're verifying the output against the source?
Amr:
Yes and no. We actually do both approaches now. We don't build large language models from scratch ourselves; we're not funded at that level. As you know, training one of those models can cost $50-60 million.
Ksenia:
Or you can talk to the Chinese and do it much cheaper.
Amr:
And that's exactly what we did. We started with LLaMA, we were using LLaMA 2 at the time, and we fine-tuned it. So yes, we do modify the weights inside the model. That kind of fine-tuning helps reduce hallucinations.
The process is simple in principle: when the model gets an answer right, we give it a cookie. When it gets it wrong, we slap it on the face, punish it, in other words. Over time, it learns not to answer when it's unsure. That's what minimizes hallucinations.
So yes, we fine-tune, but here's the key: even after you do all that surgery on the model weights, hallucinations still happen. If you look at our hallucination leaderboard (just Google it, it's the top result), even the best models from OpenAI, Google, and others still hallucinate around 1%.
Which means you can't just depend on the model as-is, even after fine-tuning. You still need what we call a guardian angel, or more precisely, a guardian agent, that monitors the model's output and catches hallucinations when they happen.
That's the only way to truly scale trust. Let's say you're using an LLM for customer support. If the guardian agent detects no hallucinations, the response can go straight to the customer. If a hallucination is suspected, then the answer gets routed to a human reviewer before it's sent out. And that's the setup we believe leads to massive productivity gains: safe, reliable automation with a human safety net where it counts.
Ksenia:
Just to clarify: do you have a human in the loop or not?
Amr:
Yes, we advise our customers to include a human in the loop, depending on the use case. Whenever the hallucination risk is high, that's when the human should step in.
That said, about four weeks ago we released a hallucination correction model. It helps bridge the gap. So if our detection model flags an issue, we first pass it to the correction model. If it successfully fixes the hallucination, no human is needed.
But if the correction model fails, then you activate the human reviewer.
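The detect, auto-correct, escalate flow Amr just described can be written as a simple routing function. The detector and corrector below are stand-ins (the function names, the threshold, and the `[unsupported]` marker are invented for illustration; Vectara's actual models, such as HHEM, are separate trained models):

```python
# Hypothetical routing logic for a guardian agent:
# detect -> auto-correct if flagged -> escalate to a human only if
# correction also fails.

def route_response(answer, detect, correct, threshold=0.5):
    """Return (final_answer, path). detect() yields a hallucination score
    in [0, 1]; correct() returns a fixed answer, or None on failure."""
    if detect(answer) < threshold:
        return answer, "sent directly"
    fixed = correct(answer)
    if fixed is not None and detect(fixed) < threshold:
        return fixed, "auto-corrected"
    return answer, "human review"

# Stand-in detector/corrector: flags answers containing a marker for an
# unsupported claim, and "corrects" them by removing it.
detect = lambda a: 0.9 if "[unsupported]" in a else 0.1
correct = lambda a: a.replace(" [unsupported]", "") or None

print(route_response("Refunds take 5 days.", detect, correct))
print(route_response("Refunds take 5 days. [unsupported]", detect, correct))
```

The point of the structure: the human reviewer is the fallback of a fallback, so the expensive path is only taken when both detection and correction say it is needed.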
Ksenia:
You mentioned you open-sourced the model. I know you're a big proponent of open source. Why is that important to you, and how does it help your business actually make money?
Amr:
So first: open source doesn't help you make money directly because, well, you're giving something away for free.
But what open source does help with, and I've seen this firsthand at Cloudera, my previous company, is awareness in the developer ecosystem. There are a lot of developers who won't even try your product unless there's an open-source component.
So by releasing something open source, you reach far more developers. And later, that gives you an opportunity to build relationships with them or their organizations, and sell them your broader platform.
That said, we're not open-sourcing everything. At Cloudera, we made the entire platform open source, and that made it really hard to monetize. Some customers would just say, "Why pay you? We can just use it ourselves."
And then some of the big hyperscalers (I won't name names, but one of them starts with the letter A) would take the software, give you nothing in return, and you'd have no recourse.
So with Vectara, we're being more tactical. We haven't open-sourced the full SaaS platform for RAG; that stays proprietary.
But we did open source our hallucination detection model, our guardian agent, because that raises awareness. It shows the community that we're one of the leading companies solving this problem and delivering accurate answers.
Ksenia:
What are your next steps in working on hallucinations?
Amr:
The next big step is detecting hallucinations in multimodal content.
Right now, we focus mainly on text: extracting information and making sure the model's response is grounded in it. But hallucinations can also happen with tables, and that's really important. You need to verify that answers based on table data actually match what's in the cells.
Then there are diagrams, and with diagrams it's not enough to just OCR the words. You need to understand the visual structure to detect if the model is hallucinating.
And later on, we'll extend that to video and images as well.
Ksenia:
That sounds exciting. Multimodal models are definitely the next big thing.
I was reading your LinkedIn and really liked your post setting the record straight on who actually invented RAG. I'm passionate about history; we cover a lot of it at Turing Post.
So for the people watching and reading: who really did invent RAG?
Amr:
That's a great question, and the reason I wrote that post is because some startups and individuals today are claiming, "We invented RAG!"
And I'm like: no, this approach goes way back.
The roots of RAG are in the research community β pioneers in information retrieval (IR) and natural language understanding (NLU), going all the way back to the 1960s.
One early example was a system called BASEBALL. It showed this idea that, if you want reliable responses, you should mimic how humans operate:
First, retrieve the right facts; then reason based on those facts.
If you combine those steps, you get more accurate and more secure systems.
So yes, credit should go where it's due: to the IR and NLU researchers who have been refining these ideas for over 60 years.
Ksenia:
Sixty years. People tend to forget history.
Amr:
Exactly. And I like to remind them, and I can see you do too.
What's different today is that we now have LLMs that are very good at generation.
Back then, generation was more symbolic: stitching together parts of sentences, decision support, templates. Now with LLMs, we have fluent, context-aware generation.
Ksenia:
With models getting more multimodal and more complex, do you think it's possible to eliminate hallucinations completely?
Amr:
Not with the current transformer architecture.
Transformers and deep neural nets, the kind Geoffrey Hinton won his well-deserved Nobel Prize for, are inherently probabilistic.
And with probabilistic systems, there's always a chance of a false positive or a false negative. That's where hallucinations come from.
Over the past few years, we've made great progress; hallucination rates have dropped to around 1%. Maybe we'll get it down to 0.5%. But we seem to be hitting a plateau.
The good news is: if we can perfect hallucination detection, then the problem becomes manageable. You can imagine a system saying, "This sentence in this 20-page report might be off; have a human review it." If that happens only 0.5% of the time, it's something we can live with.
So that's what we're focused on: improving detection to the point where hallucination becomes a non-issue.
Until someone invents a brand-new architecture that isn't probabilistic, and doesn't hallucinate, we'll still be living in this space. Will that happen? I don't know. But I do know that as long as we're using transformers, hallucinations aren't going away.
Ksenia:
There are a bunch of new architectures emerging. Is there any particular one you're especially interested in?
Amr:
I haven't seen any architecture yet that isn't probabilistic in nature. The only real alternative would be symbolic approaches, which are more accurate, sure, but extremely hard to scale. They just don't work at the same level as LLMs.
Ksenia:
That's really interesting. There's still so much to research. Do you do any research yourself?
Amr:
Not personally, but my team does. At Vectara, we're about 60 people. Five of them are fully focused on machine learning research. Another 25 are building the platform: security, reliability, observability, and so on.
Our ML research is laser-focused on the guardian agents problem: how can we monitor the output of LLMs, not just for accuracy, but also for bias, toxicity, or potential harm?
For example, can we detect if a model's answer might offend someone? Can we prevent a model from taking an action that could cause real-world damage, especially when used in agents? That's the core of our research: building guardian agents that make AI safer, more reliable, and more aligned with human values.
Ksenia:
I love that metaphor: guardian agent. We've spent a lot of time in the weeds today, so let's zoom out a bit. Do you feel like you're helping to build AGI? And how would you define AGI?
Amr:
Wow, that's a tough one. As a community, we've been debating this for a while, and the definition has shifted, especially since ChatGPT launched.
To me, AGI means a system that's better than any human in any domain. That's the bar.
Today's models aren't there. They're better than the average person in some areas, but not better than the best.
Even in coding, AI might be top 10% globally, but it's still not better than the world's top coder. Same goes for law, medicine, quantum physics, architecture, art, poetry, dance... That's AGI, and I think we're five years away.
Ksenia:
Just five years?
Amr:
Yes, and here's why. Everyone right now is focused on coding. As soon as we get a system that's better than the best human coder, that model can start improving itself.
Once that happens, timelines accelerate fast. We expect coding models to hit that level within the next two years. And once they do, I believe that within another three years of nonstop, 24/7 refinement, they'll be able to master other domains too.
Ksenia:
Thank you. My final question is about books. I believe books shape people. Is there a book β or an idea β that you keep returning to? One thatβs shaped how you think about truth, systems, or intelligence?
Amr:
Iβll give you two β one fiction, one nonfiction.
The fiction book is the Foundation series by Isaac Asimov. It inspired me when I was a kid. And honestly, if you look at some of its predictions, we're living them now. I'm sure it's influenced a lot of us who work in this space today.
The nonfiction one is Sapiens by Yuval Noah Harari. There's a powerful idea in that book: what separates humans from all other creatures isn't just intelligence; it's our ability to believe in stories and fiction. That's what allows us to cooperate at scale, to build civilizations.
We believe in things that don't physically exist: nations, companies, laws, money.
One line from the book stuck with me:
If you give a monkey $100 and a banana, it'll take the banana every time. But we humans will take the $100, because we believe in the story behind it.
Ksenia:
That's fascinating. Thank you so much for this conversation; it was incredibly insightful.
Amr:
Same here, I really enjoyed it. Looking forward to doing another one in the future.
