🎙The Truth About RAG
An Inference with Amr Awadallah, founder & CEO of Vectara
Great news! In only 2.5 months we reached 1,000 subscribers on YouTube. Please subscribe as well, or listen to our interviews as podcasts on Spotify or Apple.
What is actually happening with retrieval-augmented generation (RAG) in 2025? In this episode of Inference, I sit down with Amr Awadallah – founder & CEO of Vectara, founder of Cloudera, ex-Google Cloud, and the original builder of Yahoo’s data platform – to unpack it all. We get into why RAG is far from dead, how context windows mislead more than they help, the difference between RAG and fine-tuning, and what it really takes to separate reasoning from memory.
Amr breaks down the case for retrieval with access control, the rise of hallucination detection models, and why DIY RAG stacks fall apart in production. We also talk about the roots of RAG, Amr’s take on AGI timelines and what science fiction taught him about the future. This interview is packed with insights.
If you care about truth in AI, or you're building with (or around) LLMs, this one will reshape how you think about trustworthy systems. Watch it now →
This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription – do it here.
The transcript (edited for clarity, brevity, and sanity). All supporting links are in the YouTube description, just below the video. Check it out:
Ksenia Se:
Hi Amr, welcome to Inference by Turing Post. Very happy to see you here.
You built large-scale data systems at Yahoo, helped launch the data platform era at Cloudera, and led developer strategy at Google Cloud. Now, at Vectara, you're tackling something deeper: the layer of trust – the foundation of truth for AI.
So my question is: as more models come out with bigger context windows, some people are starting to say RAG is dead.
Is it? What do people misunderstand about RAG?
Amr Awadallah:
Even the largest companies – Google, Amazon, Microsoft – they’re all still leveraging RAG, even with these larger context windows.
The larger window just means we can give the language model more information in the prompt. But we still have to pick what to give it. If we just stuff that window full of information – some of it irrelevant or noisy – the model actually struggles. It can hallucinate more. It might not find the right facts buried inside all that input.
These models are pretty good at what's called single needle retrieval – they can find one, maybe two useful pieces of information in a haystack. But they’re not great at finding multiple relevant facts all at once. They also tend to pay more attention to content at the beginning or end of the window, and less to the middle.
RAG, on the other hand, separates memory from reasoning. Memory is your knowledge; reasoning is your intelligence. It’s how humans work too. Take a high school exam: if it's open book and you have an assistant who highlights exactly the right sentence to focus on, you'll perform way better. Same with language models. If we add a smart retrieval layer that surfaces the most relevant “needles,” the model can reason more effectively with that curated input.
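For readers who want to see the shape of this, here is a minimal sketch of the retrieve-then-reason pattern Amr describes. The `retrieve` and `llm` callables are hypothetical stand-ins for whatever retrieval layer and model you use, not Vectara's API:

```python
# Minimal retrieval-augmented prompt assembly.
# `retrieve` and `llm` are hypothetical helpers, not a specific vendor API.

def answer(question: str, retrieve, llm, k: int = 4) -> str:
    # 1. The retrieval layer surfaces only the most relevant "needles".
    passages = retrieve(question, top_k=k)   # -> list[str]

    # 2. Only that curated context goes into the prompt -- not the whole corpus.
    context = "\n\n".join(passages)
    prompt = (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. The model reasons over curated input instead of a stuffed window.
    return llm(prompt)
```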
And there’s another big benefit: security.
If you dump everything into the context window, you can't control what users might access. They might use prompt injection attacks to retrieve things they shouldn’t. But with a retrieval system in front, you can filter and mask sensitive information. No matter how someone phrases a prompt, they won’t be able to access what they're not allowed to see.
Ksenia:
Is RAG the main component in building trust in AI – or are there others?
Amr:
Yes, RAG – with access control – is definitely one of the key components. Retrieval with proper access control is one of the best ways to mitigate against information leakage, especially from prompt injection attacks.
Compare that to fine-tuning. If you take a large language model and just stick all your information into it – whether through a large context window or actual fine-tuning – then someone could eventually extract that information with a cleverly structured prompt. With RAG, that’s not possible, because the information never even reaches the model unless it’s relevant. The retrieval layer filters it before it gets to the LLM. Make sense?
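A minimal sketch of what "retrieval with access control" can look like in practice. The `Doc` fields and the keyword-overlap scoring below are illustrative placeholders, not any specific product's implementation:

```python
# Sketch: enforce access control in the retrieval layer, before anything
# reaches the LLM. Field names and the toy scorer are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Doc:
    text: str
    allowed_groups: set = field(default_factory=set)

def retrieve_for_user(query: str, user_groups: set, index: list, top_k: int = 4):
    # The permission filter runs FIRST: documents the user cannot see never
    # become candidates, so no prompt phrasing can surface them.
    visible = [d for d in index if d.allowed_groups & user_groups]
    # Toy relevance score (keyword overlap) standing in for semantic search.
    terms = set(query.lower().split())
    scored = sorted(visible,
                    key=lambda d: -len(terms & set(d.text.lower().split())))
    return scored[:top_k]
```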
Ksenia:
Absolutely. So then – why do people keep saying RAG is dead?
Amr:
Honestly, nobody serious in the industry says that. Maybe some reporters or analysts, but anyone actually deploying AI over their company’s data – they’re using RAG.
All the major players – Microsoft, Google, Amazon, even OpenAI – they’re doing it that way. Just a few weeks ago, OpenAI launched something called Open Connectors. That lets you connect your ChatGPT session directly to your data. That’s RAG. They're not shoving all your data into the model or cramming it into the context window. No – they retrieve the right nuggets on demand.
Let’s say you ask a question, and the answer is in your Google Drive. RAG fetches the three or four relevant documents and feeds just that into the prompt.
So RAG is absolutely a foundational architecture – and it will be with us for many, many years. I don’t see it going away. I’d challenge anyone who says otherwise.
There’s also a big performance angle here. Putting a lot into the context window is expensive. The compute cost scales with what we call order N squared – meaning, if you double the number of words, you quadruple the cost.
Smart retrieval systems, on the other hand, are sublinear – around log N. So they’re way more efficient, both in compute cost and in latency. They’re faster, cheaper, and more targeted.
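A toy illustration of those asymptotics, ignoring constants and real-world details:

```python
# Back-of-the-envelope scaling: self-attention cost grows roughly with N^2
# in prompt length, while a good retrieval index is roughly logarithmic.
import math

for n in (1_000, 10_000, 100_000):
    attention_cost = n ** 2          # up to constant factors
    retrieval_cost = math.log2(n)    # up to constant factors
    print(f"N={n:>7}: attention ~{attention_cost:.0e}, retrieval ~{retrieval_cost:.0f}")
```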
So again – if you look at any serious, production-grade system that’s reasoning over data, it’s using RAG.
Ksenia:
Okay, okay. Next time I hear an ML person saying it’s dead, I’ll send them straight to you!
Amr:
Yes, please.
Ksenia:
A lot of teams think that building a RAG system just means plugging into a vector database. As I understand it, you’ve introduced a lot more – metadata, updates, citation tracking, scoring, observability.
So where do DIY or in-house systems fall short in your opinion?
Amr:
There are actually two questions here.
First – is retrieval only about vector systems? The answer is no. If it were, we’d be calling it VAG – vector-augmented generation – not RAG. The “R” stands for retrieval, and retrieval can happen in many ways.
Vector databases are great for semantic data. If you're searching over unstructured documents and doing semantic matching, then yes – vectors and vector DBs are very effective.
But if you're working with knowledge graphs or instruction-based data, something like Neo4j would be more appropriate.
If you’re dealing with semi-structured documents, MongoDB might be a better fit.
And for traditional structured or numerical data, Snowflake or Oracle would make more sense.
So RAG is really about combining all these sources – and retrieving the most relevant “needles” from across them, based on the task at hand. Vector DBs are just one part of that picture.
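A hedged sketch of that "more than vectors" idea: route a query to several stores and merge the best needles. The retriever callables are placeholders for whatever vector, graph, or SQL backends you plug in:

```python
# Hybrid retrieval across heterogeneous sources (placeholder retrievers,
# not a specific Vectara or vendor API).

def hybrid_retrieve(query: str, retrievers: dict, top_k: int = 5) -> list:
    """retrievers maps a source name ('vector', 'graph', 'sql', ...) to a
    callable returning scored (text, score) pairs for the query."""
    candidates = []
    for source, search in retrievers.items():
        for text, score in search(query):
            candidates.append((score, source, text))
    # Merge across sources and keep the best needles overall.
    candidates.sort(reverse=True)
    return candidates[:top_k]
```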
Now, on the second part of your question – building this in-house.
It’s true that you can build a quick demo pretty easily. Take a few documents, plug them into a large context window or vector DB, and set up a chatbot. You could do that in a few hours. Maybe a couple of days, tops.
But scaling that to production is a completely different story.
In a real organization, you’ve got dozens of data types: documents, numerical tables, semi-structured records, graph-based knowledge. You need to handle all of them.
You also need to enforce access control across every data source – to make sure users only see what they’re allowed to see. Then there’s hallucination mitigation, accuracy, traceability – all of which we can talk more about.
And on top of that, you have to keep up with the fast pace of model evolution. One week it's better table extraction from PDFs. Next week it's better OCR from images. If you're building this system yourself, you have to stay current on all of that – constantly.
That’s exactly the problem Vectara solves for our customers.
We say: you build the intelligence and business logic that’s unique to your company – and let us handle the heavy lifting on the system side. We’ll stay up to date with the latest models, handle accuracy and hallucination issues, and keep your data secure against prompt attacks.
Ksenia:
I’ve spoken with a lot of machine learning companies that started in 2020 or 2022 – and you started in 2022, if I’m not mistaken. That was before the ChatGPT moment, before this generative AI boom, before everyone started talking about RAG. What did the ChatGPT moment change for your company? Did you pivot?
Amr:
It’s a good question. Honestly, it didn’t change much for us – and that’s because, as you mentioned in the intro, I used to work at Google.
At Google, they had a system called Meena, and it was already as good as ChatGPT in many ways – and they had it two years earlier. It was amazing. It could hold a coherent conversation. Maybe not reasoning at today’s level, but it could definitely hold its own in a Turing Test-like interaction. So we knew generative capabilities were coming. That was clear.
From the very beginning at Vectara, our focus was retrieval:
How do you find the most relevant needles in the haystack – and how do you do that in a way that lets businesses define their own logic around it?
For example, maybe a document from Sue should always rank higher than one from Joe – because Joe sometimes gets things wrong. Or a document from last month should take priority over a PowerPoint from last year. That kind of business context was part of how we designed retrieval from day one.
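Here is an illustrative sketch of that kind of business-context reranking. The author boosts and recency bonus are made-up numbers mirroring the Sue/Joe example, not Vectara's actual logic:

```python
# Business-rule reranking layered on top of semantic scores.
# Boost values below are hypothetical, for illustration only.
from datetime import datetime, timedelta

AUTHOR_BOOST = {"sue": 0.2, "joe": -0.2}   # hypothetical per-author priors

def rerank(hits, now=None):
    """hits: list of dicts with 'text', 'score', 'author', 'modified' keys."""
    now = now or datetime.utcnow()

    def adjusted(h):
        score = h["score"] + AUTHOR_BOOST.get(h["author"].lower(), 0.0)
        if now - h["modified"] < timedelta(days=30):
            score += 0.1                    # prefer fresh documents
        return score

    return sorted(hits, key=adjusted, reverse=True)
```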
Then ChatGPT came out – and we simply plugged our retrieval engine behind the LLM. And that’s when the hallucination problem became obvious.
Even when you ground the model – meaning, even when you tell it “here’s the source of truth, please answer based on this and don’t make anything up” – it still makes stuff up.
Things are improving, thankfully. When ChatGPT first launched, grounded hallucination rates were around 10–20%. Now, advanced models – including ours – are down to about 1%, which is amazing.
But still dangerous.
If you’re using this for medical diagnoses, legal contracts, supply chain analysis, anti-money laundering – that 1% can be the difference between major success and complete disaster.
So that’s where we focused. Back then, we set our sights on solving that problem.
And today, we’ve built one of the most successful hallucination detection models in the world. It’s called the Hughes Hallucination Evaluation Model (HHEM) – open source on Hugging Face – and it has over 4 million downloads today.
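HHEM is published on Hugging Face as vectara/hallucination_evaluation_model. A minimal usage sketch, following the pattern shown on the model card at the time of writing (check the card for the current API):

```python
# Hedged sketch: score whether an answer is grounded in its source using
# Vectara's open-source HHEM. The predict() call mirrors the model card's
# example; the exact API may change between HHEM versions.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

pairs = [
    # (source passage, model answer): a score near 1.0 means the answer is
    # supported by the source; near 0.0 suggests a hallucination.
    ("The policy covers water damage up to $5,000.",
     "The policy covers water damage up to $50,000."),
]
scores = model.predict(pairs)
print(scores)
```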
Ksenia:
That’s incredible. A couple of months ago, I spoke with Sharon Zhou (ex-Lamini), who was also working on hallucinations at the time. She told me they had to do “surgery” on the models – as in, actually correcting the weights.
But your approach is different, right? You’re verifying the output against the source?
Amr:
Yes and no. We actually do both approaches now. We don’t build large language models from scratch ourselves – we’re not funded at that level. As you know, training one of those models can cost $50–60 million.
Ksenia:
Or you can talk to the Chinese and do it much cheaper.
Amr:
And that’s exactly what we did. We started with LLaMA – we were using LLaMA 2 at the time – and we fine-tuned it. So yes, we do modify the weights inside the model. That kind of fine-tuning helps reduce hallucinations.
The process is simple in principle: when the model gets an answer right, we give it a cookie. When it gets it wrong, we slap it on the face – punish it, in other words. Over time, it learns not to answer when it’s unsure. That’s what minimizes hallucinations.
So yes, we fine-tune – but here’s the key: even after you do all that surgery on the model weights, hallucinations still happen. If you look at our hallucination leaderboard – just Google it, it’s the top result – even the best models from OpenAI, Google, and others still hallucinate around 1%.
Which means you can’t just depend on the model as-is, even after fine-tuning. You still need what we call a guardian angel – or more precisely, a guardian agent – that monitors the model’s output and catches hallucinations when they happen.
That’s the only way to truly scale trust. Let’s say you’re using an LLM for customer support. If the guardian agent detects no hallucinations, the response can go straight to the customer. If a hallucination is suspected, then the answer gets routed to a human reviewer before it’s sent out. And that’s the setup we believe leads to massive productivity gains – safe, reliable automation with a human safety net where it counts.
Ksenia:
Just to clarify – do you have a human in the loop or not?
Amr:
Yes – we advise our customers to include a human in the loop, depending on the use case. Whenever the hallucination risk is high, that’s when the human should step in.
That said, about four weeks ago we released a hallucination correction model. It helps bridge the gap. So if our detection model flags an issue, we first pass it to the correction model. If it successfully fixes the hallucination, no human is needed.
But if the correction model fails, then you activate the human reviewer.
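A simplified sketch of that detect → correct → human flow. Here `detect` and `correct` are placeholders for whatever detection and correction models you deploy, and the 0.5 threshold is purely illustrative:

```python
# Guardian-agent routing: detect, try to correct, escalate only on failure.
# detect(sources, answer) -> groundedness score in [0, 1] (placeholder).
# correct(sources, answer) -> fixed answer or None (placeholder).

THRESHOLD = 0.5  # scores below this are treated as hallucinations (illustrative)

def guarded_answer(sources: str, draft: str, detect, correct):
    """Returns (answer, needs_human_review)."""
    if detect(sources, draft) >= THRESHOLD:
        return draft, False                 # grounded: send straight out
    fixed = correct(sources, draft)         # try automatic correction first
    if fixed is not None and detect(sources, fixed) >= THRESHOLD:
        return fixed, False                 # correction succeeded
    return draft, True                      # escalate to a human reviewer
```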
Ksenia:
You mentioned you open-sourced the model. I know you’re a big proponent of open source. Why is that important to you – and how does it help your business actually make money?
Amr:
So first – open source doesn’t help you make money directly because, well, you’re giving something away for free.
But what open source does help with – and I’ve seen this firsthand at Cloudera, my previous company – is awareness in the developer ecosystem. There are a lot of developers who won’t even try your product unless there’s an open-source component.
So by releasing something open source, you reach far more developers. And later, that gives you an opportunity to build relationships with them or their organizations – and sell them your broader platform.
That said, we’re not open-sourcing everything. At Cloudera, we made the entire platform open source – and that made it really hard to monetize. Some customers would just say, “Why pay you? We can just use it ourselves.”
And then some of the big hyperscalers – I won’t name names, but one of them starts with the letter A – would take the software, give you nothing in return, and you’d have no recourse.
So with Vectara, we’re being more tactical. We haven’t open-sourced the full SaaS platform for RAG – that stays proprietary.
But we did open source our hallucination detection model – our guardian agent – because that raises awareness. It shows the community that we’re one of the leading companies solving this problem and delivering accurate answers.
Ksenia:
What are your next steps in working on hallucinations?
Amr:
The next big step is detecting hallucinations in multimodal content.
Right now, we focus mainly on text – extracting information and making sure the model’s response is grounded in it. But hallucinations can also happen with tables – and that’s really important. You need to verify that answers based on table data actually match what’s in the cells.
Then there are diagrams – and with diagrams, it’s not enough to just OCR the words. You need to understand the visual structure to detect if the model is hallucinating.
And later on, we’ll extend that to video and images as well.
Ksenia:
That sounds exciting. Multimodal models are definitely the next big thing.
I was reading your LinkedIn and really liked your post setting the record straight on who actually invented RAG. I’m passionate about history – we cover a lot of it at Turing Post.
So for the people watching and reading – who really did invent RAG?
Amr:
That’s a great question – and the reason I wrote that post is because some startups and individuals today are claiming, “We invented RAG!”
And I’m like – no, this approach goes way back.
The roots of RAG are in the research community – pioneers in information retrieval (IR) and natural language understanding (NLU), going all the way back to the 1960s.
One early example was a system called BASEBALL. It showed this idea that, if you want reliable responses, you should mimic how humans operate:
First, retrieve the right facts – then reason based on those facts.
If you combine those steps, you get more accurate and more secure systems.
So yes, credit should go where it’s due – to the IR and NLU researchers who have been refining these ideas for over 60 years.
Ksenia:
Sixty years – people tend to forget history.
Amr:
Exactly. And I like to remind them – and I can see you do too.
What’s different today is that we now have LLMs that are very good at generation.
Back then, generation was more symbolic – stitching together parts of sentences, decision support, templates. Now with LLMs, we have fluent, context-aware generation.
Ksenia:
With models getting more multimodal and more complex – do you think it’s possible to eliminate hallucinations completely?
Amr:
Not with the current transformer architecture.
Transformers and deep neural nets – the kind Geoffrey Hinton won his well-deserved Nobel Prize for – are inherently probabilistic.
And with probabilistic systems, there’s always a chance of a false positive or a false negative. That’s where hallucinations come from.
Over the past few years, we’ve made great progress – hallucination rates have dropped to around 1%. Maybe we’ll get it down to 0.5%. But we seem to be hitting a plateau.
The good news is: if we can perfect hallucination detection, then the problem becomes manageable. You can imagine a system saying, “This sentence in this 20-page report might be off – have a human review it.” If that happens only 0.5% of the time, it’s something we can live with.
So that’s what we’re focused on – improving detection to the point where hallucination becomes a non-issue.
Until someone invents a brand-new architecture that isn’t probabilistic – and doesn’t hallucinate – we’ll still be living in this space. Will that happen? I don’t know. But I do know that as long as we’re using transformers, hallucinations aren’t going away.
Ksenia:
There are a bunch of new architectures emerging. Is there any particular one you're especially interested in?
Amr:
I haven’t seen any architecture yet that isn’t probabilistic in nature. The only real alternative would be symbolic approaches – which are more accurate, sure, but extremely hard to scale. They just don’t work at the same level as LLMs.
Ksenia:
That’s really interesting. There’s still so much to research. Do you do any research yourself?
Amr:
Not personally – but my team does. At Vectara, we’re about 60 people. Five of them are fully focused on machine learning research. Another 25 are building the platform – security, reliability, observability, and so on.
Our ML research is laser-focused on the guardian agents problem: How can we monitor the output of LLMs – not just for accuracy, but also for bias, toxicity, or potential harm?
For example, can we detect if a model’s answer might offend someone? Can we prevent a model from taking an action that could cause real-world damage – especially when used in agents? That’s the core of our research: building guardian agents that make AI safer, more reliable, and more aligned with human values.
Ksenia:
I love that metaphor – guardian agent. We’ve spent a lot of time in the weeds today, so let’s zoom out a bit. Do you feel like you’re helping to build AGI? And how would you define AGI?
Amr:
Wow – that’s a tough one. As a community, we’ve been debating this for a while, and the definition has shifted – especially since ChatGPT launched.
To me, AGI means a system that’s better than any human in any domain. That’s the bar.
Today’s models aren’t there. They’re better than the average person in some areas, but not better than the best.
Even in coding – AI might be top 10% globally, but it’s still not better than the world’s top coder. Same goes for law, medicine, quantum physics, architecture, art, poetry, dance... That’s AGI – and I think we’re five years away.
Ksenia:
Just five years?
Amr:
Yes – and here’s why. Everyone right now is focused on coding. As soon as we get a system that’s better than the best human coder, that model can start improving itself.
Once that happens, timelines accelerate fast. We expect coding models to hit that level within the next two years. And once they do, I believe that within another three years – nonstop, 24/7 refinement – they’ll be able to master other domains too.
Ksenia:
Thank you. My final question is about books. I believe books shape people. Is there a book – or an idea – that you keep returning to? One that’s shaped how you think about truth, systems, or intelligence?
Amr:
I’ll give you two – one fiction, one nonfiction.
The fiction book is the Foundation series by Isaac Asimov. It inspired me when I was a kid. And honestly, if you look at some of its predictions – we’re living them now. I’m sure it’s influenced a lot of us who work in this space today.
The nonfiction one is Sapiens by Yuval Noah Harari. There’s a powerful idea in that book: what separates humans from all other creatures isn’t just intelligence – it’s our ability to believe in stories and fiction. That’s what allows us to cooperate at scale, to build civilizations.
We believe in things that don’t physically exist – nations, companies, laws, money.
One line from the book stuck with me:
If you give a monkey $100 and a banana, it’ll take the banana every time. But we humans – we’ll take the $100, because we believe in the story behind it.
Ksenia:
That’s fascinating. Thank you so much for this conversation – it was incredibly insightful.
Amr:
Same here – I really enjoyed it. Looking forward to doing another one in the future.
Do leave a comment.