🎙️When Will We Train a Model Once and Let it Learn Forever
An Inference with Devvret Rishi, CEO and co-founder of Predibase
Hi everyone – hope the weekend's treating you well. Turing Post now has a proper YouTube channel – ‘Inference’, and you can listen to our podcast on all major platforms: YouTube, Spotify or Apple Podcasts.
What it actually takes to build models that improve over time. In this episode, I sit down with Devvret Rishi, CEO and co-founder of Predibase – the low-code platform built to productionize custom LLMs – to talk about the shift from static models to continuous learning loops, the rise of reinforcement fine-tuning (RFT), and why the real future of enterprise AI isn’t chatty generalists – it’s focused, specialized agents that get the job done.
We cover:
The real meaning behind "train once, learn forever"
How RFT works (and why it might replace traditional fine-tuning)
What makes inference so hard in production – and why Intelligent Inference might be the next big thing
Gaps in the open-source AI stack
How companies actually evaluate LLMs
Devvret’s take on agentic workflows, AGI, and the road ahead
I truly enjoyed this conversation and highly recommend watching it →
This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription – do it here.
The transcript (edited for clarity, brevity, and sanity. Always better to watch the full video though) ⬇️
Ksenia: Devvret, thank you for joining me today. Let's start with a big picture. When will we train once and learn forever?
Devvret Rishi: It's a great question. That world is actually here today. Most of the time, when customers use models in production, they're taking a model where someone else has done 99% of the heavy lifting, and then they're doing a last-mile 1% customization. The trend towards "train once and learn forever" will be a shift where people stop using a static model someone else trained and instead have a pipeline that allows them to improve the model continuously while it's in production. Some of our early customers are already putting these types of pipelines in practice, and that's what I'm most excited to build towards.
Ksenia: Predibase introduced the post-training technique RFT. Can you tell us more about it? Is it a big unlock or another tuning trick?
Devvret: It's a great question. I think we were the first end-to-end platform to offer reinforcement fine-tuning (RFT), a couple of months ago.
Ksenia: Sorry to interrupt you, can you also briefly explain what RFT is?
Devvret: Reinforcement fine-tuning takes a different approach to customizing models. The underlying intuition is that rather than needing large amounts of labeled data, which is what traditional supervised fine-tuning requires, you can fine-tune with really small quantities of data – think a dozen examples or so. And instead of labeled data, you add this concept of reward functions. Reward functions are essentially rubrics that any individual customer can write to grade a model's output. The idea is that the model learns to adapt its behavior towards the things you want to incentivize or reward. As an example, if you're teaching a model how to write code, you might write a reward function that says, "you get five points if the formatting is correct, another 10 points if it compiles, and another 20 points if the unit tests pass." In this way, you teach a model to generate its outputs against really objective criteria. The goal with reinforcement fine-tuning is: if you can measure it, you can improve it. That's really the key aspect of what we launched.
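The rubric in that example can be sketched as a simple reward function. This is a hypothetical illustration in Python, not Predibase's API; the check helpers are stand-ins (here "compiles" means the Python parses, and the "unit test" is a single toy assertion):

```python
# Hypothetical reward function in the spirit of the rubric described above:
# grade a model's code output against objective criteria.

def check_formatting(code: str) -> bool:
    # Stand-in: "well formatted" means no tabs and no trailing whitespace.
    return all("\t" not in line and not line.endswith(" ")
               for line in code.splitlines())

def check_compiles(code: str) -> bool:
    # For Python, "compiles" can mean it parses without a SyntaxError.
    try:
        compile(code, "<model_output>", "exec")
        return True
    except SyntaxError:
        return False

def check_tests_pass(code: str) -> bool:
    # Stand-in for running a unit-test suite against the output.
    namespace: dict = {}
    try:
        exec(compile(code, "<model_output>", "exec"), namespace)
        return namespace["add"](2, 3) == 5  # one toy "unit test"
    except Exception:
        return False

def reward(code: str) -> int:
    """Rubric from the interview: +5 formatting, +10 compiles, +20 tests pass."""
    score = 0
    if check_formatting(code):
        score += 5
    if check_compiles(code):
        score += 10
    if check_tests_pass(code):
        score += 20
    return score

good = "def add(a, b):\n    return a + b\n"
broken = "def add(a, b)\n    return a + b\n"  # missing colon
print(reward(good), reward(broken))  # → 35 5
```

During RFT, scores like these are what the training loop uses to nudge the model towards outputs that earn higher rewards.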
Regarding whether it's a large shift or just another tool: today, RFT is one of a few techniques that is really helpful for customers starting to tune their models. But I think where reinforcement fine-tuning is going will fundamentally shift the way people customize models. Today it's a one-off training process. Where I see RFT really going is becoming part of a continuous feedback loop, where models get better online. And I think that is going to be a paradigm shift in how customers tune and train models.
Ksenia: Do you see companies already implementing this feedback loop?
Devvret: The very cutting edge, yes. I'm working with a couple of companies in the healthcare domain that are building co-pilots and assistants for their end patients. They have a lot of interaction data with those patients, and right now they're bringing in a combination of LLMs as judges to verify how good a conversation was, plus clinicians who label different conversations, and feeding that in as continuous feedback. So rather than needing months of labeling, they're starting with the labeling that can be done over a handful of conversations and feeding that into an RFT loop. Today it's very early, and I only see the most cutting-edge companies really doing it, but this is the type of pipeline I think more and more companies will build as we move towards continuously improving models.
Ksenia: What's needed for them to start doing it?
Devvret: There are actually two things, I think. One is a feature we just launched in Predibase, a very simple thing that lets you automatically collect prompts and responses from any of your production deployments. One of the key things with tuning and training models always tends to be data, so the very first feature makes it easier to construct these datasets using live production traffic. The second is making it easier to learn from feedback. Rather than needing large amounts of labeled data, how do you directly nudge a model with small amounts of feedback? There are lots of techniques in the platform, like DPO (Direct Preference Optimization) and others, that help you learn from feedback. But we're also working on some novel techniques, more on the research side of Predibase today, for how you use reinforcement fine-tuning with user feedback data.
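To make the second piece concrete, here is a hypothetical sketch of turning logged production traffic with thumbs-up/thumbs-down feedback into the (prompt, chosen, rejected) preference pairs that DPO trains on. The log schema and example rows are invented for illustration; this is not Predibase's actual pipeline or data format:

```python
# Hypothetical sketch: build DPO-style preference pairs from logged
# production traffic with user feedback. Schema and data are invented.
from collections import defaultdict

logs = [
    {"prompt": "Summarize the claim", "response": "Short, accurate summary.", "feedback": "up"},
    {"prompt": "Summarize the claim", "response": "Rambling, off-topic text.", "feedback": "down"},
    {"prompt": "Extract the due date", "response": "2024-07-01", "feedback": "up"},
]

def build_preference_pairs(logs):
    """Pair a liked and a disliked response to the same prompt:
    DPO learns from (prompt, chosen, rejected) triples."""
    by_prompt = defaultdict(lambda: {"up": [], "down": []})
    for row in logs:
        by_prompt[row["prompt"]][row["feedback"]].append(row["response"])
    pairs = []
    for prompt, fb in by_prompt.items():
        for chosen in fb["up"]:
            for rejected in fb["down"]:
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

pairs = build_preference_pairs(logs)
print(pairs)
```

Only prompts that received both positive and negative feedback yield a pair here; the "Extract the due date" row has no rejected counterpart, so it contributes nothing – one reason small-feedback techniques matter.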
Ksenia: Is anything coming soon?
Devvret: I think you can expect us to publish a little more about this in the next few months. What we'll be talking about is what we've seen with real-life agentic applications in particular, and the type of feedback users are looking to give – how do you systemize that into a continuous stream? The very first thing we release will show how small amounts of feedback can drive performance gains, and how those gains scale with the amount of feedback you get.
Ksenia: What's your perspective on agentic workflows and agents in general?
Devvret: My perspective on agentic workflows is that they're in the very early innings, so a lot of what people are building is a little brittle today. The very first thing we need to do when we talk about agent workflows is define what an agent or agentic workflow is. I think about it as having two key components. The first is a chain of multiple LLM calls. If you think about a single call – say, document classification – we don't think of that as an agentic workflow, because it's a one-shot process. Whereas if you're having a conversation with a bot, for example to understand your medical diagnosis a little better and schedule a follow-up, that involves multiple turns. So the first piece is that it's multi-turn, multi-call. The second piece is that the agent can likely do tool calling – make calls to other functions it can use to fulfill a request on behalf of the user.
My view is that right now the way these agents get built is quite brittle. A lot of times people have built a really compelling demo that works well if you stay on the golden path. But if you go off the golden path, the model starts to fail in a number of different areas. The arithmetic is simple: if you're only 90% accurate on a given LLM call and your workflow chains five of them, you're already down to roughly 59% end to end, and under 50% within a couple more calls. So that last mile of quality with agentic applications becomes really, really important. I saw this when I was a product manager on Google Assistant back in the day, which was one of the first AI agents – not using generative AI, but more classical NLP methods. The bridge we need to make as an industry is getting to more robust agentic workflows.
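The compounding effect he describes can be checked directly: with 90% per-call accuracy and independent failures, end-to-end reliability decays geometrically, landing near 59% at five chained calls and dipping below 50% by seven.

```python
# End-to-end success rate of a chain of LLM calls, assuming each call
# succeeds independently with the same per-call accuracy.
def chain_accuracy(per_call: float, n_calls: int) -> float:
    return per_call ** n_calls

for n in (1, 3, 5, 7):
    print(f"{n} calls at 90% each -> {chain_accuracy(0.9, n):.0%} end to end")
```

The independence assumption is a simplification, but it captures why last-mile per-call quality matters so much in multi-step agents.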
Ksenia: You're a product-first founder in a very research-heavy space. How do you keep up, and what areas of research are you following most closely?
Devvret: I really enjoy it, actually. My background is an undergrad and master's in computer science, and I did some initial research in CS as well – before, as I like to say, I sold out and became a product manager. My real interest in product was that I saw it have the most cross-functional impact. In research, you usually go very deep into an individual space, whereas in product you get to see how research combines with engineering and design. Even at Google, though, I was working closely with research teams, so I've always felt very comfortable working in areas that are fundamentally based on core research that's still developing. This space moves much faster than anything else we've seen – much faster than the shifts to cloud or mobile. The pace here is truly that you'll have a breakthrough, on expectation, about every week. When we were planning our reinforcement fine-tuning launch, one of the most stressful things was not knowing whether OpenAI or Anthropic or Mistral or DeepSeek or Google, Amazon, Meta would put out something massive that same week and suck all the oxygen out of the room. So it is dizzying to keep up, but some of the core background and research I had helps a lot in understanding the core techniques people are employing here.
Ksenia: I remember working with a few ML companies before the ChatGPT moment. When it happened, so many needed to pivot. What changes did you make at Predibase when everything exploded?
Devvret: It was a really interesting moment for us, because when we started the company, our mission was to democratize deep learning. We had built interfaces, a product, and infrastructure to make it easy for people to train their own deep learning models. The way I like to reflect on it is that at the end of 2022, OpenAI democratized deep learning more than any of us – through large pre-trained deep learning models that could hold a conversation in one shot. The reason I say it was an interesting position for us is that, as a deep learning-oriented company, we had already started to see these types of workflows become popular in our platform, but at a much smaller scale. In early 2022, the most popular piece of functionality on our platform was that you could pick a pre-trained deep learning model, like BERT, and fine-tune it on your data.
So people were already coming to us because they wanted to adopt some of these pre-trained transformers and adapt them to their data. But the use cases, the user journey, and the persona fundamentally shifted. In 2021 and 2022, you had to be an NLP engineer – someone who understood the intricacies of a BERT or a T5, or in computer vision an image model like ViT – to really get value from the platform. Fast forward to today, and some of the most prolific AI engineers are ones who got started in the field just a year or two ago. Depending on how you think about it – I often think of it as a complete pivot – it was a real focusing of our product: rather than helping you build deep learning models in general, we decided at the beginning of 2023 to make the bet on just this one technology, large language models. And we made the bet that the future of LLMs in production was going to be specialized and customized, so we wanted to build the tuning and post-training stack to make that happen. Then we quickly found out that inference was going to be a huge part of this game, so we went really big on that later in the year as well.
Ksenia: Regarding more narrow AI, what is the future for companies? Will they use smaller models for specific cases?
Devvret: I genuinely think the entire pie of AI use cases is going to grow. So if you ask me today versus in 2027, you're going to see more use cases of every breed. The difference is this – my favorite customer quote is "generalized intelligence is great, but I don't need my point-of-sale system to recite French poetry." Most customer use cases look like what one of our customers, Checkr, does: they go through employee background checks to extract very specific information about criminal codes and violations, prior records, employment history. It's not that they need the model to write French poetry and Python code; they need it to do a really high-quality job on one particular set of tasks. So in enterprise, I think the majority of use cases have already started to shift towards these narrow, automation-oriented use cases. That's not to say general-purpose agents you can talk to about anything in the company won't exist and won't make great demos, but the high-value use cases are going to look like a really prescriptive, specialized agent that knows how to solve a series of tasks very well.
And in enterprise, I think that's going to be true. In consumer, it's a little harder to say, actually – versatility is very helpful there. But regardless, one thing we've seen – it wasn't obvious in 2023, but I think it's obvious now – is that we aren't going to live in a world where one model rules them all. You're going to see a mix of open-source models, closed commercial models, different parameter sizes. And just like any software tool, people are going to choose the best tool for their individual task. Our thesis is that the types of tasks that require narrow AI are going to grow at an even faster rate than the tasks that dominate today.
Ksenia: In this crazy race, what's your planning horizon? 2026, 2027? What's your strategy?
Devvret: Yeah, there's the famous quote from the boxer Mike Tyson: "Everyone has a plan until they get punched in the face." That's probably true in AI as well. We absolutely have a plan that extends through the end of this year and beyond. But the truth is AI changes on a weekly basis, so you need a framework that lets you make decisions as things come in more quickly. It's all guided by our North Star vision, which is to help customers develop specialized AI: help them tune their models, and serve and deploy those models. The more dynamic piece is what the best ways to tune models will be. We're not religious about saying supervised fine-tuning is the be-all and end-all; we just think it's the best technique today if you care about getting the best performance out of your models. A year from now, you'll see a completely new technique come up. And obviously the biggest new technique this year is reinforcement fine-tuning, which we pioneered at the beginning of the year.
So from our standpoint, our vision is really predicated on two things, in service of helping customers develop specialized AI: tune those models, then run highly performant model deployments in production. That means we're going to continue to build infrastructure for training, inference, and serving. And we know what's coming through the rest of this year: advanced techniques for specializing and customizing models, advanced techniques for running inference better, and expansion to new modalities. We're seeing more customers ask about multimodal – vision or voice – today, so we want to continue expanding there. Within this broader framework of tuning and serving, we'll see where the research leads and adapt it back into the platform.
Ksenia: Let's go to inference. Why is it so hard for enterprises, and what could make it easier?
Devvret: Inference is a phenomenal example of something that starts off easy and gets hard as you peel back the layers of the onion. So why is inference hard? It gets hard across the different stages an organization goes through. There's the crawl stage, where the difficult part of inference isn't so much brilliant software engineering as the fact that GPUs, for example, might be hard to get. If you're running the largest possible model, you might need eight or 16 H100s just to have a single replica deployed, which means you have to procure them and then decide whether you can autoscale them up or down inside your environment. And you need to set up an initial inference server and framework. Now, a number of companies, including us, have tried to make that easy by open-sourcing inference frameworks and servers – we've open-sourced LoRAX, the underlying inference technology we use, so anyone can stand it up themselves.
If you're a smart engineer, you can set up your own inference framework and server. You can get it working, and it'll work well enough to feed your initial prototype and application. What's really hard, and what a lot of customers don't want to do, is maintain production inference. Production inference is another ball game. It doesn't mean the model is up 95% or even 99% of the time; it means it's backed by 99.9% or 99.9999% SLAs. That means you need resilient fault tolerance and blue-green deployment updates. It means you need multi-region replication for your model deployments. And most critically, GPUs are expensive, which means you need to optimize the models so you're getting the most out of every token that comes out – optimizing for total cost of ownership, whether that's using a smaller model or techniques in our platform like TurboLoRA, a software-defined way to increase model throughput by 2x. All of these factors come in when you make the shift from "I'm prototyping and can get inference going" to "I'm actually going to production: I need monitoring, SLAs, high-performance throughput, and everything else that goes with feeding a business-critical application."
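The jump between those SLA tiers is concrete: each additional nine cuts the permitted downtime by a factor of ten. A quick back-of-the-envelope calculation:

```python
# Allowed downtime per year at a given availability SLA.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability: float) -> float:
    """Seconds of permitted downtime per year at the given availability."""
    return (1 - availability) * SECONDS_PER_YEAR

for sla in (0.95, 0.99, 0.999, 0.999999):
    hours = downtime_per_year(sla) / 3600
    print(f"{sla:.6%} SLA -> {hours:.2f} hours of downtime per year")
```

A 95% SLA tolerates well over 400 hours of outage per year, while 99.9999% ("six nines") allows only about half a minute, which is why the latter demands fault tolerance, blue-green updates, and multi-region replication.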
The last thing I'll briefly say about inference: while those are real challenges, there's a view that inference is going to get increasingly commoditized as a market. I don't disagree with that for base-model inference in particular – there's no reason one company's DeepSeek endpoint should be way better than another's. What I think is going to be really interesting is the trend towards what we internally call "intelligent inference." Intelligent inference, in my view, goes back to that initial conversation we were having: do you have an inference pipeline that hooks into a post-training stack and lets your models get better over time? That's really the future of where we see inference going.
Ksenia: Speaking of open source and you being such a proponent of open source AI models, what is missing in the open-source AI stack? And what are the gaps between this model zoo and production?
Devvret: I think the open-source model stack has gotten pretty good for the core infrastructure you need to set up. Open-source fine-tuning frameworks like ours, Ludwig, are pretty good at helping people start running experiments. Open-source inference frameworks like ours, LoRAX, are pretty good at helping people do the initial serving. And when you want to shift to a managed platform, you have a pretty easy on-ramp to a platform like Predibase that gives you batteries included – GPUs and infrastructure out of the box. One thing that is missing is a really resilient way to do evaluations. This has been talked about quite a bit in the LLM industry, and the truth is it's a challenging problem, because LLM outputs can at times be objective and at times quite subjective. Who's to say whether something is a good summary, versus simply measuring classification accuracy as in traditional machine learning? I've seen a number of folks – and am friends with a number of folks – who have started frameworks or even companies in LLM evaluation. But I still think it's an open problem today.
Ksenia: A lot of companies build their evaluation systems in-house. The paper "The Leaderboard Illusion" showed us that crowdsourced evaluation can't really work at this moment, right?
Devvret: Yeah, I see people tackle evaluation in a number of different ways. Most people I see do evaluation in-house – I'll start by saying that's the most common thing I see. I've seen three versions of evaluation. The first relies heavily on existing data and proxies. A good example: if you're doing document classification, you look at some historical holdout data and see how the model performed. That's the cleanest, simplest way, but it isn't always possible, because it only works for use cases where you have that historical data. The second is to lean on GenAI itself. Here the most common technique is LLM-as-a-judge: people use larger models as graders to answer questions like "did this response answer the customer's question?" or "was this a good summary?" And the third way, which feels a bit like vibes, is that you ship the product and try to collect some sort of product feedback to understand whether the model is behaving in the direction you anticipated. So the truth is, a lot of evaluation is in-house, and we should not underestimate how much of it is vibes today. But getting good evaluation is going to be critical to building this kind of continuous improvement.
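The first approach – scoring against historical holdout data – is the most mechanical of the three. A minimal sketch, with an invented keyword classifier standing in for the LLM call and made-up documents and labels:

```python
# Minimal sketch of "historical holdout" evaluation: compare model
# classifications against labels you already trust. The documents,
# labels, and toy classifier below are invented for illustration.

holdout = [
    ("Invoice #123 attached, payment due in 30 days", "invoice"),
    ("Your flight is confirmed for Friday",           "travel"),
    ("Please review the attached NDA",                "legal"),
    ("Reminder: invoice overdue",                     "invoice"),
]

def toy_classifier(text: str) -> str:
    # Stand-in for an LLM call; a real pipeline would hit a model endpoint.
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "flight" in lowered:
        return "travel"
    return "legal"

def holdout_accuracy(examples, classify) -> float:
    """Fraction of holdout examples the classifier labels correctly."""
    correct = sum(classify(text) == label for text, label in examples)
    return correct / len(examples)

print(holdout_accuracy(holdout, toy_classifier))
```

This clean accuracy number is exactly what the second and third approaches lack, which is why LLM-as-a-judge and "vibes" evaluation come into play for subjective outputs like summaries.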
Ksenia: I agree, but it's so hard, especially with the closed models, because they publish a new model, it's a new persona, and you need to develop basically a new vibe to understand it. So it's really tricky.
Devvret: Yeah, I think open source is helping a lot. In particular, with open-source reasoning models, one of the big things was exposing the reasoning tokens themselves. If you use DeepSeek R1, versus the earlier generation of reasoning models from a closed provider, you can run evaluations not just on the model output but on the series of steps it took to get there. Look, I think this point was much less obvious in 2023 but has become really obvious for most companies now: open source is here and is here to stay. In 2023, when we talked about the future being open source, the best open-source model at the time was GPT-J, and it was a far cry from GPT-3.5. Today, DeepSeek R1 or V3, Qwen 3, Llama 4 – these models are not only on par with the leading commercial models but often beat them on benchmarks. To me, that's actually six months ahead of schedule: I would have thought the end of 2024 was the earliest we'd see open-source models beat commercial ones, and that this year they would merely be on par. The rate of innovation we've seen in open source is incredible.
Ksenia: You don't often speak about AGI. What's your stance?
Devvret: Honestly, I think AGI is far from the world that I see. The world I see tends towards practical, specialized intelligence rather than artificial general intelligence. When it comes to AGI, I often leave it to the folks in the research labs to come up with good definitions. If we take a definition of AGI as "a model that passes a Turing test," I'd suggest we're probably in that ballpark already. But what's the practical implication of that? I don't spend much time thinking about Terminator-style scenarios, but I do spend a lot of time thinking about what this generalized intelligence looks like when applied to real business processes – at a Fortune 200 company like Marsh McLennan, or an organization like Checkr that I mentioned earlier, or any of these other companies that have unlocked a lot of productivity through business practices over past decades. That productivity is about to take a step-function increase. To me, that's the interesting area for where this is going. There are probably deeper philosophical questions about what happens over a five-, 10-, or 20-year period, but we've found it hard to predict even 18 months ahead in AI. So my focus has really been on the practical application in both enterprise and consumer.
Ksenia: That's a very nice perspective. What has concerned you and excites you the most about the future that you build with Predibase?
Devvret: What concerns me actually comes back to the evaluations piece. People see an incredible amount of value. I see statements in analyst reports from time to time asking, "Is generative AI a bubble? Are people actually seeing real business value?" I never get that question from a CIO at a Fortune 200 who's actually working on generative applications. When you're on the ground and see what these models can do, the question of whether there's ROI versus the 50 cents or a dollar it costs to process a million tokens isn't even a question in the vast majority of situations. And if it is, it's really just a question of selecting the right use case.
But what does concern me is that we might go through a bit of a hype bubble, where people get really excited about far-fetched, French-poetry-style use cases – models doing amazing multi-agent demos – and then enter a bit of a trough of disillusionment, because they gravitated towards use cases that weren't the high-value, business-impact use cases the models can handle today. At a high level, the concern is that people shoot too far ahead, fail to realize the business impact they could have with LLM use cases today, and come away disillusioned.
I will say, though, given how quickly people have been able to iterate, I'm a lot less concerned about that than I was five or 10 years ago. I've been in the AI space for over a decade, and I think that for all but the top 1% of organizations, AI genuinely over-promised and under-delivered between 2012 and 2022. It was an era of "look at how YouTube does recommender systems – you can bring that to your small and medium-sized business." That never really translated. I don't think that's true for the current wave of AI. What would concern me is if we adopted paradigms that make people think it is.
The trend I'm most excited about: it used to be that in software development you'd perfect and then ship. You'd build, then test, test, test, do some dogfooding, and then ship your product. As a startup entrepreneur, I think a lot about how to get fast feedback from the market as quickly as possible. One of my previous mentors said that if you ship something you're not at least a little embarrassed by, you waited too long. What excites me in AI is that you've started to see a shift in how people develop: they're putting out, honestly, 60% solutions today. They do it because they want to test whether they have product-market fit, and to start collecting data so they can improve those models over time. I'm excited to see what this startup way of thinking will mean now that it's being adopted not just by 20- or 50-person organizations but also by larger companies adopting GenAI.
Ksenia: Thank you. That's very insightful. My last question is a complete change of gears. What is a book or idea that shaped your thinking? And it can be related to machine learning or completely unrelated.
Devvret: Completely unrelated: a book I like is "The Happiness Advantage" by Shawn Achor. He's a psychologist who studied behavioral psychology at Harvard, in the context of organizations. What he found was that it wasn't necessarily that success brought happiness; it was that happiness made you much more likely to be successful. The book covers two key things. The first is how a more positive, happy outlook lets you do better at the tasks in front of you, whether work or personal. The second is ways you can – I don't want to say hack happiness, that sounds very San Francisco biohacking – put yourself in a position to be a lot happier without relying on external factors. So "The Happiness Advantage" is one I really enjoyed as an overall read, and an idea that extends to both personal and professional life.
Ksenia: So you think Predibase is a happy organization?
Devvret: I hope so. The truth is, if you're working in generative AI today, it's a noisy environment – fast-moving, competitive. Players, including us, are well-funded, which means there are a lot of cards on the table. But I think we and other organizations will do our best work if we're excited about the future we're running towards, rather than operating predominantly out of concern.
Ksenia: Great, thank you so much. That was wonderful.
Devvret: Of course. I really enjoyed the conversation today, and thanks again for having us on.