🎙️Why AI Still Needs Us?
An Inference with Olga Megorskaya, CEO @ Toloka
Hi everyone – hope the weekend's treating you well. Turing Post now has a proper YouTube channel, and you can listen to our podcasts on Spotify or Apple.
In this episode, I sit down with Olga Megorskaya, CEO of Toloka, to explore what true human-AI co-agency looks like in practice. We talk about how the role of humans in AI systems has evolved from simple labeling tasks to expert judgment and co-execution with agents – and why this shift changes everything, even on the compensation side for humans.
We get into:
Why "humans as callable functions" is the wrong metaphor – and what to use instead
What co-agency really means
Why some data tasks now take days, not seconds – and what that says about modern AI
The biggest bottleneck in human-AI teamwork (and it’s not tech)
The future of benchmarks, the limits of synthetic data, and why it is important to teach humans to distrust AI
Why AI agents need humans to teach them when not to trust the plan
If you're building agentic systems or care about scalable human-AI workflows, this conversation is packed with hard-won perspective from someone who’s quietly powering some of the most advanced models in production. Olga brings a systems-level view that few others can – and we even nerd out about Foucault’s Pendulum, the power of text, and the underrated role of human judgment in the age of agents.
Olga is a deep thinker, and I truly enjoyed this conversation. Watch it now →
This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. If you want to support us without getting a subscription – do it here.
The transcript (edited for clarity, brevity, and sanity. Always better to watch the full video though) ⬇️
Ksenia Se: Thank you, Olga, for joining me today. You once quoted a phrase from “Humans as Tools? The Surprising Evolution of HITL in Agentic Workflows” about the evolution of human-in-the-loop systems. You quoted: “Humans are another callable function in an AI agent’s toolbox.” When did you first realize that humans could be seen this way – as callable functions?
Olga Megorskaya: Yeah, thank you, Ksenia. I remember coming across your article, and it really resonated with me – because in describing the evolution of human-in-the-loop in the machine learning industry, you were basically telling the story of Toloka. That really struck me.
When we started, many years ago, our goal was to support classical ML development with training data. Back then – if you remember, it was quite a while ago – everyone was building their own classifiers for different use cases. The tasks that required human-labeled ground truth were fairly simple: labeling cats, dogs, pedestrians, cars, and so on. But the variety was enormous. Thousands of different applications, each with their own classifiers, each needing datasets to train on. We had literally tens of thousands of people annotating tasks across thousands of projects every day.
That’s when we realized the core concept that became foundational to Toloka’s philosophy: to scale human ground truth production, you have to manage human effort technologically. Personally, I’m not a huge fan of the phrase “humans as callable functions” – I prefer to think of it as managing human effort in a structured, technological way to enable scalable, repeatable production of high-quality data.
Then came the next era – what I’d call the foundation model era, the time of ChatGPT and large language models. Suddenly, it wasn’t enough to just be a human annotator. The complexity skyrocketed. You needed deep domain expertise. Now we were bringing in PhD physicists, senior software engineers, legal professionals – real experts – as the source of ground truth. The variability of the tasks, though, decreased significantly. Instead of training thousands of separate classifiers, the industry shifted toward training a small number of foundation models that could be fine-tuned or adapted to a wide range of downstream tasks.
Now, we’re entering a new phase again: the age of AI agents. What’s fascinating is that this stage combines the challenges of both previous eras. On one hand, tasks are growing even more complex – demanding more time, more precision, more domain expertise. There was a time when the average annotation task took 30 seconds. We actually had that number hardcoded into our platform! Now, it’s not unusual for a single item in a dataset to take 10 hours – or even several days – of expert human work.
On the other hand, variability is rising again. AI agents now operate across a wide range of surfaces. It's no longer just about chatbots. We’re dealing with multiple modalities, diverse interfaces, and hundreds of interaction scenarios – each of which needs to be tested, evaluated, red-teamed, and so on.
So once again, the ability to manage human effort at scale becomes crucial. Matching the right human expert to the right task – treating each person as a vector of skills, availability, cost – becomes a key part of building successful data pipelines. And in that sense, human and AI agents start to look very similar. Both have skills. Both have capacity. Both can integrate with tools. And both have a cost.
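To make that “vector of skills, availability, and cost” framing concrete, here is a minimal, hypothetical sketch of such a matching step in Python. The profile fields, the cost-based scoring rule, and the example workers are illustrative assumptions, not a description of Toloka’s actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical sketch: human experts and AI agents share one profile type,
# so a router can hand a task to whichever worker fits best.
@dataclass
class WorkerProfile:
    name: str
    skills: set[str]          # e.g. {"physics", "python", "legal-review"}
    hours_available: float    # remaining capacity
    hourly_cost: float        # USD per hour (or an equivalent per-call cost for an agent)
    is_ai_agent: bool = False

@dataclass
class Task:
    description: str
    required_skills: set[str]
    est_hours: float

def match(task: Task, workers: list[WorkerProfile]) -> WorkerProfile | None:
    """Pick the cheapest worker that covers the required skills and has capacity."""
    eligible = [
        w for w in workers
        if task.required_skills <= w.skills and w.hours_available >= task.est_hours
    ]
    return min(eligible, key=lambda w: w.hourly_cost * task.est_hours, default=None)

pool = [
    WorkerProfile("phd_physicist", {"physics", "latex"}, hours_available=10, hourly_cost=120.0),
    WorkerProfile("planning_agent", {"physics", "latex", "python"}, hours_available=1000,
                  hourly_cost=2.0, is_ai_agent=True),
]
task = Task("Draft and verify a kinematics problem set", {"physics", "latex"}, est_hours=4)
print(match(task, pool).name)  # the cheaper agent wins here; a trust-critical task would weight humans higher
```

In a real pipeline the scoring would be richer (quality history, trust requirements, deadlines), but the point of the sketch is that humans and agents can sit behind the same interface.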
At Toloka, we strongly believe in this hybrid collaboration between human agents and AI agents. We think the future lies in that intersection.
Ksenia: I also remember you once sketched your journey at Toloka—from crowd-based data labeling, to human preference feedback for RLHF, to expert evaluation in niche domains, as you just described, and now toward co-agency in multi-agent teams.
So my first question is: what is true co-agency? Can you define it for me in a way that everyone can understand?
Olga: Well, the major difference in the era of AI agents comes down to – I would say – two things.
One is, as I have mentioned, interaction across a large number of surfaces. It's not about chatting with a model in a chatbot anymore. It is about using your computer as a surface for interacting with the AI agent, using all the tools you work with day to day, interacting with the agents, et cetera, et cetera.
And the second is iterations. So, multiple iterations between the user and the system. We are now working with long-term trajectories that start from one point and can then go in many different directions. And this – from the point of view of collecting training data and creating benchmarks for training generative systems – is the most important difference we see right now compared to the previous stage of just creating dialogues between the model and the user.
Ksenia: So what is co-agency in your definition?
Olga: Well, for us, co-agency is when an AI agent and a human agent are solving the same task together.
There are things that AI agents can greatly help with – things where AI agents are much better than human agents: for example, decomposing the task, creating the plan for it, and helping a human validate that they actually follow the steps of that plan. Because humans usually have problems following a plan – not because they are lazy, but because we, as humans, carry much more context. Basically, all our life experience is our context. And that's why we quite often tend to skip some steps and stages – they are obvious to us, intuitive to us. But what is intuitive for one person may not be intuitive for another.
And when we are talking about scalable human operations, you need to make sure that none of the steps is missed. This is where AI agents help human agents a lot – helping them look with a fresh eye at the results of a completed task and spot potential mistakes and problems.
These are the things where AI agents are helping our human experts in performing the tasks.
At the same time, there are things that AI agents cannot do. One of the most important is that sometimes they don't know what they don't know. And this is – ironically – the most important skill we are training our experts in right now: to distrust LLMs, to see when the plan the AI agent has created is actually wrong. In, say, 70% of cases it will be correct, but in 30% it will be wrong.
And this is the most important and most responsible part of human input – to say, “No, no, I'm not listening to you here. I am applying my own amazing human wisdom to do it my way.” And this is actually a very important source of the signal for the whole system.
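As a small illustration of that override-as-signal idea, here is a hypothetical sketch in Python: an AI agent proposes a plan, a human reviews each step and may reject it, and every rejection (with the human's reasoning) is kept as training signal. The function names and the toy reviewer are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepReview:
    step: str
    accepted: bool
    human_note: str = ""

def review_plan(
    plan: list[str],
    human_review: Callable[[str], tuple[bool, str]],
) -> tuple[list[str], list[StepReview]]:
    """Run every AI-proposed step past a human reviewer.

    Accepted steps form the final plan; rejections, together with the
    reviewer's reasoning, are kept as signal for improving the agent.
    """
    final_plan, signals = [], []
    for step in plan:
        accepted, note = human_review(step)
        signals.append(StepReview(step, accepted, note))
        if accepted:
            final_plan.append(step)
    return final_plan, signals

# Toy reviewer: distrusts one step of the agent's plan and says why.
ai_plan = ["collect sources", "skip cross-checking dates", "write summary"]
reviewed, log = review_plan(
    ai_plan,
    lambda s: (False, "dates must be cross-checked") if "skip" in s else (True, ""),
)
print(reviewed)                              # ['collect sources', 'write summary']
print([r for r in log if not r.accepted])    # the rejection is the valuable signal
```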
Ksenia: For humans to trust themselves more!
Olga: Yes, in certain cases, yes. And unfortunately – and that’s the essence of the task that we are solving – you cannot tell in advance when you need to trust AI and when you need to not trust AI. And this is the moment of judgment that brings the useful signal to the system.
Ksenia: How interesting. Is it the main bottleneck for the true co-agency between humans and AI, or are there others?
Olga: This is fundamentally the most important bottleneck.
Apart from that, obviously, there are engineering challenges that will be solved sooner or later – like the number of integrations that agents have right now. Not many agents can freely use the computer and other applications. But, I don’t know, a year from now it will not be a problem for the majority of applications.
It will take some time for AI agents to be able to solve more complicated tasks in more niche domains and in more niche applications – I don’t know, AutoCAD, engineering applications, something else – something that is not in the top hundred most popular applications in the world.
But again, I think these are mostly engineering limitations. I think there will always be this long tail of use cases where agents are not yet engineered enough, and where you will need the interaction with a human expert to finalize and to actually solve the task.
But within all these operations, the major essence is to decide when to trust the agent, and when not to trust the agent and to trust the human. And this is a fundamentally difficult thing.
Ksenia: How do you train people to do that?
Olga: This is something we learned long ago. Basically, training people is not much different from training models. People are best trained by examples. So you create a dataset of use cases, you show them to people, explaining: here is right, here is wrong. And while working through those examples, people are training their own neural network and start to understand the logic.
I think that this is still the most efficient way of teaching.
Ksenia: When I talked to Edo Liberty from Pinecone, he said that after ChatGPT, they had to rewrite the whole architecture, basically, for their vector database. Did you have any problem like this? What happened to you after the ChatGPT boom?
Olga: Well, for us, ChatGPT per se did not change a lot in terms of technological architecture. But I rather think that now, these AI agents are something that fundamentally will change the technological foundation of the service.
The coming of ChatGPT, for us, marked the major milestone of switching our focus from crowdsourcing toward working with highly skilled human experts. That also brought new technological challenges and new areas of focus and investment. Because when you're dealing with systems where it is important to bring in highly skilled professionals in certain domains, you want to invest more into, again, technologically selecting and attracting those people, qualifying and checking their level of expertise, and then investing a lot into building the community and building trust with those experts. That is a somewhat separate, non-technological part of the business.
But I do believe that it is crucially important – when we are talking about people who are bringing their expertise to train AI – that these are people who are actually acting professionals in their domains. You cannot bring useful signals to AI systems if everything you do is being an AI trainer for eight hours a day, forty hours a week.
You need to bring the real insights from the real market. You need to be up to date with your profession. That means we need to be able to attract highly skilled, highly paid professional specialists, and to offer them something that motivates them to participate in those kinds of tasks.
So, I think this is what brought the major change to our business with the appearance of ChatGPT. And now, AI agents are bringing – on top of that – a new technological foundation for co-agency between human and AI agents.
Ksenia: What is the role of synthetic data in your datasets? (we just covered “How HITL is Saving AI from Itself with Synthetic Data” here)
Olga: By nature of our business, we are helping our customers with datasets that are mostly purely human data. Since we’re working with some of the most technologically advanced companies in the world, such as Anthropic, Amazon, Microsoft, et cetera, just creating purely synthetic data is something that our customers can do by themselves.
However, there is always a limit above which you cannot gain any substantial profit from training on synthetic data. And this is when you need human ground truth – to evaluate the quality of synthetic data and to provide the next-level signal. That’s why we are focused mostly on purely human data.
At the same time, of course, we use different technological approaches in delivering those datasets. You need to be able to use approaches similar to synthetic data generation when you want to ensure diversity in the dataset. So, you need to think about the taxonomy of the dataset, to ensure it covers all the variety of topics.
For example, if you're creating a training set in the field of finance, you want to make sure you’re covering all the major topics that are important for the model to learn about finance. This is when, at first, you would invite a human expert to help define the taxonomy. Then, you would generate – synthetically or semi-synthetically – a skeleton of that data. And then you would call for human experts again to validate, update, and upgrade that data based on the skeleton.
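A hedged sketch of that three-stage flow in Python: a human-defined taxonomy, a synthetic skeleton generated over it, and a human validation pass. The `draft_with_llm` and `expert_review` functions are placeholders standing in for a real generation model and real expert review, and the finance taxonomy is a made-up example.

```python
# Hypothetical sketch of the taxonomy -> synthetic skeleton -> human review flow.

finance_taxonomy = {
    "corporate finance": ["valuation", "capital structure"],
    "markets": ["fixed income", "derivatives"],
    "regulation": ["Basel III", "reporting standards"],
}

def draft_with_llm(topic: str, subtopic: str) -> str:
    # Placeholder: in practice this would call a generation model.
    return f"Draft Q&A item about {subtopic} within {topic}."

def expert_review(item: str) -> str:
    # Placeholder: a domain expert validates, corrects, and enriches the draft.
    return item + " [validated by human expert]"

def build_dataset(taxonomy: dict[str, list[str]]) -> list[str]:
    """Cover every branch of the taxonomy so the dataset stays diverse."""
    dataset = []
    for topic, subtopics in taxonomy.items():
        for sub in subtopics:
            skeleton = draft_with_llm(topic, sub)    # synthetic skeleton
            dataset.append(expert_review(skeleton))  # human upgrade pass
    return dataset

print(len(build_dataset(finance_taxonomy)))  # 6 items, one per taxonomy leaf
```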
Ksenia: That’s very interesting.
Olga: Synthetic data is a very powerful tool, so obviously a lot of companies are using it. At the same time, I think there is a common understanding in the industry that synthetic data alone is just not enough. You always need human oversight. At the very least, you need to evaluate the quality of the synthetic data. You also need to develop the benchmarks against which you want to measure the quality of your models. And benchmarks usually require deep, hands-on work from human experts to create.
Ksenia: Is human-AI true co-agency a path to AGI, or is it AGI? What's your take on it?
Olga: I don’t know. To me, discussions about AGI are, to be honest, quite impractical. So I prefer to be more down to earth, looking at what we can do from the engineering point of view. I definitely think that co-agency and hybrid systems, where humans and AI are collaborating, are the next step. Whether it is the last step or not, I honestly don’t know. I have no idea. I personally don’t think there is such a thing as complete AGI. There will always be some need for ground truth, which you cannot get from anywhere other than human wisdom.
Ksenia: That’s why I love talking to practitioners – because basically every time I ask a question about AGI, people say, “Well, I try to look at it more practically.” You build this whole thing; you really know how it works in practice. So I think it’s very helpful to have this narrative out.
Olga: I think this is maybe both good and bad, because we are developing step by step, and every step seems quite small. And maybe within those small steps, there is a risk of losing the bigger picture. Even looking back at the evolution of Toloka over those 10 years, we see what a long path it has been. But at every moment in time, these were very minor and very practical steps. Probably 10 years ago, it was hard to imagine what it would be like today.
Ksenia: Well, it's a famous phenomenon – that first you call something a wonder, and then it just becomes software. So when you think ahead a couple of years – five years – what excites you the most, and what are your concerns?
Olga: Well, again, speaking about our engineering and practical day-to-day things we are dealing with – what really excites me is the technological part of this hybrid collaboration. Because I think it opens a lot of opportunities for bringing in and attracting many more human experts from a large variety of different domains into the whole AI production world.
Right now, we are still, I think, in kind of a bubble. And there are lots of parts of the human economy that are basically not touched by AI at all. It will take us some time to go from the apps that help you set up a calendar or, I don’t know, book tickets, toward some hardcore real-economy challenges. I think this is a very interesting opportunity. And what excites me is to see how AI will unlock new domains of human knowledge – going further from the offices, further from our laptops, into the real world. I think that’s an exciting thing to observe.
Ksenia: Any concerns?
Olga: Every radical change brings certain concerns. I wouldn’t call them concerns though. We are in a very interesting position where, actually, a lot is in our hands to ensure that this technological evolution – or revolution, whatever you call it – actually goes smoothly, and that the systems are under control.
And the beauty of that is that we actually have instruments to influence it. We are doing a lot of tasks related to red teaming, to ensure that AI agents are working in a safe and responsible way. We are working a lot on developing benchmarks, which basically guide the directions where the models are developing.
We’re actually building a system that allows different human specialists from different professions to benefit from and get additional opportunities for income from training AI. So we have in our hands opportunities to help people not be afraid of AI replacing them, but rather to see new opportunities in AI production – offering them new sources of income, for example.
That’s why I think we at Toloka are in a very interesting position where there’s no point in being concerned. Rather, there’s a point in taking those efforts in our hands and basically shaping the future the way we want it to happen.
Ksenia: How do you work with benchmarks? Because, you know, the recent scandal with benchmarks demonstrated that it's very hard to rely on human judgment. It can be tricked; it can be gamed. So how do you work with benchmarks? And in a more general sense, what do you think about benchmarks for models?
Olga: I think that benchmarks are super important. And I do see that now the industry has already adopted this concept. Because, like, three years ago, nobody was talking about evaluation at all. Two years ago, everybody was saying we need some evaluation and some benchmarks, but nobody knew what to actually do with that.
Inventing a new benchmark is a very serious intellectual effort – because you need to come up with the design of the benchmark. You need to understand: what are the questions you need to get answers to? And that's why this is a very responsible and very prominent job – to design the benchmarks.
But the problem with public benchmarks is that they get leaked very fast, and basically, it's quite hard to rely on them only. That's why what we see in the industry is that industrial players usually select the kind of benchmarks that they want to rely on and then – with our help – design specific custom benchmarks for internal use, to make sure they are not leaked anywhere, etc.
I would say that right now, a very popular benchmark related to agency is the TAU benchmark. It was designed by the company Sierra – maybe a couple of years ago, maybe a year ago. Now it is getting a lot of traction: everybody wants similar benchmarks to evaluate their models against.
SWE-bench is a very popular one for coding as well. There are also some funny benchmarks – for example, the GAIA benchmark, which is designed around use cases that are presumably easy for humans to solve but very hard for AI agents.
It's interesting because it’s super impractical – like, you would never deal with such tasks in any of your real-life scenarios. But at the same time, they’re interesting because they illustrate the limitations of the capabilities of modern AI agents.
Ksenia: Do you think it's possible to create some sort of a general benchmark, or is it still mostly internal benchmarking that makes sense?
Olga: To be honest – again, from a practical point of view – I do not quite believe in one general benchmark to measure everything against. I rather believe in a set of different, specialized benchmarks. They are very useful. They are practically useful – because you set your goal against this benchmark: “I want to hit 90% of this benchmark,” and you're optimizing the model to reach that level. Then you say, “Okay, now I want to choose another benchmark and optimize for it.” And this is basically like the guiding steps for the industry to develop.
There are some interesting initiatives around general, industry-wide benchmarks and general datasets. We, for example, have contributed, together with the MLCommons community, to creating such a red-teaming dataset.
Ksenia: That's a recent one?
Olga: Yeah, it was released in December 2024. So that's an example of an attempt to create something that should be used across the industry. I think these are noble initiatives. But from a real production perspective, of course, every team would be willing to define their own paths and to define specific benchmarks that they want to hit along that path.
Ksenia: Another role for humans.
Olga: I think that is really a role for a human – to be the source of the ground truth and to be the benchmark. Because ultimately, the whole concept of AI is that the human expert is not an executor. The human expert is the benchmark.
Ksenia: That's a good way to put it. Do you think there will be a time when humans are just tools for AI?
Olga: I don’t think we should put it that way – because philosophically, it’s not about being a tool. It is about, again, being the ultimate measure of the ground truth. So AI is trying to keep up with human expertise.
I think we’ve already reached the situation in some areas where the average model is performing better than the average human. And that’s why it becomes a specific job to actually collect the collective human wisdom that is still higher than the wisdom of the AI. But that’s the essence of the whole development and progress.
From the technological point of view – again, going back to where we started – I do think that when we’re talking about human operations, we need to think about human operators the same way we think about AI agent operators. Because technologically, this should be a seamless flow that navigates the task between humans and AI. They’re working in collaboration. So this should be a process that doesn’t have those borders: “This is the human part. This is the AI part.”
Ksenia: Thank you. Now to my last questions: I believe books shape humans. What is the book that influenced your philosophy – for the company or in general – that you would like to share?
Olga: You know, I was recently thinking about that, and probably this is not the most common answer, because my favorite book is a purely fictional one. And that’s Foucault’s Pendulum by Umberto Eco.
This is a book with lots of cultural references and lots of layers inside it. One of those layers hit me only recently – I didn’t think about it when I first read it, many years ago, but when I reread it recently, I was just astonished. Even though it was written in the 1980s, long before any AI happened, what it describes – on one of its levels – is the power of text. The power of literally letters.
It describes how a simple sequence of letters, simple text, can by itself create whole new concepts, whole new societies, whole new religions – and ultimately, questions of life and death for particular people.
The whole story of the book is about how it all starts with people finding some small piece of paper with some letters written on it. And depending on how you fill in the gaps between those letters, you can think of it either as the starting point for secret societies and hidden treasures, or as a simple note a wife wrote to her husband to pick up some products at the market.
And these are two very different trajectories that can be generated out of one simple sequence of letters. And if you think about it – this is what we’re observing now with large language models. This fascinating power of the text that, back then, was a purely fictional, intellectual exercise – and now we’re actually living it.
So I think that’s something that excites me about the books I’ve recently read.
Ksenia: Oh, that’s fascinating. Thank you so much for this interview. It was very insightful.
Olga: Thank you, Ksenia.