Turing Post
Topic 44: The Human Touch: How HITL is Saving AI from Itself with Synthetic Data

We explore how human-in-the-loop systems are keeping synthetic data grounded, useful, and safe in the age of AI self-training

Alyona Vert. & Ksenia Se
June 11, 2025

Since late 2023, AI developers and researchers, including Ilya Sutskever, have been talking about the limits of real data. The argument is that we have reached a peak: there isn't enough fresh web data left to keep training models, so we need new ways to replenish datasets. Most are betting on synthetic data – data generated by models themselves that can be used at different stages of training.

It sounds simple: why not just generate more data? But it's not that easy. Synthetic data often lacks accuracy, and training models on low-quality outputs can lead to model collapse – a serious degradation in performance. In short, we’re nowhere near a world where AI can generate training data without oversight.

So what’s the practical path forward?

Today, we’ll explore how AI teams are using human-in-the-loop (HITL) systems to make synthetic data useful and safe. We’ll break down how humans guide and validate the process – and show real-world examples of how this is being implemented today.

In today’s episode, we will cover:

What synthetic data is – and why it needs a human touch
The Engines of Creation: Methods for Generating Synthetic Data
The Ghost in the Machine: The Specter of Model Collapse
The Human Lifeline: How HITL Grounds Synthetic Data
In the Trenches: How Real Companies Use Synthetic Data + HITL
- OpenAI’s GPT-4.5
- Microsoft and Phi-4 training strategy
- Walmart experience
- NVIDIA's Cosmos and the Future of Computer Vision
Conclusion
Sources and further reading

What synthetic data is – and why it needs a human touch

AI generating data for itself – that’s the core idea of synthetic data. Instead of collecting examples from the real world, models produce artificial data that mimics reality: text, images, videos, structured tables. It’s data made by algorithms, for algorithms.

The concept isn’t new. Synthetic data has long been used in robotics and autonomous driving to simulate rare or dangerous edge cases – helping teams test faster and safer. But in 2024, it became urgent. Ilya Sutskever warned we’ve hit peak real data. Elon Musk put it bluntly: “We’ve exhausted basically the cumulative sum of human knowledge in AI training.” The internet can’t feed these models forever.

Why synthetic data matters now:

It fills data gaps – especially for rare, risky, or domain-specific scenarios (like plane crashes or rare diseases).
It protects privacy – allowing companies to train on realistic-but-fake user data.
It cuts costs – reducing reliance on expensive labeling and slow collection cycles.
It can reduce bias – by generating diverse, balanced data under human control.

So why not just generate unlimited synthetic data and call it a day?

Because bad synthetic data leads to model collapse – a loop of self-reinforcing errors. That’s why a new wave of techniques now focuses on controlling synthetic data – making sure it improves models rather than breaking them.
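
To make the loop concrete, here is a toy simulation. It is a sketch under strong assumptions, not a claim about how any production model behaves: the "model" is just a Gaussian fitted to a handful of samples, and each generation trains only on data drawn from the previous generation's fit. The function name `simulate_collapse` and all of its parameters are illustrative.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Refit a Gaussian on its own samples and track the estimated spread.

    Returns the standard deviation after each generation, starting with
    the "real" value of 1.0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # assumed "real" data distribution
    history = [sigma]
    for _ in range(generations):
        # Generate synthetic data from the current model...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...and train the next model only on that synthetic data.
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append(sigma)
    return history


history = simulate_collapse()
print(f"spread at start: {history[0]:.3f}, after 300 generations: {history[-1]:.3g}")
```

With samples this small, the estimated spread tends to drift toward zero over many generations: each refit loses a little of the tails, and nothing in the loop puts them back.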

One emerging method is inference-time self-training: models generate outputs, critique them, and retrain on their best answers. It's a closed feedback loop – AI refining itself. But even this isn’t enough.
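
The generate, critique, keep-the-best loop can be sketched in a few lines. This is a minimal illustration, not any lab's actual pipeline: `self_training_round`, `noisy_adder`, and `exact_match_critic` are hypothetical names, the "model" is a toy that adds two numbers with random errors, and the retraining step itself is left out.

```python
import random
from typing import Callable, List, Tuple

Prompt = Tuple[int, int]


def self_training_round(
    prompts: List[Prompt],
    generate: Callable[[Prompt], int],
    critique: Callable[[Prompt, int], float],
    n_candidates: int = 4,
    min_score: float = 1.0,
) -> List[Tuple[Prompt, int]]:
    """Sample several answers per prompt and keep the best one if it passes."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        scored = [(critique(prompt, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= min_score:
            kept.append((prompt, best))  # would become a new training example
    return kept


# Toy stand-ins for a model and its critic (assumptions for illustration).
_rng = random.Random(0)


def noisy_adder(prompt: Prompt) -> int:
    a, b = prompt
    return a + b + _rng.choice([-1, 0, 1])  # sometimes off by one


def exact_match_critic(prompt: Prompt, answer: int) -> float:
    a, b = prompt
    return 1.0 if answer == a + b else 0.0


kept = self_training_round([(2, 3), (10, 5), (7, 7)], noisy_adder, exact_match_critic)
print(kept)  # only prompts whose best candidate passed the critic
```

In this toy the critic can check the arithmetic exactly. A real model critiquing its own outputs has no such oracle, so its mistakes can pass the filter and end up in the next round of training data.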

To get quality, we still need humans.

Companies like Anthropic and OpenAI do generate large volumes of synthetic data to train their models, but they also rely on platforms like Scale AI, Toloka and SuperAnnotate to incorporate human feedback – whether for ranking responses, labeling edge cases, or refining reward models.
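
As a rough illustration of the "ranking responses" step: one common way to use a human ranking is to expand it into pairwise preferences that a reward model can be trained on. This is an assumption for illustration, since the article does not describe any vendor's pipeline, and `ranking_to_pairs` and the record fields are hypothetical names.

```python
from itertools import combinations
from typing import Dict, List


def ranking_to_pairs(prompt: str, ranked_responses: List[str]) -> List[Dict[str, str]]:
    """Expand a human ranking (best response first) into preference pairs.

    Every response is paired with each response ranked below it.
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


# An annotator ranked three synthetic answers from best to worst.
pairs = ranking_to_pairs(
    "Explain model collapse in one sentence.",
    ["answer A (best)", "answer B", "answer C (worst)"],
)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

A single ranking of three responses yields three comparisons, which is part of why ranking is an efficient use of annotator time.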

Let’s look at how human-in-the-loop (HITL) workflows keep synthetic data grounded, useful, and safe – with examples from AI companies using it today.

The Engines of Creation: Methods for Generating Synthetic Data

Before we explore the solution, it's important to understand the tools. Synthetic data generation isn't a one-size-fits-all process. The method chosen depends entirely on the task at hand, from simple tables to complex, photorealistic worlds.
