Since late 2023, AI developers and researchers such as Ilya Sutskever have been talking about the limits of real data. The argument is that we have reached a peak: there isn't enough fresh web data left to keep training ever-larger models, so we need new ways to replenish datasets. Most are betting on synthetic data, that is, data generated by models themselves that can be used at different stages of training. The idea of humans guiding machines goes back further than most think: we traced it from the 1945 Memex to ambient intelligence in our weekly digest, FOD#93: When AI Meant Ambient Intelligence.

It sounds simple: why not just generate more data? But it's not that easy. Synthetic data often lacks accuracy, and training models on low-quality outputs can lead to model collapse – a serious degradation in performance. In short, we’re nowhere near a world where AI can generate training data without oversight.

So what’s the practical path forward?

Today, we’ll explore how AI teams are using human-in-the-loop (HITL) systems to make synthetic data useful and safe. We’ll break down how humans guide and validate the process – and show real-world examples of how this is being implemented today.

In today’s episode, we will cover:

  • What synthetic data is – and why it needs a human touch

  • The Engines of Creation: Methods for Generating Synthetic Data

  • The Ghost in the Machine: The Specter of Model Collapse

  • The Human Lifeline: How HITL Grounds Synthetic Data

  • In the Trenches: How Real Companies Use Synthetic Data + HITL

    • OpenAI’s GPT-4.5

    • Microsoft and Phi-4 training strategy

    • Walmart experience

    • NVIDIA's Cosmos and the Future of Computer Vision

  • Conclusion

  • Sources and further reading

What synthetic data is – and why it needs a human touch

AI generating data for itself – that’s the core idea of synthetic data. Instead of collecting examples from the real world, models produce artificial data that mimics reality: text, images, videos, structured tables. It’s data made by algorithms, for algorithms.

The concept isn’t new. Synthetic data has long been used in robotics and autonomous driving to simulate rare or dangerous edge cases – helping teams test faster and safer. But in 2024, it became urgent. Ilya Sutskever warned we’ve hit peak real data. Elon Musk put it bluntly: “We’ve exhausted basically the cumulative sum of human knowledge in AI training.” The internet can’t feed these models forever.

Why synthetic data matters now:

  • It fills data gaps – especially for rare, risky, or domain-specific scenarios (like plane crashes or rare diseases).

  • It protects privacy – allowing companies to train on realistic-but-fake user data.

  • It cuts costs – no more expensive labeling or slow collection cycles.

  • It can reduce bias – by generating diverse, balanced data under human control.

So why not just generate unlimited synthetic data and call it a day?

Because bad synthetic data leads to model collapse – a loop of self-reinforcing errors. That’s why a new wave of techniques now focuses on controlling synthetic data – making sure it improves models rather than breaking them.

One emerging method is inference-time self-training: models generate outputs, critique them, and retrain on their best answers. It's a closed feedback loop – AI refining itself. But even this isn’t enough.

To get quality, we still need humans.

Companies like Anthropic and OpenAI do generate large volumes of synthetic data to train their models, but they also rely on platforms like Scale AI, Toloka and SuperAnnotate to incorporate human feedback – whether for ranking responses, labeling edge cases, or refining reward models. Toloka CEO Olga Megorskaya breaks down exactly why pure synthetic data always hits a ceiling — and what human co-agency with AI agents looks like from the inside. → Interview

Let’s look at how human-in-the-loop (HITL) workflows keep synthetic data grounded, useful, and safe – with examples from AI companies using it today.

The Engines of Creation: Methods for Generating Synthetic Data

Before we explore the solution, it's important to understand the tools. Synthetic data generation isn't a one-size-fits-all process. The method chosen depends entirely on the task at hand, from simple tables to complex, photorealistic worlds.

  • Statistical Methods: This is the classic approach. By analyzing the statistical properties of a real dataset (like its mean, standard deviation, or correlation between columns), we can generate new data that mimics these properties. Using known distributions like Normal or Poisson, these methods are excellent for creating simple, structured tabular data for analytics or testing. However, they require statistical expertise and often fail to capture the complex, non-linear relationships present in more sophisticated data.

  • Generative Adversarial Networks (GANs): GANs introduced a revolutionary "cat-and-mouse" game to data generation. They consist of two neural networks: a Generator that creates fake data and a Discriminator that tries to tell the fake data from real data. The two train in opposition; the Generator gets better at creating realistic fakes, and the Discriminator gets better at spotting them. This adversarial process produces highly realistic data, especially for images, but GANs can be notoriously difficult to train and control.

  • Variational Autoencoders (VAEs): VAEs take a different approach. They first compress real data into a simplified, lower-dimensional representation (a latent space), capturing its most essential features. Then, they use a decoder to reconstruct the data from this compressed space, introducing slight variations along the way. This makes VAEs particularly good for generating diverse new versions of existing data, like creating variations of a product image or exploring different artistic styles.

  • Transformer Models: Today, the powerhouse of synthetic data generation is the transformer architecture, the same technology behind models like GPT and Claude. Transformers excel at understanding sequences and context, making them unparalleled for generating coherent and contextually rich text, code, and even complex structured data. They can write realistic product reviews, generate synthetic conversations for training chatbots, or create complex tabular datasets for financial modeling.
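To make the first bullet concrete, here is a minimal sketch of the statistical approach, assuming a single numeric column that is roughly normally distributed. The column name and values are invented for illustration, and sampling each column independently like this would lose any correlation between columns.

```python
import random
import statistics

def fit_column(values):
    """Estimate the mean and sample standard deviation of one numeric column."""
    return statistics.mean(values), statistics.stdev(values)

def sample_column(mean, stdev, n, rng):
    """Draw n synthetic values from a Normal(mean, stdev) distribution."""
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical "real" column: order values in dollars (made-up numbers).
real_order_values = [23.5, 19.0, 31.2, 27.8, 22.1, 25.4, 29.9, 21.3]

rng = random.Random(0)  # fixed seed so the run is repeatable
mean, stdev = fit_column(real_order_values)
synthetic_order_values = sample_column(mean, stdev, 1000, rng)
```

The synthetic column matches the real one only in the two statistics that were fitted. Anything the Normal assumption does not capture (skew, outliers, relationships to other columns) is absent, which is the limitation the bullet describes.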

The Ghost in the Machine: The Specter of Model Collapse

With these powerful tools, why not just create infinite data and build omniscient AI? The answer lies in a dangerous phenomenon known as model collapse.

Model collapse occurs when an AI model is trained predominantly on its own synthetic outputs. It’s an incestuous feedback loop of self-degradation. The model generates data, is trained on it, and then generates slightly less diverse, more error-prone data in the next cycle. Each generation amplifies the biases and artifacts of the previous one. Over time, the model forgets the richness and unpredictability of reality, and its outputs converge into a bland, repetitive, and often factually incorrect mush. Its understanding of the world starts to "collapse" inward.

Researchers have demonstrated that as models feed on their own tails, they begin to lose information about the true data distribution, eventually leading to a catastrophic decline in performance.
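A toy simulation can illustrate the loop described above. This is a deliberately simplified sketch, not how collapse is measured in real language models: the "model" is just a Gaussian fitted to a handful of samples, and each generation is trained only on samples drawn from the previous generation's fit. The function name and parameters are assumptions made for this example.

```python
import random
import statistics

def simulate_collapse(generations=300, samples_per_generation=5, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the standard deviation of the fitted model at each generation,
    starting with the "real" distribution at index 0.
    """
    rng = random.Random(seed)
    mean, stdev = 0.0, 1.0  # generation 0: the real data distribution
    history = [stdev]
    for _ in range(generations):
        # "Generate" synthetic data from the current model...
        data = [rng.gauss(mean, stdev) for _ in range(samples_per_generation)]
        # ...then "train" the next model on that data alone.
        mean = statistics.fmean(data)
        stdev = statistics.pstdev(data)
        history.append(stdev)
    return history

spread_by_generation = simulate_collapse()
```

Because every fit is made from a small, finite sample, estimation errors compound and the fitted spread tends to shrink toward zero over many generations. In this toy setting that shrinkage plays the role of the lost diversity described above; mixing real data back in at each generation is what breaks the loop.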

The Human Lifeline: How HITL Grounds Synthetic Data

This is where Human-in-the-Loop (HITL) becomes the indispensable solution. Here’s exactly how HITL is used to keep synthetic data in check:

  1. Validating and Curating Synthetic Data
    Using synthetic data blindly is like building a house on a shaky foundation. The first human checkpoint is quality control. A model might generate thousands of synthetic images of a rare manufacturing defect, but some might be physically impossible. It could produce thousands of lines of medical data, but some might contain nonsensical symptoms.

    The HITL workflow here is a continuous loop: generation → human review → correction → curation. Human experts vet the generated datasets, discarding unrealistic examples, correcting factual errors, and flagging subtle artifacts. This ensures that only high-quality, realistic data makes it into the final training set, preventing the model from learning flawed patterns.

  2. Labeling and refining data

    Data annotation is one of the most time-consuming parts of AI development. HITL offers a powerful shortcut. Instead of labeling from scratch, models can perform pre-labeling. For instance, an AI can generate initial bounding boxes on images or suggest sentiment labels for text. Human annotators then simply review and correct these synthetic "guesses." This "human-as-editor" approach drastically speeds up the labeling process while leveraging human expertise for final accuracy. The refined data is then used to fine-tune the model or for knowledge distillation into smaller, more efficient models.

  3. Reinforcement learning from human feedback (RLHF)

    Perhaps the most sophisticated HITL strategy is RLHF, which directly aligns a model's behavior with human preferences. The process is elegant:

    1. A model generates multiple responses to a prompt.

    2. A human evaluator ranks these responses from best to worst based on criteria like helpfulness, accuracy, or harmlessness.

    3. This ranking data is used to train a separate "reward model," which learns to predict which outputs humans will prefer.

    4. The original AI is then fine-tuned using this reward model, reinforcing it for generating high-scoring answers.

    This feedback loop directly teaches the model what "good" looks like from a human perspective. It’s the reason why models like ChatGPT and Claude can conduct nuanced, helpful conversations instead of just predicting the next word. A simple example is when you tell a chatbot, "You're wrong, correct that answer." You are providing direct feedback that, in aggregate, helps shape its future behavior.
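Steps 2 and 3 of the RLHF process can be sketched in a few lines. This is a minimal illustration, assuming the common Bradley-Terry formulation, in which the probability that one response is preferred over another depends on the difference of their scores. A production reward model is a neural network scoring arbitrary text; here each response simply gets one learned number. The response names and rankings are invented.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranking):
    """Turn one best-to-worst ranking into (preferred, rejected) pairs."""
    return [(ranking[i], ranking[j])
            for i, j in combinations(range(len(ranking)), 2)]

def fit_reward_scores(rankings, steps=500, learning_rate=0.1):
    """Fit one scalar score per response by gradient ascent on the
    Bradley-Terry log-likelihood of the observed preferences."""
    pairs = [pair for ranking in rankings for pair in ranking_to_pairs(ranking)]
    scores = {item: 0.0 for ranking in rankings for item in ranking}
    for _ in range(steps):
        grads = {item: 0.0 for item in scores}
        for preferred, rejected in pairs:
            # Model's current probability that the preferred response wins.
            p_win = 1.0 / (1.0 + math.exp(scores[rejected] - scores[preferred]))
            grads[preferred] += 1.0 - p_win
            grads[rejected] -= 1.0 - p_win
        for item in scores:
            scores[item] += learning_rate * grads[item] / len(pairs)
    return scores

# Three hypothetical evaluators rank the same three responses, best first.
rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
]
reward_scores = fit_reward_scores(rankings)
```

The evaluators disagree, yet the fitted scores still order the responses A, then B, then C, because the model aggregates all pairwise preferences. Step 4, fine-tuning the original model against these scores, is not shown here.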

All of these approaches work as a feedback loop: the AI generates new data or answers, and humans guide it by weeding out bad synthetic data, injecting real-world context, and aligning the AI’s direction with domain knowledge and ethical norms.
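The generation → human review → correction → curation loop from the first approach can be sketched as a small pipeline. Everything here is hypothetical: the record type, the reviewer interface, and the hard-coded decisions, which stand in for what a human expert would enter in an annotation tool.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional

@dataclass(frozen=True)
class SyntheticRecord:
    text: str
    label: str

# A reviewer returns the record (possibly corrected) to keep it,
# or None to discard it.
Reviewer = Callable[[SyntheticRecord], Optional[SyntheticRecord]]

def curate(generated: List[SyntheticRecord], review: Reviewer) -> List[SyntheticRecord]:
    """Keep only the records the reviewer accepts or corrects."""
    curated = []
    for record in generated:
        verdict = review(record)
        if verdict is not None:
            curated.append(verdict)
    return curated

# Hypothetical model output: synthetic product reviews with sentiment labels.
generated = [
    SyntheticRecord("Arrived on time and works great.", "positive"),
    SyntheticRecord("The blender stopped working after two days.", "positive"),
    SyntheticRecord("Product product product product.", "neutral"),
]

# Stand-in for human decisions, keyed by record text.
human_decisions = {
    "Arrived on time and works great.": ("accept", None),
    "The blender stopped working after two days.": ("correct", "negative"),
    "Product product product product.": ("discard", None),
}

def example_reviewer(record: SyntheticRecord) -> Optional[SyntheticRecord]:
    action, new_label = human_decisions[record.text]
    if action == "discard":
        return None
    if action == "correct":
        return replace(record, label=new_label)
    return record

training_set = curate(generated, example_reviewer)
```

Only two of the three records survive, and one of them carries a corrected label. The same shape fits pre-labeling: the model's labels are the starting point, and the human acts as editor.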

In the Trenches: How Real Companies Use Synthetic Data + HITL

The combination of synthetic data and HITL has been applied in a variety of ways across the AI landscape. Here are some of the most interesting examples:

OpenAI’s GPT-4.5

In February 2025, OpenAI released GPT-4.5 and quietly addressed a problem that had haunted its predecessor: sycophancy. GPT-4 was criticized for being too agreeable, echoing user opinions rather than offering truthful or balanced answers. The issue stemmed from reward models trained on biased human feedback, where agreement was often rated higher than accuracy.

GPT-4.5 took a different path. It was trained on synthetic data generated by smaller models—then refined through human-in-the-loop (HITL) feedback. Human reviewers ranked, corrected, and debated model outputs to better train the reward models. This new pipeline prioritized clarity, nuance, and robustness over flattery.

By combining AI-generated examples with more carefully structured human oversight, GPT-4.5 became sharper, more steerable, and less likely to just tell users what they wanted to hear (though it’s still very kind to me!)

Image Credit: OpenAI’s “Introducing GPT-4.5” blog post

Microsoft and Phi-4 training strategy

Microsoft's Phi-4, a powerful 14-billion-parameter small language model, is a real-world masterclass in data curation. The team’s strategy prioritized data quality over sheer model size, with synthetic data at its very core. Over 50 meticulously crafted synthetic datasets were used across pre-training, mid-training, and post-training alignment. These datasets, generated via multi-agent prompting and self-revision workflows, were designed to teach skills like Chain-of-Thought (CoT) reasoning. For example, a complex math solution might be synthetically rewritten step-by-step to teach the model the process of reasoning, not just the answer.

However, Phi-4’s training shows that even the best synthetic data needs to be grounded in reality. The final pre-training mixture included highly filtered web data, code, and academic papers, both as direct training material and as seeds for synthetic generation:

Image Credit: Phi-4 Technical Report

The HITL component was most prominent in post-training alignment using a specialized form of Direct Preference Optimization (DPO):

  • Pivotal Token Search (PTS): Researchers identified specific tokens (words or parts of words) in a response that determine its success or failure. They then synthetically created DPO pairs—one response with a "good" pivotal token and one with a "bad" one—and trained Phi-4 to prefer the good version. This is human-led micro-surgery on the model's outputs.

  • GPT-4o-Judged Comparisons: To scale up preference data, Microsoft used GPT-4o to judge and label pairs of outputs from different models. This AI-driven feedback, guided by human-defined criteria, created hundreds of thousands of synthetic preference pairs to fine-tune Phi-4.
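The idea behind pivotal tokens can be illustrated with a simplified sketch. It assumes we already have, for each prefix of a response, an estimate of the probability that the final answer will be correct (in practice such estimates would come from sampling many completions per prefix). The function name, the threshold, and the numbers are assumptions for illustration, not the procedure from the Phi-4 report.

```python
def find_pivotal_tokens(tokens, success_prob, threshold=0.2):
    """Return (index, token, change) for tokens that shift the estimated
    success probability by at least `threshold`.

    success_prob[i] is the estimated probability of a correct final answer
    given tokens[:i], so it has len(tokens) + 1 entries.
    """
    if len(success_prob) != len(tokens) + 1:
        raise ValueError("need one probability per prefix, including the empty one")
    pivotal = []
    for i, token in enumerate(tokens):
        change = success_prob[i + 1] - success_prob[i]
        if abs(change) >= threshold:
            pivotal.append((i, token, change))
    return pivotal

# Made-up example: one token sends the solution down the wrong path.
tokens = ["The", "answer", "is", "negative", "because"]
success_prob = [0.55, 0.55, 0.56, 0.54, 0.15, 0.14]

pivotal_tokens = find_pivotal_tokens(tokens, success_prob)
```

In this example only the token "negative" is flagged, with a large drop in success probability. A preference pair would then share the prefix up to that point and differ in the pivotal token, with the better-scoring continuation as the preferred side.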

Crucially, synthetic data was also used to teach Phi-4 when to say "I don't know." The team generated examples for unanswerable questions where the correct synthetic response was a polite refusal. This directly combats hallucinations, a key weakness of many models. The results were stunning: Phi-4 outperformed much larger models like GPT-4o on several reasoning and coding benchmarks, proving that a human-curated, synthetic-heavy data strategy can be incredibly effective.

Walmart experience

In the business world, HITL and synthetic data are solving practical problems. Retail giant Walmart, for instance, has explored using synthetic customer behavior data to train its recommendation engines. Researchers created simulated shopping sessions – sequences of user actions like viewing a product or adding it to a cart – to train their Triple Modality Fusion (TMF) model.

While the sequences were synthetic, the recommendation targets (the "next item to buy") were chosen by human experts to reflect logical, real-world shopping patterns. This human-guided approach allows Walmart to test and train its models on a vast range of shopping scenarios without using real customer data, protecting privacy while generating insightful, commercially valuable predictions.

NVIDIA's Cosmos and the Future of Computer Vision

We’ve explored how NVIDIA’s Cosmos models work in detail in our articles about World Models. Here we look at one piece of that family, the Cosmos Transfer model, which uses ground-truth simulation from NVIDIA Omniverse (a platform for real-time 3D simulation), as a synthetic data + HITL use case.

The Cosmos Transfer model takes inputs like segmentation maps, depth maps, and trajectories from a simulation and turns them into high-quality, photorealistic video frames and sensor readings. Developers can generate diverse camera views, LiDAR sweeps, and scenarios (like different weather or traffic conditions) artificially. Such synthetic data is particularly useful for "post-training," where existing models are further trained or fine-tuned on scenarios that are challenging to capture in the real world, like rare or dangerous situations.

Image Credit: NVIDIA Cosmos blog post

Human engineers and domain experts stay in the loop by controlling and curating the simulation scenarios. NVIDIA provides Blueprints in Omniverse that allow developers to vary parameters like weather, lighting, and actor behaviors in the synthetic environment. A human designer might specify, for example, a night-time rain scenario and generate many variations of it. This expert input ensures the synthetic dataset covers relevant corner cases and is realistic. After the data is generated, people such as autonomous driving testers or robotics engineers also evaluate model performance on these synthetic scenarios, effectively validating whether the robot or car behaves correctly in simulation.
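The scenario-variation step can be pictured as a parameter grid that a human designer defines and then filters. This is a generic sketch in plain Python, not the Omniverse or Cosmos API. The parameter names and values are assumptions for illustration.

```python
from itertools import product

def build_scenarios(parameter_options):
    """Expand a dict of parameter -> allowed values into every combination."""
    names = list(parameter_options)
    value_lists = [parameter_options[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# Hypothetical choices made by a human scenario designer.
parameter_options = {
    "weather": ["clear", "rain", "fog"],
    "lighting": ["day", "dusk", "night"],
    "actor_behavior": ["normal", "jaywalking"],
}

scenarios = build_scenarios(parameter_options)

# The designer narrows the grid to the corner case of interest.
night_rain_scenarios = [
    s for s in scenarios
    if s["weather"] == "rain" and s["lighting"] == "night"
]
```

The human contribution sits in the two places code cannot fill in: deciding which parameters and values are worth covering, and judging whether the rendered results look realistic.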

Conclusion: The New Role of the Human Data Expert

The landscape of AI development in 2024 and beyond is defined by a new synergy. With the well of high-quality real-world data running dry, relying solely on human-collected datasets is no longer viable. We’ve entered an era where synthetic data is essential for building more advanced, capable, and safe AI systems.

But as the examples from Microsoft, NVIDIA, and others clearly demonstrate, AI alone isn’t the answer. The risk of model collapse is real, and the path to more capable AI is paved with human judgment. Humans remain the most critical part of the workflow – not as data collectors, but as curators, directors, and ethicists. They set the goals, validate the outputs, and provide the crucial grounding in reality that prevents algorithms from drifting into a useless, self-reinforcing fiction.

As a result, dataset developers are increasingly stepping into leadership and management roles, shaping not just the data but the entire direction of AI development. It is further confirmation that the future of AI is not fully automated. It is collaborative: human intelligence guides machine-scale generation to unlock new frontiers of possibility.

Sources and further reading

Resources from Turing Post
