Gartner predicts that by 2030 synthetic data will completely overshadow real data in AI models. The synthetic data market is also growing quickly: Cognilytica projects that it will expand from $110 million in 2021 to $1.15 billion by 2027. For a curated list of key research, see 9 papers on synthetic data generation for AI.
What is synthetic data, and how do you work with it? Why is it in such demand?
In this Token, we will answer these questions and explore the following:
The origins of synthetic data
Why might you need it?
How to generate synthetic data?
Considerations in creating a realistic synthetic dataset
Techniques for evaluating synthetic data quality
Use cases
Can I switch to synthetic data completely?
Conclusion
Bonus: A few companies that generate synthetic data
What Is Synthetic Data? Origins and Definition
You might have heard of the Caesar cipher, a cryptographic technique used by Julius Caesar. It is one of the earliest and simplest methods for encrypting messages so that only authorized parties can read them. Synthetic data was originally used for a related purpose: to conceal personally identifiable information (PII) within datasets. As the name suggests, synthetic data is artificially generated to mimic real-world data, using algorithms, simulations, or predefined rules. Protecting privacy and confidentiality was its original purpose, but today synthetic data is primarily used in:
research
training machine learning algorithms
data analysis
testing software products
Why might you need synthetic data?
The growing popularity of foundation models (FMs), and specifically large language models (LLMs), has intensified the demand for synthetic data. As we covered in the previous Token, these enormous models are data-hungry and require huge volumes of training data. At some point, there might not be enough real data to satisfy their appetite.
Other reasons in favor of using synthetic data for FMs and regular ML models:
Bias: Datasets in ML are seldom balanced or unbiased. For instance, if you run an online survey of users in remote parts of South Asia, most of the respondents will likely be male. If you train on that dataset as-is, your model will inherit the bias. In such cases, synthetic data can supply additional responses for the under-represented groups, producing a more balanced dataset that will likely reduce bias.
Cost and Time: Imagine owning a bank and needing to test the functionality for registering new users. To test it with real users, you would have to wait days or even months to acquire 1,000 new customers. However, using synthetic data could allow you to conclude the test within hours or a few days.
Testing and Validation: In developing ML models, especially FMs, it's crucial to test them under various scenarios, many of which might be rare or difficult to capture in real-world data. Synthetic data can be tailored to simulate these rare conditions, allowing for thorough testing and validation of the models.
Innovation and Experimentation: Synthetic data allows for greater flexibility in model development. Researchers can create data with specific attributes or conditions that may not be readily available in real datasets, thus pushing the boundaries of what's possible in ML research and application.
Regulations and Privacy: Government regulations dictate that in sensitive sectors, such as medicine and healthcare, providers must keep information private. Hence, before using the data for research, it is important to mask any pieces of data that could be linked to specific individuals, for example by replacing them with synthetic values.
Now that we know about the benefits of synthetic data, let’s talk about how we can generate it.
How to generate synthetic data?
There are many ways to generate synthetic data, and the underlying task dictates the method. Here are some commonly used ones:
Simulation
This approach involves creating artificial data that mimics real-world scenarios using predefined rules or algorithms. Let’s say you want to create a synthetic dataset for a bank. Using the original data provided by the bank, you have identified some patterns like:
Young unemployed customers have less than $1,000 in their bank accounts.
Young employed customers have, on average, $5,000 in their accounts and regularly use credit cards.
Old customers have $20,000+ in their bank accounts and do not use credit.
Now, you could synthesize a new dataset by converting the above patterns into rules for a computer program.
One of the major advantages of using simulated data is that we have fine-grained control over the generated data.
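As a rough illustration, the three patterns above could be turned into rules along the following lines. This is only a sketch: the age bands, the spread around the $5,000 average, the shape of the balance distribution for older customers, and the function name `synth_customer` are all assumptions made for the example, not values taken from any real bank.

```python
import random

def synth_customer(rng: random.Random) -> dict:
    """Draw one synthetic customer from three hand-written segment rules."""
    segment = rng.choice(["young_unemployed", "young_employed", "old"])
    if segment == "young_unemployed":
        age = rng.randint(18, 30)                   # assumed age band
        balance = round(rng.uniform(0, 999), 2)     # rule: less than $1,000
        uses_credit_card = False                    # assumption: the rule does not say
    elif segment == "young_employed":
        age = rng.randint(18, 30)                   # assumed age band
        # rule: about $5,000 on average (the 1,500 spread is an assumption)
        balance = round(max(0.0, rng.gauss(5000, 1500)), 2)
        uses_credit_card = True                     # rule: regular credit card use
    else:
        age = rng.randint(60, 90)                   # assumed age band
        # rule: $20,000 or more (the long tail above it is an assumption)
        balance = round(20_000 + rng.expovariate(1 / 15_000), 2)
        uses_credit_card = False                    # rule: no credit
    return {
        "segment": segment,
        "age": age,
        "balance": balance,
        "uses_credit_card": uses_credit_card,
    }

rng = random.Random(42)  # fixed seed so the dataset is reproducible
customers = [synth_customer(rng) for _ in range(3000)]
print(customers[0])
```

Because every rule is explicit, changing a threshold or adding a new segment is a one-line edit, which is the fine-grained control mentioned above.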
Modeling
Today, 3D modeling software has become ubiquitous and advanced, and it can generate realistic views of our world. For example, we can use data from Google Street View to train vision models. We could also create a 3D scene in software like Blender and use it to train vision models. Having a map of the 3D world helps us capture images of objects from multiple perspectives.

Complex 3D city scene in Blender (credits: creativebloq)
Foundation models
Advanced models like ChatGPT and Google Bard can generate realistic data. If you want to train a model that generates stories, you can prompt these FMs to write stories and then use them to fine-tune a storytelling model or train a new one from scratch. The same technique can be applied to vision models using images generated by DALL-E or Stable Diffusion.
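The workflow can be sketched in a few lines. Everything here is hypothetical scaffolding: `fake_llm` is a placeholder that stands in for a real model call, the `prompt`/`completion` record layout is just one common way to store fine-tuning pairs, and the topics are made up. To use it, you would swap in your provider's client for the `llm` argument.

```python
import json
from typing import Callable, Iterable

def build_prompt(topic: str, max_age: int, words: int) -> str:
    """Spell out audience and length so the model has less to guess."""
    return f"Generate a {words}-word story about {topic} for kids aged {max_age} or below."

def make_dataset(llm: Callable[[str], str], topics: Iterable[str],
                 max_age: int = 5, words: int = 500) -> list:
    """Collect one prompt/completion pair per topic."""
    records = []
    for topic in topics:
        prompt = build_prompt(topic, max_age, words)
        records.append({"prompt": prompt, "completion": llm(prompt)})
    return records

def fake_llm(prompt: str) -> str:
    """Placeholder for a real model call (hypothetical, returns canned text)."""
    return f"[story generated for: {prompt}]"

dataset = make_dataset(fake_llm, ["a lost kitten", "a rainy day"])
jsonl = "\n".join(json.dumps(record) for record in dataset)  # one record per line
print(jsonl)
```

Keeping the model call behind a plain function argument makes it easy to test the pipeline offline before spending on real generations.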
How to Create High-Quality Synthetic Data
To create synthetic data that truly replicates the intricacies of real data, you need to consider the following:
Domain Knowledge: Understanding the domain is crucial. For instance, suppose you work for a research organization developing vaccines, specifically for a disease related to chickenpox. A key feature is the number of times an individual was infected with chickenpox. When generating synthetic data, it's important to note that, except in rare cases, individuals should not be depicted as having had chickenpox more than once. This is because people who have had chickenpox typically develop immunity to the disease for the rest of their lives.
Properties of real data: The synthetic dataset should closely mimic the distribution and statistical properties of the real dataset. Otherwise, any analysis or model created using the synthetic dataset will likely fail when faced with the real data.
Relationships among variables: In real data, some features are usually correlated. A very common example is the relationship between an individual's age and their income. It's essential to ensure that correlated features in the real dataset remain correlated in the synthetic dataset.
Prompt Engineering: Today, language models and vision models are powerful enough to generate realistic data. However, your prompt should be specific to your use case. If you want to generate stories for kids, your prompt to the model should say so. Thus, “Generate a 500-word story for kids aged 5 or below” is a far better prompt than “Generate a story for kids”. The former prompt will likely produce an age-appropriate story.
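The point about relationships among variables is easy to check numerically. The sketch below uses made-up age and income numbers as a stand-in for real data, then compares two toy generators: one that resamples each column independently, and one that fits a straight line plus noise. Both generators and all constants are assumptions chosen for illustration, not a recommended method.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
n = 2000

# Stand-in for "real" data: income rises with age, plus noise (made-up numbers).
real_age = [rng.uniform(20, 65) for _ in range(n)]
real_income = [20_000 + 800 * a + rng.gauss(0, 8_000) for a in real_age]

# Naive generator: resample each column on its own.
# Each column keeps its distribution, but the link between them is lost.
naive_age = rng.choices(real_age, k=n)
naive_income = rng.choices(real_income, k=n)

# Joint generator: fit income = intercept + slope * age, then add residual noise.
ma, mi = mean(real_age), mean(real_income)
slope = (sum((a - ma) * (i - mi) for a, i in zip(real_age, real_income))
         / sum((a - ma) ** 2 for a in real_age))
intercept = mi - slope * ma
resid_sd = (sum((i - (intercept + slope * a)) ** 2
                for a, i in zip(real_age, real_income)) / n) ** 0.5
joint_age = rng.choices(real_age, k=n)
joint_income = [intercept + slope * a + rng.gauss(0, resid_sd) for a in joint_age]

r_real = pearson(real_age, real_income)
r_naive = pearson(naive_age, naive_income)
r_joint = pearson(joint_age, joint_income)
print(f"real={r_real:.2f} naive={r_naive:.2f} joint={r_joint:.2f}")
```

The naive generator reproduces each column's distribution yet drives the correlation toward zero, which is exactly the failure mode to watch for.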
Techniques for evaluating synthetic data quality
To determine whether we have the right data (synthetic data, in this case), we need to quantify its usefulness. Fortunately, there are several metrics available for this purpose. Let's discuss them.
Comparison of statistical properties: We can compare statistical properties of the synthetic data, such as the mean and standard deviation, with those of the real dataset, and measure the distance between the two distributions with metrics such as the earth mover's distance. If these are close, the synthetic data is more likely to be a good substitute for the real dataset.
GAN-train: This technique is mostly used for vision models. A classification model is trained on synthetic data, and its performance is then evaluated on real-world data. The resulting score indicates how far apart the generated and real samples are.
Human testing: This technique is similar to the Turing test. A random set consisting of both real and synthetic data is provided to human participants and the participants are asked to label each piece of data as either real or synthetic. If humans cannot distinguish between real and synthetic data, the synthetic data can be used as a substitute for real data.
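The first two checks can be sketched on a toy one-dimensional problem. The data here is made up (two Gaussian classes), the "classifier" is a deliberately simple threshold, and the helper names `emd_1d`, `fit_threshold`, and `evaluate` are hypothetical. The earth mover's distance uses the shortcut that holds for two equal-sized 1-D samples: sort both and average the absolute differences.

```python
import random
from statistics import mean, stdev

def make_data(rng, mu0, mu1, n_per_class):
    """Labelled 1-D samples: class 0 ~ N(mu0, 1), class 1 ~ N(mu1, 1)."""
    return ([(rng.gauss(mu0, 1), 0) for _ in range(n_per_class)]
            + [(rng.gauss(mu1, 1), 1) for _ in range(n_per_class)])

def emd_1d(a, b):
    """Earth mover's distance between two equal-sized 1-D samples."""
    if len(a) != len(b):
        raise ValueError("this shortcut needs equal sample sizes")
    return mean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))

def fit_threshold(data):
    """Toy classifier: predict class 1 when x exceeds the midpoint of the class means."""
    m0 = mean(x for x, y in data if y == 0)
    m1 = mean(x for x, y in data if y == 1)
    return (m0 + m1) / 2

def accuracy(threshold, data):
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

def evaluate(synth, real):
    sx, rx = [x for x, _ in synth], [x for x, _ in real]
    return {
        "mean_gap": abs(mean(sx) - mean(rx)),
        "std_gap": abs(stdev(sx) - stdev(rx)),
        "emd": emd_1d(sx, rx),
        # train on synthetic, test on real
        "tstr_acc": accuracy(fit_threshold(synth), real),
    }

rng = random.Random(1)
real = make_data(rng, 0, 3, 1000)
good = make_data(rng, 0, 3, 1000)    # generator that matches the real process
bad = make_data(rng, -4, 1, 1000)    # generator that has drifted

report = {"good": evaluate(good, real), "bad": evaluate(bad, real)}
for name, scores in report.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this toy setup the drifted generator shows both a large distance to the real distribution and a classifier that barely beats chance on real data, while the faithful one scores well on both checks.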
Synthetic Data Use Cases: NLP, Computer Vision, and More
Synthetic data in Natural Language Processing (NLP)
Synthetic data is widely used in NLP these days. In the paper "GPoeT: a Language Model Trained for Rhyme Generation on Synthetic Data", a GPT-2 model fine-tuned on only 6 MB of synthetic data generated poems with rhyming lines 60% of the time, whereas the same model fine-tuned on 142 MB of real data produced rhyming lines only 11% of the time.
In another experiment, researchers at Google DeepMind found that a synthetic data intervention can make language models more robust to sycophancy.
Synthetic data in Computer Vision
Tesla is using synthetic data from a simulated 3D gaming environment to train its self-driving agent.

3D simulation for self-driving cars at Tesla
The paper "Procedural Image Programs for Representation Learning" proposes using a large collection of procedural image programs (shaders) for image representation learning, bypassing the need for real images. Training with these shaders outperforms existing methods that do not use real data and is competitive with training on natural images, especially in specialized tasks.
An experiment conducted by researchers from IBM, Georgia Tech, MIT, and other institutions has shown that synthetic data can help existing models understand concepts beyond object nouns, such as attributes and relationships.
Synthetic data is already empowering state-of-the-art AI models and the demand will only grow.
Can I switch to synthetic data completely?
Since a lot of companies and research institutes are benefiting from synthetic data, should you use it as well? The answer is "it depends": it depends on the task you want to accomplish.
ML models operate on a Garbage-In, Garbage-Out (GIGO) basis, meaning output quality relies heavily on input data quality. Thus, high-quality data, real or synthetic, is vital. Here are a few tips for using synthetic data effectively:
Models like ChatGPT still make logical errors: For tasks involving human logic, double-check the data (from ChatGPT or other LLMs) before using it to fine-tune or train your own model.
3D modeling software generates images whose textures differ vastly from those of real-life objects: After training a model on data from such software, it's better to fine-tune it with real-life data.
Synthetic data may not contain all the representative features found in real life: Research conducted by scientists from Vanderbilt University has shown that synthetic survey data generated by ChatGPT has less variation in responses than real survey data.
Unless you are sure that the synthetic data has all the properties of a real dataset, relying solely on it for training is unwise. A common practice is to train initially with synthetic data and then fine-tune with real data. This approach works well for many applications.
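The two-stage recipe can be illustrated with a deliberately tiny model. In this sketch the "model" is a single slope fitted by gradient descent, the simulator is assumed to be slightly wrong (slope 2.0 instead of the true 2.5), and only 20 real examples are available. All numbers are made up, and the example only shows the mechanics of the workflow, not how any particular team trains its models.

```python
import random

def sgd(w, data, lr, epochs):
    """Fit y = w * x by plain gradient descent on mean squared error."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def make(rng, slope, n):
    """n noisy points on the line y = slope * x, with x in [0, 1]."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        points.append((x, slope * x + rng.gauss(0, 0.1)))
    return points

rng = random.Random(0)
synthetic = make(rng, 2.0, 5000)   # plentiful, but the simulator is a bit off
real_train = make(rng, 2.5, 20)    # scarce real data
real_test = make(rng, 2.5, 1000)

w_pre = sgd(0.0, synthetic, lr=0.3, epochs=200)     # stage 1: synthetic only
w_ft = sgd(w_pre, real_train, lr=0.3, epochs=3)     # stage 2: short fine-tune on real
w_scratch = sgd(0.0, real_train, lr=0.3, epochs=3)  # same real budget, no pretraining

for name, w in [("synthetic only", w_pre), ("pretrain + fine-tune", w_ft),
                ("real only, same budget", w_scratch)]:
    print(f"{name}: w={w:.2f} test_mse={mse(w, real_test):.3f}")
```

With the same small budget of real-data updates, the pretrained model starts closer to the target and ends with lower test error than either the synthetic-only model or the one trained from scratch.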
Synthetic Data vs Real Data: When to Use Each
In this article, we discussed synthetic data, its significance, techniques for generating and evaluating synthetic data, uses of synthetic data in vision and language models, and whether or not we can train a model using only synthetic data.
Synthetic data helps us reduce bias, save time and cost, protect privacy, and comply with regulations. It can be generated using simulation, modeling software, and AI models. With the advent of performant models like ChatGPT and complex simulation software, it is easy to generate synthetic data, but we must evaluate the quality of the generated data using statistical properties, human evaluation, classifier-based tests such as those used for GANs, or other techniques.
Leading companies and research institutes have already started using synthetic data to train state-of-the-art models, but we must be careful and rely on it alone only if it has all the properties of real data. Otherwise, it is best to limit synthetic data to initial training and then fine-tune the model with real datasets.
If you're looking for tools to get started, see our overview of 10 synthetic data generation companies for ML training.
Bonus: A few companies that generate synthetic data
For a practitioner's perspective on synthetic data in robotics, see our interview: