Token 1.14: What is Synthetic Data and How to Work with it?

Will it eliminate the need for real data? Let's explore

Gartner predicts that by 2030 synthetic data will completely overshadow real data in AI models. The market is growing accordingly: Cognilytica forecasts that it will expand from $110 million in 2021 to $1.15 billion by 2027.

What is synthetic data, and how do you work with it? What makes it so in demand?

In this Token, we will answer these questions and explore the following:

  • The origins of synthetic data

  • Why might you need it?

  • How to generate synthetic data?

  • Considerations in creating a realistic synthetic dataset

  • Techniques for evaluating synthetic data quality

  • Use cases

  • Can I switch to synthetic data completely?

  • Conclusion

  • Bonus: A few companies that generate synthetic data

The origins

You might have heard of the Caesar Cipher – a cryptographic technique used by Julius Caesar. It's one of the earliest and simplest methods for encrypting messages, ensuring that only authorized parties can decode them. Similarly, synthetic data was originally used for a related purpose: to conceal personally identifiable information (PII) within datasets. As the name suggests, synthetic data is artificially generated to mimic real-world data. It's created using algorithms, simulations, or predefined rules. Its main purpose is to protect privacy and confidentiality. Today, synthetic data is primarily used in:

  • research

  • training machine learning algorithms

  • data analysis

  • testing software products

Why might you need synthetic data?

The growing popularity of foundation models (FMs), and of large language models (LLMs) in particular, has intensified the demand for synthetic data. As we covered in the previous Token, these enormous models are data-hungry and require a huge volume of data for training. At some point, there might not be enough real data to satisfy their appetite.

Other reasons in favor of using synthetic data for FMs and regular ML models:

  • Bias: Datasets in ML are seldom balanced or unbiased. For instance, if you run an online survey among users in remote parts of South Asia, most of the respondents will likely be male. A model trained on that dataset will inherit the imbalance. In such cases, synthetic data can supply additional responses for the under-represented classes, producing a more balanced dataset that is likely to reduce bias.

  • Cost and Time: Imagine you run a bank and need to test the registration flow for new users. To test it with real users, you would have to wait days or even months to acquire 1,000 new customers. With synthetic data, you could finish the test within hours or a few days.

  • Testing and Validation: In developing ML models, especially FMs, it's crucial to test them under various scenarios, many of which might be rare or difficult to capture in real-world data. Synthetic data can be tailored to simulate these rare conditions, allowing for thorough testing and validation of the models.

  • Innovation and Experimentation: Synthetic data allows for greater flexibility in model development. Researchers can create data with specific attributes or conditions that may not be readily available in real datasets, thus pushing the boundaries of what's possible in ML research and application.

  • Regulations and Privacy: In sensitive sectors such as medicine and healthcare, government regulations require providers to keep personal information private. Before such data can be used for research, anything that could be linked to a specific individual must be masked. Synthetic data that mimics the real records offers a way to do that research without exposing them.
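
Two of the reasons above, "Cost and Time" and "Bias", can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not a production generator: the function names `make_synthetic_users` and `synthesize_minority`, the record fields, and the toy survey are all hypothetical. The first function builds fake registration records from predefined rules. The second balances a dataset by assembling new minority-class rows from values seen in real rows of that class, a naive approach that ignores correlations between fields.

```python
import random
import string
from collections import Counter


def make_synthetic_users(n, seed=0):
    """Generate n fake registration records from simple predefined rules."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    users = []
    for i in range(n):
        name = "".join(rng.choices(string.ascii_lowercase, k=8))
        users.append({
            "user_id": i,
            "email": f"{name}@example.com",  # example.com is reserved for testing
            "age": rng.randint(18, 90),
            "opening_deposit": round(rng.uniform(0, 10_000), 2),
        })
    return users


def synthesize_minority(rows, label_key, seed=0):
    """Return rows plus synthetic rows so every class reaches the largest class size.

    Assumption: all rows share the same keys. Each field of a synthetic row is
    drawn independently from a real row of the same class, so correlations
    between fields are NOT preserved.
    """
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())

    balanced = list(rows)
    for group in by_label.values():
        for _ in range(target - len(group)):
            balanced.append({key: rng.choice(group)[key] for key in group[0]})
    return balanced


# 1,000 fake customers for a registration test, available immediately.
users = make_synthetic_users(1000)

# Toy survey with 90 male and 10 female respondents (made-up data).
survey = (
    [{"gender": "male", "age": 30 + i % 20, "answer": i % 3} for i in range(90)]
    + [{"gender": "female", "age": 25 + i, "answer": i % 2} for i in range(10)]
)
balanced = synthesize_minority(survey, "gender")
print(Counter(row["gender"] for row in balanced))
```

Sampling each field independently keeps the sketch short, but it can produce unrealistic combinations. Whether that is acceptable depends on the task the data is meant for.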

Now that we know about the benefits of synthetic data, let’s talk about how we can generate it.

How to generate synthetic data?

There are many ways to generate synthetic data, and the underlying task dictates the method. Here are some commonly used ones:

The following explanation is available to our Premium users only → please Upgrade to have full access to this and other articles
