Conversations about using synthetic data in AI are becoming more popular every day. What is synthetic data? It is artificial data created with algorithms and generative techniques rather than collected from real-life environments, processes, or events; it only mimics real-world data.
Synthetic data can be generated quickly and at scale, and it is often more cost-effective than real data. It also lets researchers simulate rare scenarios that are hard to capture in the wild. Many researchers use synthetic data to train, test, and validate LLMs and other AI systems.
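To make the idea concrete, here is a minimal sketch of algorithmically generated data that mimics a real-world distribution and oversamples a rare scenario. The transaction setting, distribution parameters, and fraud rate are all illustrative assumptions, not drawn from any of the papers below.

```python
import numpy as np

def generate_synthetic_transactions(n, fraud_rate=0.01, seed=0):
    """Generate a toy synthetic dataset of payment transactions.

    Amounts are drawn from a log-normal distribution (a common shape for
    real transaction data), and a small fraction of rows are labeled as a
    rare "fraud" scenario that may be underrepresented in real logs.
    """
    rng = np.random.default_rng(seed)
    amounts = rng.lognormal(mean=3.0, sigma=1.0, size=n)  # skewed, positive
    is_fraud = rng.random(n) < fraud_rate                 # rare-event flag
    # In this toy model, fraudulent transactions skew larger.
    amounts[is_fraud] *= rng.uniform(5, 20, size=is_fraud.sum())
    return amounts, is_fraud

amounts, is_fraud = generate_synthetic_transactions(10_000)
print(f"{len(amounts)} rows, {int(is_fraud.sum())} rare-event rows")
```

Because the generator is parameterized, the rare-event rate can be dialed up arbitrarily, which is exactly what real-world logs cannot offer.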
Here is a list of notable research papers about synthetic data generation in AI:
“Comprehensive Exploration of Synthetic Data Generation: A Survey” reviews 417 Synthetic Data Generation (SDG) models from the past decade, describing their types and functionalities. This paper can serve as a guide for SDG model selection. → Read more
“Best Practices and Lessons Learned on Synthetic Data for Language Models” provides an overview of synthetic data research, challenges, and applications. It also discusses the potential future use of synthetic data, highlighting the need for responsible use to empower LLMs. → Read more
“Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe” explains how to create synthetic data with formal privacy guarantees, such as differential privacy (DP). It demonstrates that fine-tuning a pre-trained generative LLM with DP can generate high-quality synthetic text with strong privacy protection. → Read more
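For readers new to DP, the kind of formal guarantee the paper relies on can be illustrated with the classic Laplace mechanism on a count query. This is not the paper's method (which fine-tunes an LLM with DP-SGD); it is only a minimal sketch of what an epsilon-DP release looks like.

```python
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0, seed=None):
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    Noise scale = sensitivity / epsilon: a smaller privacy budget
    (epsilon) means more noise and stronger protection for any one record.
    """
    rng = np.random.default_rng(seed)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# The same query under two privacy budgets; smaller epsilon -> noisier answer.
print(dp_count(1000, epsilon=1.0, seed=42))
print(dp_count(1000, epsilon=0.1, seed=42))
```

DP-SGD applies the same noise-for-privacy trade-off to gradient updates during fine-tuning, so text sampled from the resulting model inherits the guarantee.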
“Nemotron-4 340B Technical Report” by NVIDIA introduces the new Nemotron model family and highlights that 98% of the data used in the models' alignment process was synthetically generated. It also provides open access to the synthetic data generation pipeline used for these models. → Read more
“DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows” offers an open-source Python library designed to help researchers implement LLM workflows, which often involve synthetic data generation. → Read more
“Generative AI for Synthetic Data Generation: Methods, Challenges and the Future” explores advanced techniques for generating task-specific training data for LLMs, along with their applications and limitations. → Read more
“Scaling Synthetic Data Creation with 1,000,000,000 Personas” by Tencent AI Lab discusses the use of Persona Hub, a collection of diverse personas, and explains how such personas can be generated. → Read more
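The core trick of persona-driven generation is to pair one data-synthesis task with many different personas so a single prompt template yields diverse outputs. A minimal sketch, with a hypothetical hand-written persona list standing in for Persona Hub's web-derived billion:

```python
# Illustrative personas; Persona Hub derives ~1B of these from web text.
personas = [
    "a retired astronomer who volunteers at a planetarium",
    "a high-school chemistry teacher",
    "a freight-train dispatcher",
]

def persona_prompts(task, personas):
    """Combine one generation task with many personas to diversify outputs."""
    return [f"You are {p}. {task}" for p in personas]

prompts = persona_prompts("Write a challenging math word problem.", personas)
for p in prompts:
    print(p)
```

Each prompt would then be sent to an LLM; varying only the persona shifts vocabulary, setting, and difficulty without hand-writing new task instructions.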
“LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives” by Cohere for AI introduces the concept of active inheritance in data generation, where smaller LLMs learn from data produced by larger models. → Read more
“Token 1.14: What is Synthetic Data and How to Work with It?” explores the origins of synthetic data, discusses how to make it realistic, and explains how to evaluate the quality of such datasets. The article also raises the question of whether a complete switch to synthetic data is feasible. → Read more
