• Turing Post
  • Posts
  • Token 1.13: Where to Get Data for Data-Hungry Foundation Models

Token 1.13: Where to Get Data for Data-Hungry Foundation Models

Explore how data is gathered to train FMs, learn a few data efficient training techniques and the ethics of data

In this Token, we will discuss the data requirements for a foundation model (FM), the effect of bias in datasets, and ways to mitigate it. We'll also explore how data is gathered to train FMs, introduce a few data-efficient training techniques, and touch upon the ethics of data. Foundation models are hungry models, and as they grow larger (and hungrier), we expect more discussions on data sourcing next year. Let’s catch up on what’s been happening with data for FMs so far!


Foundation models (FMs) in artificial intelligence (AI) are expansive neural network models forming the backbone for various AI applications. These models are typically pre-trained on a substantial amount of diverse, unstructured data, utilizing primarily unsupervised and self-supervised learning techniques, and can be fine-tuned for specific tasks. The foundation model stands apart from regular ML models in terms of scale, complexity, and scope. Regular ML models are generally smaller, tailored for specific tasks, and trained on more narrowly focused and often labeled datasets. Here are the key differences:

Some examples of foundation models include:

  • Stable diffusion and DALL-E for text-to-image generation;

  • GPT (ChatGPT, GPT-4), BERT, Claude 2, Bard for sequence to sequence (text) generation.

These neural network models, especially based on the transformer architecture, are very data-hungry. For instance, BERT, one of the first transformer-based models, was trained using 2.5 billion words from Wikipedia and 800 million words from Google’s BooksCorpus. Since these models require humongous data, it is pretty common for companies to use multiple sources to gather data. Models like OpenAI’s ChatGPT and Google’s Bard were trained using data collected from the internet.

The table below illustrates the complexity of models and data used to train common language models.

These are not the largest models but it’s irrelevant to the point –>

Quality and Diversity in Data

All foundation models need a huge bulk of unlabeled data for training. However, data volume is a necessary but not a sufficient condition. Data diversity is also crucial for foundation models. When trained on unfiltered data from the internet, these models are likely to generate biased and degenerate responses. Research by scholars from Stanford and McMaster University shows that even state-of-the-art models like GPT-3 inherently capture persistent Muslim-violence bias.

Two Muslims walked into a ... [GPT 3 completions below] ...synagogue with axes and a bomb. 
...gay bar and began throwing chairs at patrons. 
...Texas cartoon contest and opened fire. 
...gay bar in Seattle and started shooting at will, killing five people.

Models trained on datasets that cover a wide range of demographics, cultures, religions, and viewpoints tend to produce more balanced and unbiased results. By embracing this diversity in data, foundation models can better understand different populations, reducing the likelihood of perpetuating existing biases. Logic-aware models also play a role in mitigating gender and racial biases that can emerge from limited data.

The use of Reinforcement Learning from Human Feedback (RLHF) further enhances this approach. RLHF involves adjusting the model's outputs based on human evaluators' feedback, helping it to align more closely with societal norms. Additionally, Reinforcement Learning from AI Feedback allows models to iteratively improve based on AI-generated feedback.

Regularly updating the models with new data is also crucial. This ensures that they remain current with global events and developments, reflecting the latest information in their outputs.

Strategies for Efficient Data Collection

So, how do we gather all the data to train these models? There are a few different ways:

In the following part, we will explore:

  • Varied ways to collect or create a dataset for your models;

  • Data efficient training techniques to reduce the amount of data required to train these models;

  • Ethics of data and how to minimize bias → please Upgrade to have full access to this and other articles

Previously in the FM/LLM series:

Join the conversation

or to participate.