
5 Large-Scale Datasets for AI Research #2

These Diverse and Challenging Sets of Data Will Help AI Systems Learn and Grow

  1. MineDojo (videos and text)

MineDojo features a massive database collected automatically from the internet; a quick inspection sketch follows the list.

  • 730K+ narrated Minecraft videos, which add up to ~300K hours and 2.2B words in English transcripts.

  • Text, images, tables, and diagrams from ~7K Minecraft Wiki pages.

  • 340K+ Reddit posts along with 6.6M comments under the “r/Minecraft” subreddit.
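
For orientation, here is a minimal sketch of how the downloaded YouTube index might be inspected. The file name and the `duration` and `transcript` fields are hypothetical placeholders for illustration, not MineDojo's documented schema:

```python
import json

# Hypothetical: assumes the MineDojo YouTube index has been downloaded
# locally as JSON Lines, one video record per line.
total_seconds = 0
total_words = 0
with open("minedojo_youtube_index.jsonl") as f:  # placeholder file name
    for line in f:
        video = json.loads(line)
        total_seconds += video.get("duration", 0)  # assumed field (seconds)
        total_words += len(video.get("transcript", "").split())  # assumed field

print(f"{total_seconds / 3600:,.0f} hours of video, {total_words:,} transcript words")
```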

  2. ROOTS (text)

The Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus is a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model.
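
As a sketch, individual ROOTS components can be streamed through the Hugging Face datasets library. The corpus is gated and the subset ID below is an assumption; substitute whichever component you have been granted access to:

```python
from datasets import load_dataset

# Assumed subset ID: ROOTS components are published per source under the
# bigscience-data org on the Hugging Face Hub and require accepting the
# access terms before loading.
ds = load_dataset(
    "bigscience-data/roots_en_wikipedia",  # assumption; pick your component
    split="train",
    streaming=True,  # stream instead of downloading the full corpus
)

# Peek at a few documents.
for i, doc in enumerate(ds):
    print(doc["text"][:200])  # "text" field assumed
    if i == 2:
        break
```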

  3. Natural-Instructions-v2 (text)

The goal of the Natural-Instructions project is to provide a high-quality benchmark for measuring generalization to unseen tasks. The v1.x dataset consists of 61 tasks and leverages the crowdsourcing templates of existing NLP datasets. The v2.x dataset builds on that earlier work, has a simpler schema, and contains over 1.5K tasks.
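
Each v2 task is a single JSON file in the project's GitHub repository (allenai/natural-instructions). A minimal reading sketch follows; the field names reflect the published schema as I understand it, so verify against the repo:

```python
import json

# One task per file under tasks/ in the allenai/natural-instructions repo.
with open("task001_quoref_question_generation.json") as f:  # example task
    task = json.load(f)

print(task["Definition"][0])         # natural-language instruction for the task
print(task["Positive Examples"][0])  # a worked input/output demonstration
instance = task["Instances"][0]
print(instance["input"])             # one evaluation input
print(instance["output"])            # list of acceptable reference outputs
```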

  4. Anthropic Helpfulness Dataset (text)

Human preference data about helpfulness and harmlessness from "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." The data are meant to train preference (or reward) models for subsequent RLHF training, not to supervise dialogue agents directly; training dialogue agents on these data is likely to produce harmful models and should be avoided.
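
The data are hosted on the Hugging Face Hub as Anthropic/hh-rlhf, where each record pairs the conversation a labeler preferred ("chosen") with the one they rejected ("rejected"); a reward model is then trained to score the former above the latter. A minimal loading sketch:

```python
from datasets import load_dataset

# Each example holds two full conversation transcripts that share a prompt:
# the response the human labeler preferred and the one they did not.
ds = load_dataset("Anthropic/hh-rlhf", split="train")

example = ds[0]
print(example["chosen"][:300])    # preferred conversation
print("---")
print(example["rejected"][:300])  # dispreferred conversation
```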

  5. LAION-115M (text and image)

This dataset is a version of LAION-400M in which noisy or inaccurate captions have been filtered out or replaced with captions generated by the BLIP model.
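
To illustrate the caption-generation step, here is a minimal sketch using a publicly released BLIP captioning checkpoint via the transformers library. It demonstrates the model family, not the exact pipeline or checkpoint used to build LAION-115M, and the image URL is a placeholder:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Public BLIP captioning checkpoint (assumption: not necessarily the
# checkpoint used to build the dataset).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "https://example.com/sample.jpg"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```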

Every day we post helpful lists and bite-sized explanations on our Twitter. Please join us there!
