5 Large-Scale Datasets for AI Research

SA-1B (image)
SA-1B consists of 11M diverse, high-resolution, privacy-protecting images collected and licensed from a third-party photo company. The images are photos taken with a camera, i.e. not artwork. They vary in subject matter; common themes include locations, objects, and scenes. The dataset includes 1.1B high-quality segmentation masks collected with the Segment Anything Data Engine.
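As a quick sanity check on these figures, the short sketch below computes the average number of masks per image implied by the rounded counts quoted above. The variable names are illustrative, and the result is only as precise as the rounded 11M and 1.1B values.

```python
# Rounded counts quoted in the SA-1B description above.
num_images = 11_000_000
num_masks = 1_100_000_000

# Average number of masks per image implied by those rounded counts.
masks_per_image = num_masks / num_images
print(f"~{masks_per_image:.0f} masks per image on average")
```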
OIG-moderation (text)
OIG-moderation is a diverse dataset of dialogue that may be related to NSFW subject matter, abuse-eliciting text, privacy-violation-eliciting instructions, depression or related content, hate speech, and other similar topics. The dataset consists of the [prosocial] and [anthropic redteam] datasets and subsets of the [English Wikipedia] dataset, along with other public datasets and data created or contributed by volunteers. To regularize the dataset, there are also "regular" OIG instructions, which include Q/A instructions, coding instructions, and similar types of queries.
OIG-43M (text)
The Open Instruction Generalist (OIG) dataset is a large open-source instruction dataset that currently contains ~43M instructions. OIG is one of many chatbot datasets that LAION, along with its volunteers, Ontocord, Together, and other members of the open-source community, has released. It is intended to create equal access to chatbot technology.
Flan Collection (text)
“Flan 2022” combines prior collections from FLAN, P3/T0, and Natural Instructions with new dialog, program synthesis, and complex reasoning tasks.
LAION-5B (text and image)
LAION-5B is a dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M, which was previously the biggest openly accessible image-text dataset in the world.
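The "14x" figure can be roughly checked with a few lines of arithmetic. This sketch assumes LAION-400M holds a nominal 400 million pairs, as its name suggests. That count is an assumption, not a figure stated here, and the variable names are illustrative.

```python
# Pair count quoted in the LAION-5B description above.
laion_5b_pairs = 5_850_000_000

# Assumption: nominal size taken from the dataset's name, not an exact count.
laion_400m_pairs = 400_000_000

# Size ratio between the two datasets under that assumption.
ratio = laion_5b_pairs / laion_400m_pairs
print(f"LAION-5B is about {ratio:.1f}x the size of LAION-400M")
```

Under that assumption the ratio comes out slightly above 14, consistent with the rounded "14x" in the description.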
