Guest Post: Open Recommender System Datasets – A Current Landscape
In this guest post, Avi Chawla, Founder of Daily Dose of Data Science and author of AIport, spotlights Yambda-5B – a rare, production-scale recommender dataset newly open to public research – and shows how it complements classic datasets like MovieLens, Amazon, and Spotify by addressing limitations in scale, modality, and evaluation.
Recommender systems thrive on data.
However, the data used in academic research often looks nothing like the data that fuels real-world recommenders: production logs sit locked inside companies, guarded for both business value and privacy reasons.
Yandex recently published its 5-billion-event dataset, Yambda-5B, on Hugging Face, making it publicly available to anyone working on recommender algorithms. That release prompted me to put together a short overview of recommender system datasets openly available to researchers and developers. Below are the most noteworthy datasets in this field.
Over the years, the RecSys community has relied on a handful of public datasets (listed below) as benchmarks. Each has contributed to research progress, but each comes with limitations. For instance:
- MovieLens: Contains user-provided movie ratings (1–5 stars) with timestamps. Its small scope (~10k movies total) made it great for early studies, but it’s not representative of industrial-scale catalogs.
- Netflix Prize: ~100M movie ratings from Netflix’s 2006–09 recommendation challenge. Despite its historic role in advancing recommender research, it covers only ~17k movies and uses only coarse date timestamps. It’s also a one-time snapshot from the mid-2000s with no updates.
- Yelp Open: 8.6M reviews of local businesses (restaurants, shops, etc.) by 2.2M users. It’s useful for experiments, but the data is extremely sparse and limited to a few cities. No standard train/test split is provided (researchers devise their own evaluation schemes).
- Last.fm (LFM-1B): Approximately 1B music listening events (“scrobbles”) from the Last.fm online music service. It was once widely used for music recommendation research. However, due to licensing restrictions, LFM-1B (and an even larger LFM-2B version) is no longer publicly accessible.
- Criteo 1TB: A terabyte of ad click logs (over 4 billion interactions). This dataset reflects true industry scale and is used to train click-through-rate models. But it doesn’t resemble typical recommendation data: it has no meaningful user or item metadata (only hashed IDs) and no timestamps.
- Spotify Million Playlist: 1 million user-generated playlists (~66 million track entries) released for the RecSys Challenge 2018. This dataset is excellent for studying short-term preferences and sequence modeling, but it doesn’t include long-term user histories or any explicit feedback.
- Amazon Reviews: 200M+ product reviews from Amazon across many categories. This dataset is rich in content and has been used for product recommendation and sentiment analysis research. But it’s extremely sparse and has a long-tail distribution, i.e., most users and products have only a few interactions.
Yambda-5B mitigates these challenges by offering researchers large-scale, anonymized data from Yandex’s music streaming service, including features such as an is_organic flag and a Global Temporal Split (GTS) evaluation protocol.
Let’s examine the problems this dataset can help solve.
Problem 1: Lack of real-world datasets
Modern internet platforms log billions of user interactions every year, far beyond the size of classic academic datasets.
An algorithm that looks SOTA on a million-rating dataset might break or underperform when faced with a billion-event stream.
Yambda-5B contains 4.79 billion user-item interactions, which is orders of magnitude more data than MovieLens or Netflix.
And despite being extremely large, the dataset is accessible to different research budgets: Yandex released multiple versions – a 50-million-interaction sample, a 500-million sample, and the full 5 billion – so you can start small and scale up as needed (see the loading sketch below).
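If you want to try this yourself, a minimal loading sketch with the huggingface_hub client might look like the following. Note that the repo id (yandex/yambda) and the parquet file path are assumptions on my part; check the dataset card on Hugging Face for the exact layout before running.

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Assumed repo id and file layout -- verify against the dataset card
# on Hugging Face; the names here are illustrative.
REPO_ID = "yandex/yambda"

# Start with the 50M sample; swap in the 500M or full 5B variant
# once the pipeline works.
path = hf_hub_download(
    repo_id=REPO_ID,
    filename="flat/50m/likes.parquet",  # hypothetical path within the repo
    repo_type="dataset",
)

likes = pd.read_parquet(path)
print(likes.head())
```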
Problem 2: Privacy
Sharing real user behavior logs is tricky, even if you anonymize user IDs, since people can sometimes be re-identified by just a few unique preferences.
A famous example is the Netflix Prize dataset of 100 million movie ratings: it was released for a competition, but researchers showed it was possible to de-anonymize and identify individual users by matching their ratings with public IMDb reviews.
Netflix even canceled a follow-up contest in 2010 after a privacy lawsuit highlighted these risks.
Yambda-5B is different in this respect: unlike the Netflix ratings, its listening histories and likes have no publicly accessible counterpart to match against, making the dataset inherently resistant to de-anonymization.
The data is also rigorously anonymized and safeguarded, greatly reducing the risk of sensitive-data exposure.
More key features
Importantly, Yambda includes both implicit feedback (song listens, skips) and explicit feedback (track “likes” or “dislikes”), so models can learn from both passive behavior and active preferences.
Each interaction is labeled with an is_organic flag indicating whether the play was an organic user action or triggered by the recommendation engine.
This lets researchers separate natural listening behavior from recommendation-driven behavior, which is crucial for evaluating algorithmic impact.
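As a rough sketch of how these signals might be separated, assuming a flat table of interactions with is_organic and event_type columns (the schema below is illustrative, not the dataset's exact one):

```python
import pandas as pd

# Illustrative schema: one row per interaction. Column names are
# assumptions -- check the actual dataset card for the real ones.
events = pd.DataFrame({
    "uid":        [1, 1, 2, 2, 3],
    "item_id":    [10, 11, 10, 12, 11],
    "event_type": ["listen", "like", "listen", "dislike", "listen"],
    "is_organic": [1, 0, 1, 1, 0],
})

# Separate natural listening behavior from recommendation-driven behavior.
organic     = events[events["is_organic"] == 1]
recommended = events[events["is_organic"] == 0]

# Explicit signals (likes/dislikes) vs. implicit ones (listens, skips).
explicit = events[events["event_type"].isin(["like", "dislike"])]
implicit = events[~events["event_type"].isin(["like", "dislike"])]

print(f"{len(organic)} organic vs {len(recommended)} recommended events")
```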
Unlike most older datasets, Yambda provides precise timestamps for all events and comes with a global temporal split for model evaluation, which lets us train on earlier interactions and test on a held-out set of later interactions.
Evaluating on this time-based split (rather than random hold-outs) gives a more realistic measure of how a model might perform in an online setting.
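Yambda ships with its own split, but the underlying idea is simple enough to sketch: pick a single global cutoff time and put everything after it in the test set. A minimal version, assuming a numeric timestamp column, might look like this:

```python
import pandas as pd

def global_temporal_split(events: pd.DataFrame, test_frac: float = 0.1):
    """Train on everything before a global cutoff time, test on the rest.

    Unlike a random hold-out, every test interaction happens *after*
    every training interaction, mimicking online deployment.
    """
    cutoff = events["timestamp"].quantile(1 - test_frac)
    train = events[events["timestamp"] <= cutoff]
    test = events[events["timestamp"] > cutoff]
    return train, test
```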
Another unique aspect is that Yambda is multi-modal: it ships with precomputed audio embeddings for over 7.7 million tracks, enabling content-aware recommendation strategies out of the box.
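To give a flavor of what “content-aware” enables, here is a sketch of track-to-track similarity via cosine similarity over such embeddings, with random vectors standing in for the real ones:

```python
import numpy as np

# Stand-in for the precomputed audio embeddings (tracks x dims);
# the real ones would be loaded from the dataset.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128)).astype(np.float32)

# L2-normalize so dot products equal cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def most_similar(track_idx: int, k: int = 10) -> np.ndarray:
    """Return indices of the k most similar tracks by cosine similarity."""
    scores = embeddings @ embeddings[track_idx]
    scores[track_idx] = -np.inf            # exclude the query track itself
    return np.argsort(-scores)[:k]

print(most_similar(42, k=5))
```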
The release even includes baseline models and evaluation code, with metrics like NDCG@K and Recall@K reported, to help researchers get started and compare methods on a standard benchmark.
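If you want to sanity-check your results against those baselines, both metrics are simple to compute yourself. A minimal version for a single user with binary relevance might look like this:

```python
import numpy as np

def recall_at_k(ranked_items, relevant_items, k: int) -> float:
    """Fraction of the user's relevant items that appear in the top-k."""
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items) if relevant_items else 0.0

def ndcg_at_k(ranked_items, relevant_items, k: int) -> float:
    """NDCG@K with binary relevance: DCG of the ranking / ideal DCG."""
    relevant = set(relevant_items)
    dcg = sum(
        1.0 / np.log2(rank + 2)            # ranks are 0-indexed
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    ideal = sum(1.0 / np.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

print(recall_at_k([3, 1, 5, 2], relevant_items=[1, 2], k=3))  # 0.5
print(ndcg_at_k([3, 1, 5, 2], relevant_items=[1, 2], k=3))    # ~0.39
```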
Conclusion
Historically, we haven’t had many large-scale open datasets, which made it challenging to benchmark algorithms intended for real-world use.
Yandex’s Yambda-5B is a significant step toward bridging that gap, offering a web-scale dataset that academia can freely explore.
If you’re interested in exploring Yambda-5B yourself, it’s available as the Yambda dataset on Hugging Face.
With resources like this becoming available, we can move closer to recommender models that truly translate from paper to production.
Thanks for reading!
*This post was written by Avi Chawla, Founder of Daily Dose of Data Science and author of AIport, specially for Turing Post.*