Company
Date Published
Author
Together
Word count
2223
Language
English
Hacker News points
1

Summary

RedPajama-Data-v2 is an open dataset of 30 trillion filtered and deduplicated tokens, drawn from 100+ trillion raw tokens across 84 CommonCrawl dumps in five languages, and it ships with 40+ pre-computed data quality annotations. Built from the ground up on publicly available web data, it has three components: source data (100 billion text documents), quality annotations, and deduplication clusters. The release aims to reduce the community's burden of building such datasets by providing a pool of web data from which high-quality datasets for large language models can be extracted and filtered. All data processing scripts are open source and available on GitHub. The goal is to provide a foundation for creating high-quality datasets; the optimal filtering depends on the intended use.