RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models

Post Details

Company

Together AI

Date Published

Oct. 30, 2023

Author

Together

Word Count

2,223

Language

English

Hacker News Points

1

Source URL

www.together.ai/blog/redpajama-data-v2

Summary

RedPajama-Data-v2 is an open dataset of 30 trillion filtered and deduplicated tokens drawn from 84 CommonCrawl dumps in five languages, released together with 40+ pre-computed data quality annotations. It is built from the ground up from publicly available web data and has three parts: the source data, the quality annotations, and the deduplication clusters. In raw form, the corpus contains over 100 billion text documents and 100+ trillion tokens. The release aims to spare the community the cost of processing CommonCrawl by providing a pool of web data from which high-quality training sets for large language models can be extracted and filtered. All data processing scripts are open source and available on GitHub. The dataset is meant as a foundation rather than a finished training set, because the optimal filtering depends on the intended use.
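As a rough illustration of why filtering depends on the intended use, the sketch below applies two different rule sets to the same small pool of annotated documents. It is a minimal example under stated assumptions, not the dataset's actual schema or the authors' pipeline: the record layout, the signal names `word_count` and `perplexity`, and the thresholds are all hypothetical.

```python
from typing import Callable

# Hypothetical records: each document carries its text plus pre-computed
# quality annotations. The signal names and values are illustrative only.
docs = [
    {"text": "A long, well-formed article ...",
     "signals": {"word_count": 850, "perplexity": 210.0}},
    {"text": "click here click here",
     "signals": {"word_count": 4, "perplexity": 1900.0}},
    {"text": "A short but clean note.",
     "signals": {"word_count": 40, "perplexity": 300.0}},
]

# A rule inspects a document's annotations and returns True to keep it.
Rule = Callable[[dict], bool]


def filter_docs(documents: list[dict], rules: list[Rule]) -> list[dict]:
    """Keep only the documents whose annotations satisfy every rule."""
    return [d for d in documents if all(rule(d["signals"]) for rule in rules)]


# Two hypothetical use cases with different thresholds over the same pool.
strict_rules: list[Rule] = [
    lambda s: s["word_count"] >= 50,
    lambda s: s["perplexity"] <= 500.0,
]
lenient_rules: list[Rule] = [
    lambda s: s["word_count"] >= 10,
]

strict = filter_docs(docs, strict_rules)
lenient = filter_docs(docs, lenient_rules)
print(len(strict), len(lenient))
```

Because the annotations are computed once and stored alongside the text, trying a different rule set only means re-running the cheap selection step, not reprocessing the raw web data.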