Content Deep Dive

RedPajama, a project to create leading open-source models, starts by reproducing the LLaMA training dataset of over 1.2 trillion tokens

Blog post from Together AI

Post Details
Company: Together AI
Date Published: -
Author: Together
Word Count: 1,032
Language: English
Hacker News Points: -
Summary

`RedPajama` is a collaborative project aiming to create leading, fully open-source models, following in the footsteps of `Stable Diffusion`, which demonstrated the potential of open-source models. The project starts by reproducing the `LLaMA` training dataset of over 1.2 trillion tokens, with the goal of creating a set of high-quality, fully open-source models that can rival commercial offerings. The first component released is the pre-training data, which has been carefully filtered and processed to ensure quality and broad coverage. The project is a collaboration among several organizations and research groups, including `Together`, `Ontocord.ai`, `ETH DS3Lab`, `Stanford CRFM`, and `Hazy Research`. The dataset consists of seven data slices, each with its own filtering process, and is available for download through `Hugging Face`. With the pre-training data released, the next step is to train a strong base model and instruction-tune it using various tools and techniques. The project acknowledges the contributions of the open-source AI community and recognizes the potential of fully open-source models to remove limitations on research, customization, and sensitive data use.
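The per-slice filtering described above can be pictured as a routing step that applies a slice-specific quality gate to each document. This is an illustrative sketch only, not the actual RedPajama pipeline: the slice names are real, but the filter functions and thresholds here are hypothetical placeholders.

```python
# Illustrative sketch of per-slice document filtering.
# NOTE: the filter logic and thresholds below are hypothetical,
# not the actual RedPajama filters.

def filter_common_crawl(doc: str) -> bool:
    # Hypothetical quality gate: keep documents with enough words.
    return len(doc.split()) >= 50

def filter_github(doc: str) -> bool:
    # Hypothetical gate: drop near-empty files.
    return len(doc) >= 100

# One filter per slice; the real dataset has seven slices in total.
SLICE_FILTERS = {
    "common_crawl": filter_common_crawl,
    "github": filter_github,
}

def keep_document(slice_name: str, doc: str) -> bool:
    """Apply the slice-specific filter; unknown slices are rejected."""
    f = SLICE_FILTERS.get(slice_name)
    return f(doc) if f is not None else False
```

The key design point mirrored from the post is that filtering is not uniform: each source (Common Crawl, GitHub, and so on) gets its own criteria before its documents enter the corpus.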