Content Deep Dive

RedPajama, a project to create leading open-source models, starts by reproducing the LLaMA training dataset of over 1.2 trillion tokens

Blog post from Together AI

Post Details
Company: Together AI
Date Published: -
Author: Together
Word Count: 1,032
Language: English
Hacker News Points: -
Summary

`RedPajama` is a collaborative project aiming to create leading, fully open-source models, following in the footsteps of `Stable Diffusion`, which demonstrated the potential of open-source models. The project starts by reproducing the `LLaMA` training dataset of over 1.2 trillion tokens, with the goal of creating a set of high-quality, fully open-source models that can rival commercial offerings. The first component released is the pre-training data, which has been carefully filtered and processed to ensure quality and broad coverage. The project is a collaboration among several organizations and research groups, including `Together`, `Ontocord.ai`, `ETH DS3Lab`, `Stanford CRFM`, and `Hazy Research`. The dataset consists of seven data slices, each with its own filtering process, and is available for download through `Hugging Face`. With the pre-training data released, the next step is to train a strong base model and instruction-tune it using various tools and techniques. The project acknowledges the contributions of the open-source AI community and recognizes the potential of fully open-source models to remove limitations on research, customization, and sensitive data use.
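The per-slice filtering described above can be pictured as a routing step that applies a slice-specific quality gate to each document. This is an illustrative sketch only, not the actual RedPajama pipeline: the slice names are real, but the filter functions and thresholds here are hypothetical placeholders.

```python
# Illustrative sketch of per-slice document filtering.
# NOTE: the filter logic and thresholds below are hypothetical,
# not the actual RedPajama filters.

def filter_common_crawl(doc: str) -> bool:
    # Hypothetical quality gate: keep documents with enough words.
    return len(doc.split()) >= 50

def filter_github(doc: str) -> bool:
    # Hypothetical gate: drop near-empty files.
    return len(doc) >= 100

# One filter per slice; the real dataset has seven slices in total.
SLICE_FILTERS = {
    "common_crawl": filter_common_crawl,
    "github": filter_github,
}

def keep_document(slice_name: str, doc: str) -> bool:
    """Apply the slice-specific filter; unknown slices are rejected."""
    f = SLICE_FILTERS.get(slice_name)
    return f(doc) if f is not None else False
```

The key design point mirrored from the post is that filtering is not uniform: each source (Common Crawl, GitHub, and so on) gets its own criteria before its documents enter the corpus.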