Company
Date Published
Author
Together
Word count
1032
Language
English
Hacker News points
None

Summary

`RedPajama` is a collaborative project aiming to create leading, fully open-source models, following in the footsteps of `Stable Diffusion`, which demonstrated the potential of open-source models. The project begins by reproducing the `LLaMA` training dataset of over 1.2 trillion tokens, with the goal of producing a set of high-quality, fully open-source models that can rival commercial offerings. The first component released is the pre-training data, which has been carefully filtered and processed to ensure quality and broad coverage. The project is a collaboration among organizations and researchers including `Together`, `Ontocord.ai`, `ETH DS3Lab`, `Stanford CRFM`, and `Hazy Research`. The dataset consists of seven data slices, each with its own filtering process, and is available for download through `Hugging Face`. With the pre-training data released, the next step is to train a strong base model and instruction-tune it using various tools and techniques. The project acknowledges the contributions of the open-source AI community and highlights the potential of fully open-source models to remove limitations on research, customization, and use with sensitive data.