We have made significant progress training models on RedPajama, our effort to create leading open-source models. Holding the model architecture fixed, we trained on our 1T-token base dataset and compared the results against models trained on the Pile, a dataset widely used for open pre-training. Models trained on RedPajama outperform their Pile counterparts on several benchmarks, with the gap most visible at higher token counts. The quality of our checkpoints continues to improve as we train on more tokens, though we still lag behind LLaMA-7B on some metrics. We are excited to keep improving the data and to explore ways of combining it with other datasets, and our goal remains to work with the open-source AI community to build the best large language models possible.
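
Concretely, the comparison above boils down to holding the architecture and training recipe fixed, training on each dataset, and scoring intermediate checkpoints on benchmarks as the token count grows. The sketch below shows one way a zero-shot multiple-choice evaluation of a checkpoint can be run with Hugging Face `transformers`; the checkpoint path, the example format, and the scoring loop are illustrative placeholders, not the exact harness we use.

```python
# Minimal sketch: score a causal-LM checkpoint on a multiple-choice benchmark
# by picking the continuation with the highest total log-probability.
# CHECKPOINT and the `examples` structure are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "path/to/intermediate-checkpoint"  # hypothetical checkpoint path

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`.
    Assumes the tokenization of `context` is a prefix of the tokenization of
    `context + continuation` (a simplification for this sketch)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the shifted logits predicts token i + 1 of the input.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_ids = full_ids[0, ctx_ids.shape[1]:]
    cont_positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, tok].item() for pos, tok in zip(cont_positions, cont_ids))

def accuracy(examples) -> float:
    """`examples` is a list of dicts: {"context": str, "choices": [str], "label": int}."""
    correct = 0
    for ex in examples:
        scores = [continuation_logprob(ex["context"], c) for c in ex["choices"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == ex["label"])
    return correct / len(examples)
```

Running the same script over checkpoints saved at increasing token counts, for a RedPajama-trained and a Pile-trained model of the same architecture, yields the per-benchmark curves behind the comparison described above.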