We have made significant progress training models on RedPajama, our effort to create leading open-source models. Holding the model architecture fixed, we trained on our 1T-token base dataset and compared the results against models trained on the Pile, a dataset widely used for open pre-training. Models trained on RedPajama outperform their Pile counterparts on several benchmarks, with the gap most visible at higher token counts. The quality of our checkpoints continues to improve as we train on more tokens, though we still lag behind LLaMA-7B on some metrics. We are excited to keep improving the data and to explore ways of combining it with other datasets, and our goal remains to work with the open-source AI community to build the best large language models possible.
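
Concretely, the comparison above boils down to holding the architecture and training recipe fixed, training on each dataset, and scoring intermediate checkpoints on benchmarks as the token count grows. The sketch below shows one way a zero-shot multiple-choice evaluation of a checkpoint can be run with Hugging Face `transformers`; the checkpoint path, the example format, and the scoring loop are illustrative placeholders, not the exact harness we use.

```python
# Minimal sketch: score a causal-LM checkpoint on a multiple-choice benchmark
# by picking the continuation with the highest total log-probability.
# CHECKPOINT and the `examples` structure are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "path/to/intermediate-checkpoint"  # hypothetical checkpoint path

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`.
    Assumes the tokenization of `context` is a prefix of the tokenization of
    `context + continuation` (a simplification for this sketch)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the shifted logits predicts token i + 1 of the input.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_ids = full_ids[0, ctx_ids.shape[1]:]
    cont_positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, tok].item() for pos, tok in zip(cont_positions, cont_ids))

def accuracy(examples) -> float:
    """`examples` is a list of dicts: {"context": str, "choices": [str], "label": int}."""
    correct = 0
    for ex in examples:
        scores = [continuation_logprob(ex["context"], c) for c in ex["choices"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == ex["label"])
    return correct / len(examples)
```

Running the same script over checkpoints saved at increasing token counts, for a RedPajama-trained and a Pile-trained model of the same architecture, yields the per-benchmark curves behind the comparison described above.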