The research explores whether language models can be trained on significantly less data without sacrificing performance, focusing on a GPT-2-sized model trained on only 1 billion tokens rather than the usual 10 billion. Across more than 50 experiments, the team identified an optimal pre-training mix of 50% finePDFs, 30% DCLM-baseline, and 20% FineWeb-Edu, which outperformed more complex curriculum learning strategies on both validation and generalization metrics while also being more efficient to run. The results show that careful dataset curation combined with a static mixing strategy can recover over 90% of the performance of models trained on far larger datasets, challenging the assumption that more data automatically yields better models. This approach reduces computational cost and training time, and it underscores how much dataset quality and diversity matter for training robust models. The study also documents the pitfalls of curriculum learning, such as catastrophic forgetting and overfitting, reinforcing the benefits of keeping the data distribution consistent throughout training.
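To make the static mixing strategy concrete, here is a minimal sketch using the Hugging Face `datasets` library. The hub repository IDs, split names, and the omission of per-dataset config arguments are assumptions for illustration; the study's exact data-loading code is not shown in this summary, only the 50/30/20 proportions.

```python
# Sketch of a static 50% / 30% / 20% pre-training mix.
# Repo IDs below are assumed; configs/subsets may be required depending on the hub layout.
from datasets import load_dataset, interleave_datasets

# Stream each corpus so nothing needs to fit on disk.
finepdfs = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)            # assumed repo id
dclm = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)       # assumed repo id
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)      # assumed repo id

# Static mix: every sample is drawn from the same fixed distribution for the whole run,
# in contrast to curriculum schedules that change the distribution over training phases.
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

# A training loop would then tokenize and pack `mixed` up to the ~1B-token budget,
# e.g. iterating with `for example in mixed: ...` until the budget is reached.
```

Because the mixture probabilities never change, the model sees the same data distribution at step 1 and at the final step, which is the property the authors credit for avoiding the catastrophic forgetting observed with phased curricula.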