Scaling Pedagogical Pre-training: From Optimal Mixing to 10 Billion Tokens

Blog post from HuggingFace

Post Details
Company: HuggingFace
Author: Asankhaya Sharma
Word Count: 4,656
Summary

The article describes the development of Sutra-10B, a 10-billion-token pedagogical pre-training dataset designed to improve language model performance through an optimized content mix. It builds on earlier experiments which found that a static mix of textbook-quality PDFs, filtered web content, and educational resources was superior to more complex curriculum strategies, reaching high performance with less data. The Sutra framework generates educational content from a knowledge graph that defines the curriculum structure, and it varies content styles to keep the data diverse and high in quality. Although perplexity improved while training SmolLM2-70M, the study finds that small models are limited in how much knowledge they can encode, and that model size ultimately constrains performance more than data quantity or quality does. The article suggests that future work should train larger models with structured curricula to take fuller advantage of high-quality datasets such as Sutra-10B.
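
As a rough illustration of what a "static mix" means in practice, the sketch below draws a source label for each document using fixed sampling weights that never change during training. The weights, the source names, and the `sample_sources` helper are all hypothetical; the summary does not state the actual ratios or the tooling used for Sutra-10B.

```python
import random
from collections import Counter

# Hypothetical mixing weights; the summary does not give the real ratios.
MIX_WEIGHTS = {
    "textbook_pdfs": 0.5,
    "filtered_web": 0.3,
    "educational_resources": 0.2,
}


def sample_sources(n_docs, weights=MIX_WEIGHTS, seed=0):
    """Pick a source for each of n_docs documents using fixed (static) weights."""
    rng = random.Random(seed)  # seeded so the mix is reproducible
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=n_docs)


if __name__ == "__main__":
    n = 10_000
    counts = Counter(sample_sources(n))
    for name, count in counts.most_common():
        print(f"{name}: {count / n:.2%}")
```

A curriculum strategy would instead change these weights as training progresses; the static version keeps one distribution throughout, which is the simpler setup the summary reports as working better.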