Code Concepts: A Large-Scale Synthetic Dataset Generated from Programming Concept Seeds

Blog post from HuggingFace

Post Details
Company: HuggingFace
Author: Joseph Jennings and Brandon Norick
Word Count: 710
Summary

Improving large language models requires not only more data and higher-quality data, but also data specific enough to strengthen particular skills. The post introduces concept-driven synthetic data generation, an approach for building datasets aligned with the capabilities a model should acquire, and demonstrates it with the Nemotron-Pretraining-Code-Concepts subset of the Nemotron-Pretraining-Specialized-v1.1 dataset. Guided by a curated taxonomy of programming knowledge, the method produced approximately 15 million Python programming problems intended to improve foundational programming skills during language model pretraining. Including this data in the final 100 billion tokens of Nemotron-Nano-v3 pretraining yielded a six-point improvement on the HumanEval benchmark. Because generation is seeded by concepts, the workflow gives control over difficulty, diversity, and conceptual balance; the authors report both qualitative and quantitative gains across varied programming concepts. The dataset and taxonomy are released under a permissive open license to encourage application and extension to other domains.
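
The concept-seeded workflow described in the summary can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the tiny `TAXONOMY`, the `DIFFICULTIES` list, the prompt wording, and the helper names `sample_seed`, `build_prompt`, and `generate_prompts` are hypothetical and are not taken from the released dataset or the authors' pipeline. The sketch only builds prompts; it does not call any generator model.

```python
import random

# Hypothetical mini-taxonomy for illustration only; the released taxonomy
# is far larger and its actual categories are not reproduced here.
TAXONOMY = {
    "data_structures": ["lists", "dictionaries", "sets"],
    "control_flow": ["loops", "recursion", "conditionals"],
    "algorithms": ["sorting", "binary search", "dynamic programming"],
}

# Assumed difficulty levels; the source only says difficulty is controllable.
DIFFICULTIES = ["easy", "medium", "hard"]


def sample_seed(rng, n_concepts=2):
    """Pick one concept from each of n distinct categories, plus a difficulty."""
    categories = rng.sample(sorted(TAXONOMY), n_concepts)
    concepts = [rng.choice(TAXONOMY[c]) for c in categories]
    return {"concepts": concepts, "difficulty": rng.choice(DIFFICULTIES)}


def build_prompt(seed):
    """Turn a seed into an instruction for a generator model (not called here)."""
    joined = " and ".join(seed["concepts"])
    return (
        f"Write a {seed['difficulty']} Python programming problem that "
        f"exercises {joined}. Include a reference solution."
    )


def generate_prompts(n, seed_value=0):
    """Build n prompts reproducibly from a fixed random seed."""
    rng = random.Random(seed_value)
    return [build_prompt(sample_seed(rng)) for _ in range(n)]


prompts = generate_prompts(5)
for p in prompts:
    print(p)
```

Sampling categories uniformly is one simple way to keep concepts balanced, and drawing from distinct categories encourages diverse combinations. A real pipeline would likely weight categories, filter generated problems, and deduplicate them, none of which is shown here.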