
CoderForge-Preview: SOTA open dataset for training efficient coding agents

Blog post from Together AI

Post Details
Company: Together AI
Date Published:
Author: Alpay Ariyak*, Junda Zhang, Junxiong Wang, Shang Zhu, Federico Bianchi, Sanjana Srivastava, Ashwinee Panda, Siddhant Bharti, Chenfeng Xu, John Heo, Xiaoxia Shirley Wu, James Zou, Percy Liang, Leon Song, Ce Zhang, Ben Athiwaratkun, Zhongzhu Zhou*, Qingyang
Word Count: 3,083
Language: English
Hacker News Points: -
Summary

CoderForge-Preview is the largest open dataset of coding-agent trajectories available, comprising 258,000 test-verified trajectories from 51,000 tasks across 1,655 repositories. It is designed to address the shortage of high-quality open training data that limits the advancement of open-weight coding models.

Trajectories were generated with Qwen3-Coder-480B and filtered through rejection sampling, yielding 155,000 successful trajectories out of the 258,000 generated. Tasks are drawn from sources including R2E-Gym, SWE-Smith, and SWE-Rebench, and are set within a standardized action/observation interface using the OpenHands scaffold. The work emphasizes training only on successful trajectories to improve task-resolution efficiency, and the dataset underwent a thorough license audit, retaining only trajectories from repositories under permissive open-source licenses.

The dataset was used to train models at 32 billion and 4 billion parameters, achieving significant performance improvements; the Qwen-3 32B model ranked highest among open-data models in the ≤32B parameter range on SWE-Bench Verified.

The dataset has limitations: limited adaptability to different scaffolds, a focus on bug-fixing tasks, and no modeling of user interaction. Nonetheless, it aims to drive further open-source AI development by providing a foundation for exploring agentic reinforcement learning and larger model scales.
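The rejection-sampling and license-audit filters described in the summary can be sketched as follows. This is a minimal illustration, not the dataset's actual pipeline: the `Trajectory` record, its field names, and the license allowlist are all assumptions for the sake of the example.

```python
from dataclasses import dataclass

# Hypothetical trajectory record; field names are illustrative
# assumptions, not the dataset's actual schema.
@dataclass
class Trajectory:
    task_id: str
    repo_license: str   # SPDX identifier of the source repository
    tests_passed: bool  # did the agent's final patch pass the task's tests?

# Illustrative allowlist of permissive licenses retained by the audit.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}

def keep(t: Trajectory) -> bool:
    """Rejection sampling plus license audit: retain only trajectories
    that passed their task's tests AND come from a permissively
    licensed repository."""
    return t.tests_passed and t.repo_license in PERMISSIVE

pool = [
    Trajectory("task-1", "MIT", True),
    Trajectory("task-1", "MIT", False),      # failed tests: rejected
    Trajectory("task-2", "GPL-3.0", True),   # non-permissive: rejected
]
verified = [t for t in pool if keep(t)]
print(len(verified))  # 1 of the 3 candidates survives the filter
```

Applied at scale, a filter of this shape is what turns a pool of generated trajectories (258,000 in the post) into the smaller set of verified, permissively licensed ones used for training.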