
TogetherCoder-Preview: SOTA Open Dataset for Training Efficient Agents

Blog post from Together AI

Company: Together AI
Authors: Alpay Ariyak*, Junda Zhang, Junxiong Wang, Shang Zhu, Federico Bianchi, Sanjana Srivastava, Ashwinee Panda, Siddhant Bharti, Chenfeng Xu, John Heo, Xiaoxia Shirley Wu, James Zhou, Percy Liang, Leon Song, Ce Zhang, Ben Athiwaratkun, Zhongzhu Zhou*, Qingyan
Word Count: 3,143
Language: English
Summary

TogetherCoder-Preview releases the largest open dataset of coding agent trajectories: 161,000 test-verified trajectories across 54,000 tasks drawn from 1,639 repositories. The initiative addresses the scarcity of high-quality open training data in the AI research community by making both the dataset and the model weights fully open, enabling wide-scale research and development.

The dataset was curated with rejection sampling to ensure quality, and models trained on it achieved notable performance: the 32B model attained a 59.4% pass rate on SWE-Bench Verified, ranking it highly among open-weight, open-data models. The methodology generates agent trajectories from diverse task sources and systematically filters solutions for quality, supporting robust training of long-horizon coding agents. While the dataset offers significant scale and context length, the post acknowledges limitations in adaptability to different agent scaffolds and in scope beyond bug-fixing tasks, and suggests future work on larger model scales and reinforcement learning enhancements.
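The rejection-sampling step described above can be sketched roughly as follows: roll out the agent several times per task and keep only trajectories whose final patch passes the repository's test suite. This is a minimal illustration, not the authors' pipeline; `sample_trajectory` and `run_tests` are hypothetical stand-ins for the actual agent rollout and test-execution harness, and the sampling budget is an assumption.

```python
import random

random.seed(0)

def sample_trajectory(task_id: str, attempt: int) -> str:
    # Hypothetical stand-in for rolling out a coding agent on a task;
    # a real pipeline would return the full trajectory plus a patch.
    return f"{task_id}-trajectory-{attempt}"

def run_tests(task_id: str, trajectory: str) -> bool:
    # Hypothetical stand-in for executing the repository's test suite
    # against the trajectory's final patch; here we simulate pass/fail.
    return random.random() < 0.3

def rejection_sample(task_ids, samples_per_task=8):
    """Keep only test-verified trajectories (rejection sampling)."""
    verified = []
    for task_id in task_ids:
        for attempt in range(samples_per_task):
            traj = sample_trajectory(task_id, attempt)
            if run_tests(task_id, traj):
                verified.append((task_id, traj))
    return verified

kept = rejection_sample([f"task-{i}" for i in range(100)])
print(f"kept {len(kept)} test-verified trajectories")
```

The key property is that quality is enforced by an executable oracle (the tests) rather than by a learned judge, so every retained trajectory is verified end to end.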