
TogetherCoder-Preview: SOTA Open Dataset for Training Efficient Agents

Blog post from Together AI

Company: Together AI
Authors: Alpay Ariyak*, Junda Zhang, Junxiong Wang, Shang Zhu, Federico Bianchi, Sanjana Srivastava, Ashwinee Panda, Siddhant Bharti, Chenfeng Xu, John Heo, Xiaoxia Shirley Wu, James Zhou, Percy Liang, Leon Song, Ce Zhang, Ben Athiwaratkun, Zhongzhu Zhou*, Qingyan
Word Count: 3,143
Language: English
Summary

TogetherCoder-Preview releases the largest open dataset of coding agent trajectories: 161,000 test-verified trajectories across 54,000 tasks drawn from 1,639 repositories. The initiative addresses the scarcity of high-quality open training data in the AI research community by making both the dataset and the model weights fully open, enabling wide-scale research and development.

The dataset was curated with rejection sampling to ensure quality, and models trained on it achieved notable performance: the 32B model attained a 59.4% pass rate on SWE-Bench Verified, ranking it highly among open-weight, open-data models. The methodology generates agent trajectories from diverse task sources and systematically filters solutions for quality, supporting robust training of long-horizon coding agents. While the dataset offers significant scale and context length, the post acknowledges limitations in adaptability to different agent scaffolds and in scope beyond bug-fixing tasks, and suggests future work on larger model scales and reinforcement learning enhancements.
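The rejection-sampling step described above can be sketched roughly as follows: roll out the agent several times per task and keep only trajectories whose final patch passes the repository's test suite. This is a minimal illustration, not the authors' pipeline; `sample_trajectory` and `run_tests` are hypothetical stand-ins for the actual agent rollout and test-execution harness, and the sampling budget is an assumption.

```python
import random

random.seed(0)

def sample_trajectory(task_id: str, attempt: int) -> str:
    # Hypothetical stand-in for rolling out a coding agent on a task;
    # a real pipeline would return the full trajectory plus a patch.
    return f"{task_id}-trajectory-{attempt}"

def run_tests(task_id: str, trajectory: str) -> bool:
    # Hypothetical stand-in for executing the repository's test suite
    # against the trajectory's final patch; here we simulate pass/fail.
    return random.random() < 0.3

def rejection_sample(task_ids, samples_per_task=8):
    """Keep only test-verified trajectories (rejection sampling)."""
    verified = []
    for task_id in task_ids:
        for attempt in range(samples_per_task):
            traj = sample_trajectory(task_id, attempt)
            if run_tests(task_id, traj):
                verified.append((task_id, traj))
    return verified

kept = rejection_sample([f"task-{i}" for i in range(100)])
print(f"kept {len(kept)} test-verified trajectories")
```

The key property is that quality is enforced by an executable oracle (the tests) rather than by a learned judge, so every retained trajectory is verified end to end.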