CoderForge-Preview: SOTA open dataset for training efficient coding agents
Blog post from Together AI
CoderForge-Preview is the largest open dataset of coding-agent trajectories available: 258,000 test-verified trajectories from 51,000 tasks across 1,655 repositories. It is designed to address the shortage of high-quality open training data that holds back open-weight coding models.

The dataset was used to train models with 32 billion and 4 billion parameters, with significant performance improvements in both cases. The fine-tuned Qwen3-32B ranked highest among open-data models in the ≤32B parameter range on SWE-Bench Verified.

Trajectories were generated with Qwen3-Coder-480B and filtered by rejection sampling: of the 258,000 trajectories generated, 155,000 were successful. Tasks are drawn from sources such as R2E-Gym, SWE-Smith, and SWE-Rebench, and all trajectories share a standardized action/observation interface based on the OpenHands scaffold. The work emphasizes training only on successful trajectories to improve task-resolution efficiency. The dataset also underwent a thorough license audit, retaining only trajectories under permissive open-source licenses to support responsible use.

The dataset has limitations: limited adaptability to scaffolds other than the one it was collected with, a focus on bug-fixing tasks, and no modeling of user interaction. Even so, it aims to drive further advances in open-source AI development by providing a foundation for exploring agentic reinforcement learning and larger model scales.
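The rejection-sampling and license-audit steps described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the actual CoderForge pipeline or schema: the `Trajectory` record, its field names, the `keep_successful` and `pass_rate` helpers, and the specific set of licenses treated as permissive are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

# Hypothetical set of licenses treated as permissive; the source does not list them.
PERMISSIVE_LICENSES: Set[str] = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record for one agent rollout on one task."""
    task_id: str
    repo_license: str
    steps: Tuple[Tuple[str, str], ...]  # (action, observation) pairs
    tests_passed: bool                  # outcome of running the task's tests


def keep_successful(
    trajectories: Iterable[Trajectory],
    allowed_licenses: Set[str] = PERMISSIVE_LICENSES,
) -> List[Trajectory]:
    """Rejection sampling: keep only rollouts whose tests passed
    and whose repository license is in the allowed set."""
    return [
        t for t in trajectories
        if t.tests_passed and t.repo_license in allowed_licenses
    ]


def pass_rate(trajectories: Iterable[Trajectory]) -> float:
    """Fraction of rollouts whose tests passed (0.0 for an empty input)."""
    items = list(trajectories)
    if not items:
        return 0.0
    return sum(t.tests_passed for t in items) / len(items)


# Tiny illustrative sample (made-up data).
sample = [
    Trajectory("task-1", "MIT", (("run_tests", "ok"),), True),
    Trajectory("task-1", "MIT", (("run_tests", "fail"),), False),
    Trajectory("task-2", "GPL-3.0", (("run_tests", "ok"),), True),
    Trajectory("task-3", "Apache-2.0", (("run_tests", "ok"),), True),
]
training_set = keep_successful(sample)
print(len(training_set), "of", len(sample), "trajectories kept")
```

For scale, the figures quoted above (155,000 successful out of 258,000 generated) correspond to a pass rate of roughly 60%, and 258,000 trajectories over 51,000 tasks is about five rollouts per task on average.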