DSGym: A holistic framework for evaluating and training data science agents
Blog post from Together AI
DSGym is a comprehensive framework for evaluating and training large language model (LLM)-based data science agents. It addresses a limitation of existing benchmarks, which assess isolated skills in varied, inconsistent environments, by integrating diverse data science evaluation suites behind a single API. Standardized abstractions for datasets, agents, and metrics enable fairer comparisons and reduce integration costs.

The framework also expands the evaluation scope with novel scientific analysis tasks and modeling competitions, including 90 bioinformatics tasks and 92 Kaggle competitions. Its modular design makes adding new tasks and evaluation scripts straightforward.

Beyond evaluation, DSGym supports agent training through trajectory generation and synthetic data pipelines; as a demonstration, a 4B model trained on 2,000 generated examples achieved state-of-the-art performance. The benchmarks also reveal that many models rely on memorization rather than actual data analysis, particularly on general tasks. Through systematic investigation, DSGym aims to push data science agents toward genuinely reasoning about data rather than merely recalling patterns.
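To make the "standardized abstractions plus modular task registration" idea concrete, here is a minimal sketch of what such a unified harness could look like. All names (`Task`, `register_task`, `evaluate`, `exact_match`) are illustrative assumptions, not DSGym's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    """One benchmark item: a dataset/prompt, a reference, and its own metric."""
    name: str
    prompt: str                           # task description shown to the agent
    reference: str                        # ground truth used for scoring
    metric: Callable[[str, str], float]   # (prediction, reference) -> score

def exact_match(prediction: str, reference: str) -> float:
    """Simplest possible metric: 1.0 on an exact (whitespace-stripped) match."""
    return float(prediction.strip() == reference.strip())

TASKS: Dict[str, Task] = {}

def register_task(task: Task) -> None:
    """Adding a new benchmark is just registering one more Task object."""
    TASKS[task.name] = task

def evaluate(agent: Callable[[str], str], task_names: List[str]) -> Dict[str, float]:
    """Run the agent on each task and score it with that task's own metric,
    so every suite goes through the same loop regardless of origin."""
    return {
        name: TASKS[name].metric(agent(TASKS[name].prompt), TASKS[name].reference)
        for name in task_names
    }

# Toy usage: one registered task and a trivial stand-in "agent".
register_task(Task(
    name="toy-mean",
    prompt="What is the mean of [2, 4, 6]?",
    reference="4.0",
    metric=exact_match,
))

scores = evaluate(lambda prompt: "4.0", ["toy-mean"])
print(scores)  # {'toy-mean': 1.0}
```

The key design point this sketch illustrates is that each task carries its own metric, so new evaluation suites plug in without touching the harness loop.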