DSGym: A holistic framework for evaluating and training data science agents
Blog post from Together AI
DSGym is a comprehensive framework for evaluating and training large language model (LLM)-based data science agents. It addresses a limitation of existing benchmarks, which assess isolated skills in varied, inconsistent environments, by integrating diverse data science evaluation suites behind a single API. Standardized abstractions for datasets, agents, and metrics enable fairer comparisons and reduce integration costs.

The framework also expands the evaluation scope with novel scientific analysis tasks and modeling competitions, including 90 bioinformatics tasks and 92 Kaggle competitions. Its modular design makes adding new tasks and evaluation scripts straightforward.

Beyond evaluation, DSGym supports agent training through trajectory generation and synthetic data pipelines; as a demonstration, a 4B model trained on 2,000 generated examples achieved state-of-the-art performance. The benchmarks also reveal that many models rely on memorization rather than actual data analysis, particularly on general tasks. Through systematic investigation, DSGym aims to push data science agents toward genuinely reasoning about data rather than merely recalling patterns.
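To make the "standardized abstractions plus modular task registration" idea concrete, here is a minimal sketch of what such a unified harness could look like. All names (`Task`, `register_task`, `evaluate`, `exact_match`) are illustrative assumptions, not DSGym's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    """One benchmark item: a dataset/prompt, a reference, and its own metric."""
    name: str
    prompt: str                           # task description shown to the agent
    reference: str                        # ground truth used for scoring
    metric: Callable[[str, str], float]   # (prediction, reference) -> score

def exact_match(prediction: str, reference: str) -> float:
    """Simplest possible metric: 1.0 on an exact (whitespace-stripped) match."""
    return float(prediction.strip() == reference.strip())

TASKS: Dict[str, Task] = {}

def register_task(task: Task) -> None:
    """Adding a new benchmark is just registering one more Task object."""
    TASKS[task.name] = task

def evaluate(agent: Callable[[str], str], task_names: List[str]) -> Dict[str, float]:
    """Run the agent on each task and score it with that task's own metric,
    so every suite goes through the same loop regardless of origin."""
    return {
        name: TASKS[name].metric(agent(TASKS[name].prompt), TASKS[name].reference)
        for name in task_names
    }

# Toy usage: one registered task and a trivial stand-in "agent".
register_task(Task(
    name="toy-mean",
    prompt="What is the mean of [2, 4, 6]?",
    reference="4.0",
    metric=exact_match,
))

scores = evaluate(lambda prompt: "4.0", ["toy-mean"])
print(scores)  # {'toy-mean': 1.0}
```

The key design point this sketch illustrates is that each task carries its own metric, so new evaluation suites plug in without touching the harness loop.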