Cross-Benchmark Generalization for Long-Horizon Agentic Tasks
Blog post from Surge AI
This post analyzes capability transfer from reinforcement learning (RL) environments. It argues that models must be evaluated beyond the training distribution, because in-distribution gains alone cannot distinguish general capability from overfitting or narrow specialization. The study used an RL environment from Surge AI covering 27 task categories, such as spreadsheets and document editing, and measured transfer on three external benchmarks: Toolathlon, τ²-Bench, and BFCL-V4. These benchmarks were chosen to cover a range of agentic capabilities and were disjoint from the training set.

The Qwen3.5-122B-A10B model was trained with a two-stage pipeline: supervised fine-tuning (SFT) followed by RL. Reward sparsity, a common problem in long-horizon tasks, was addressed by using dense reward structures.

The trained model improved on all three external benchmarks. It performed comparably to GPT-5.5 at medium reasoning effort on Toolathlon and τ²-Bench, and exceeded it on BFCL-V4 at pass@4. The post also notes behavioral changes, such as parallel tool invocation and improved task closure, which point to an enhanced ability to generalize beyond the training environment.
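
For readers unfamiliar with the pass@4 figure mentioned above, the sketch below shows one common way to compute pass@k: the unbiased estimator 1 - C(n-c, k) / C(n, k), where n attempts were sampled for a task and c of them succeeded. This is an illustration only. The summary does not say how the post computed pass@4, and the function names and the sample data are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled attempts passes.

    n: total attempts sampled for the task
    c: number of those attempts that passed
    k: attempt budget being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failures, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the chance that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each item is (n attempts, c passes)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-task outcomes, NOT data from the post.
    example = [(4, 1), (4, 0), (4, 3)]
    print(mean_pass_at_k(example, k=4))
```

When n equals k, as in a four-attempt run scored at pass@4, the estimator reduces to checking whether any attempt on the task succeeded. Sampling more than k attempts per task gives a lower-variance estimate of the same quantity.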