Tipping the scales: Merging weak agents into a state-of-the-art deep researcher
Blog post from AI21 Labs
DeepResearch Bench II (DRB II) evaluates deep research agents against 9,430 expert-written rubrics across 132 tasks, with Information Recall as a key metric. Instead of building a stronger individual agent, the authors reached a top leaderboard score of 64.38 by merging the outputs of the agents ranked 7th through 13th, none of which scored above 45 on its own. The approach works because different agents' reports cover different facts, so combining them raises Information Recall; here, combining existing agents outperformed refining a single one. The method is agglomerative pairwise merging: reports are fused two at a time, iteratively, with each fusion preserving the factual information of both inputs, until a single report remains. No new agent is developed. The post suggests that as the number of available agents grows, the ability to extract more comprehensive coverage from their combined outputs will become increasingly significant.
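The merging loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `merge_pair` is a hypothetical stand-in that keeps every distinct line from both reports (the post does not detail the actual fusion step here), and pairing adjacent reports in rounds is an assumed ordering that the post does not specify.

```python
from typing import Callable, List


def merge_pair(report_a: str, report_b: str) -> str:
    """Hypothetical stand-in for the fusion step.

    Keeps every distinct non-empty line from both reports, in first-seen
    order, so no fact present in either input is dropped.
    """
    seen = set()
    merged = []
    for line in report_a.splitlines() + report_b.splitlines():
        text = line.strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            merged.append(text)
    return "\n".join(merged)


def agglomerative_merge(
    reports: List[str],
    merge: Callable[[str, str], str] = merge_pair,
) -> str:
    """Fuse reports two at a time until a single report remains.

    Assumption: adjacent reports are paired in each round; an unpaired
    report is carried over unchanged to the next round.
    """
    if not reports:
        raise ValueError("need at least one report")
    level = list(reports)
    while len(level) > 1:
        next_level = [
            merge(level[i], level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
    return level[0]


# Toy reports, each covering only part of the facts.
toy_reports = ["fact A\nfact B", "fact B\nfact C", "fact C\nfact D"]
merged_report = agglomerative_merge(toy_reports)
```

In this toy run, each input report covers two of the four facts, while the merged report covers all four. That is the recall effect the post attributes to merging, shown here with made-up inputs rather than benchmark data.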