Company
Date Published
Author
Pratik Bhavsar
Word count
4316
Language
English
Hacker News points
None

Summary

Agent Leaderboard v2 evaluates AI agents in realistic enterprise settings. It addresses the limitations of its predecessor by introducing more complex, multi-turn, domain-specific scenarios across industries such as banking, healthcare, telecom, investment, and insurance. Models are assessed on two metrics: Action Completion (AC), which measures whether the agent accomplishes the user's goals, and Tool Selection Quality (TSQ), which evaluates how precisely and appropriately the agent uses tools. On the updated leaderboard, GPT-4.1 leads in overall AC with a score of 62%, while Gemini-2.5-flash leads in TSQ with 94%. The evaluation runs on a synthetic dataset built specifically for it, with tools and personas crafted to simulate realistic user interactions and reflect the complexity of real-world tasks. The results give enterprises actionable insight into how models perform in specific domains, filling gaps left by generic benchmarks and offering a more nuanced view of model capabilities in industry-specific contexts.
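
To make the two metrics concrete, here is a minimal Python sketch of how scores of this kind could be aggregated. It is an illustration under stated assumptions, not the leaderboard's actual scoring method, which this summary does not describe. The `Scenario` and `ToolCall` types, their fields, and the pass/fail rules are all hypothetical: AC is assumed to be the share of scenarios in which every user goal was completed, and TSQ the share of tool calls that used an expected tool with valid arguments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    name: str          # tool the agent chose to call (hypothetical field)
    args_valid: bool   # whether the arguments were judged valid (hypothetical field)


@dataclass
class Scenario:
    goals_total: int           # number of user goals in the conversation
    goals_completed: int       # number of goals the agent actually fulfilled
    expected_tools: List[str]  # tools considered appropriate for this scenario
    calls: List[ToolCall]      # tool calls the agent made


def action_completion(scenarios: List[Scenario]) -> float:
    """Assumed AC: fraction of scenarios where every user goal was completed."""
    if not scenarios:
        return 0.0
    done = sum(1 for s in scenarios if s.goals_completed == s.goals_total)
    return done / len(scenarios)


def tool_selection_quality(scenarios: List[Scenario]) -> float:
    """Assumed TSQ: fraction of tool calls using an expected tool with valid args."""
    total = 0
    correct = 0
    for s in scenarios:
        for call in s.calls:
            total += 1
            if call.name in s.expected_tools and call.args_valid:
                correct += 1
    return correct / total if total else 0.0
```

Under these assumptions the two metrics can diverge: an agent may pick the right tools on nearly every call and still leave some user goals unfinished, which is consistent with the leaderboard reporting different leaders for AC and TSQ.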