Launching Agent Leaderboard v2: The Enterprise-Grade Benchmark for AI Agents

Post Details

Company: Galileo
Date Published: July 17, 2025
Author: Pratik Bhavsar
Word Count: 4,316
Language: English
Hacker News Points: -
Source URL: galileo.ai/blog/agent-leaderboard-v2

Summary

Agent Leaderboard v2 evaluates AI agents in realistic enterprise settings. It addresses limitations of its predecessor by introducing more complex, multi-turn, domain-specific scenarios across five industries: banking, healthcare, telecom, investment, and insurance. Models are assessed on two key metrics: Action Completion (AC), which measures the agent's ability to accomplish user goals, and Tool Selection Quality (TSQ), which evaluates the precision and appropriateness of tool usage. On the updated leaderboard, GPT-4.1 leads in overall AC with a score of 62%, while Gemini-2.5-flash excels in TSQ with 94%. The evaluation uses a purpose-built synthetic dataset, with tools and personas crafted to simulate realistic user interactions and reflect the complexity of real-world tasks. This gives enterprises actionable insight into how models perform in specific domains, filling gaps left by generic benchmarks with a more nuanced, industry-specific view of model capabilities.
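
To make the two metrics concrete, here is a minimal sketch of how scores of this kind could be aggregated. It is not Galileo's implementation: the summary does not specify how AC or TSQ are computed, so this sketch assumes AC is the fraction of scenarios in which every user goal was completed and TSQ is the fraction of tool calls judged correct. The `ScenarioResult` type, its fields, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    """Hypothetical per-scenario record; field names are illustrative only."""
    goals_total: int
    goals_completed: int
    tool_calls_total: int
    tool_calls_correct: int


def action_completion(results):
    """Assumed AC: share of scenarios where all user goals were completed."""
    if not results:
        return 0.0
    done = sum(1 for r in results if r.goals_completed == r.goals_total)
    return done / len(results)


def tool_selection_quality(results):
    """Assumed TSQ: share of all tool calls that were judged correct."""
    total = sum(r.tool_calls_total for r in results)
    if total == 0:
        return 0.0
    correct = sum(r.tool_calls_correct for r in results)
    return correct / total


# Made-up sample data, not taken from the leaderboard.
results = [
    ScenarioResult(goals_total=3, goals_completed=3, tool_calls_total=5, tool_calls_correct=5),
    ScenarioResult(goals_total=2, goals_completed=1, tool_calls_total=4, tool_calls_correct=3),
    ScenarioResult(goals_total=4, goals_completed=4, tool_calls_total=6, tool_calls_correct=6),
    ScenarioResult(goals_total=1, goals_completed=0, tool_calls_total=5, tool_calls_correct=4),
]

ac = action_completion(results)
tsq = tool_selection_quality(results)
print(f"AC: {ac:.0%}, TSQ: {tsq:.0%}")
```

The sample data illustrates why the two metrics are reported separately: an agent can choose the right tools most of the time and still fail to complete every goal in a scenario, so a high TSQ does not imply a high AC.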