Company:
Date Published:
Author: Conor Bronsdon
Word count: 1696
Language: English
Hacker News points: None

Summary

The comprehensive survey on LLM-agent evaluation highlights critical challenges and gaps in current evaluation methodologies for AI agents. While these agents can perform complex tasks such as drafting contracts and triaging customer tickets, deploying them live raises concerns about safety, cost-efficiency, and reliability. The survey synthesizes insights from over 100 benchmarks and frameworks into four dimensions: fundamental capabilities, application-specific tasks, generalist reasoning, and evaluation frameworks. It finds that traditional metrics often fail to capture the non-deterministic, emergent behaviors of autonomous agents, and that even capable agents achieve low success rates on difficult tasks. The study emphasizes the importance of addressing open evaluation challenges, including safety compliance, cost-efficiency, fine-grained analysis, scalability, and realistic dynamic environments. As the field evolves, the survey argues that a multi-dimensional evaluation approach is essential for building safer and more reliable agent systems, underscoring the need to integrate safety, cost, and diagnostic measures into daily workflows to ensure trustworthy deployments.
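To make the multi-dimensional idea concrete, the sketch below shows one possible way to record and combine task success, safety compliance, and cost for an agent run instead of relying on a single accuracy number. This is a minimal illustration, not the survey's own method: the `AgentEvalResult` fields, the budgets, and the weighting in `aggregate_score` are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical multi-dimensional evaluation record. Field names, budgets,
# and weights are illustrative assumptions, not taken from the survey.
@dataclass
class AgentEvalResult:
    task_success: float       # fraction of tasks completed correctly (0.0-1.0)
    safety_compliance: float  # fraction of runs with no safety/policy violations (0.0-1.0)
    cost_per_task_usd: float  # average API + tool spend per task
    latency_s: float          # average wall-clock time per task, in seconds


def aggregate_score(result: AgentEvalResult,
                    cost_budget_usd: float = 0.50,
                    latency_budget_s: float = 30.0) -> float:
    """Combine capability, safety, and efficiency into one bounded score.

    The budgets and weights are assumptions for illustration; in practice
    each dimension is often reported separately rather than collapsed
    into a single number.
    """
    # Penalize cost and latency only when they exceed the stated budgets.
    cost_factor = min(1.0, cost_budget_usd / max(result.cost_per_task_usd, 1e-9))
    latency_factor = min(1.0, latency_budget_s / max(result.latency_s, 1e-9))
    # Safety acts as a hard multiplier so an unsafe agent cannot score well.
    return result.task_success * result.safety_compliance * (
        0.7 + 0.15 * cost_factor + 0.15 * latency_factor
    )


if __name__ == "__main__":
    run = AgentEvalResult(task_success=0.62, safety_compliance=0.95,
                          cost_per_task_usd=0.80, latency_s=42.0)
    print(f"aggregate score: {aggregate_score(run):.3f}")
```

Reporting the raw dimensions alongside any aggregate keeps the fine-grained, diagnostic view the survey calls for, since a single collapsed score can hide exactly the safety and cost regressions it is meant to surface.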