
Guardian Agents Benchmark

Blog post from Vectara

Post Details
Company
Vectara
Date Published
Author
Vishal Naik and Chenyu Xu
Word Count
2,057
Language
English
Hacker News Points
-
Summary

Agentic AI platforms represent a significant evolution in AI performance evaluation, requiring new benchmarks that focus on decision quality, tool usage, and workflow execution rather than traditional text-generation metrics. Current benchmarks fall short because they either test isolated tool-use prediction or simulate agent behavior in artificial environments, failing to capture real-world complexities. To address this, a new platform-agnostic benchmark has been developed to evaluate agents within real agentic platforms, assessing their ability to execute workflows accurately across multiple domains, such as email management, calendar scheduling, and financial analysis.

This benchmark emphasizes both response correctness and action trace correctness, revealing that while agents often produce fluent responses, they struggle with correct tool usage and workflow sequencing. To improve reliability, the concept of "Guardian Agents" is introduced as an early-stage validation layer that checks for unnecessary tools, missing required tools, and argument correctness before execution, aiming to reduce errors and enhance agent safety. The integration of Guardian Agents into the Vectara platform as a pre-execution safety feature is planned, with the goal of increasing the reliability and safety of agentic AI in real-world applications.
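The three pre-execution checks attributed to Guardian Agents (unnecessary tools, missing required tools, argument correctness) can be sketched as a simple plan validator. This is a minimal illustration, not Vectara's implementation; the names `ToolCall`, `validate_plan`, and the schema format are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    # Hypothetical representation of one tool call an agent plans to make.
    name: str
    args: dict

@dataclass
class GuardianReport:
    unnecessary: list = field(default_factory=list)
    missing: list = field(default_factory=list)
    bad_args: list = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not (self.unnecessary or self.missing or self.bad_args)

def validate_plan(plan, required_tools, arg_schemas):
    """Check a proposed tool-call plan BEFORE any tool executes.

    plan:           list of ToolCall the agent intends to run
    required_tools: set of tool names the task actually needs
    arg_schemas:    map of tool name -> expected argument names
    """
    report = GuardianReport()
    planned = {call.name for call in plan}
    # Check 1: tools the task never needed.
    report.unnecessary = sorted(planned - required_tools)
    # Check 2: required tools the agent forgot to call.
    report.missing = sorted(required_tools - planned)
    # Check 3: calls whose arguments don't match the tool's schema.
    for call in plan:
        schema = arg_schemas.get(call.name)
        if schema is not None and set(call.args) != set(schema):
            report.bad_args.append(call.name)
    return report
```

In this sketch, a plan is only handed to the execution layer when `report.ok` is true; otherwise the report's three lists tell the agent (or a human reviewer) exactly which of the three checks failed, mirroring the "reduce errors before execution" goal described above.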