Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
Blog post from HuggingFace
VAKRA is a benchmark that evaluates AI agents' reasoning and action capabilities in complex, enterprise-like environments by testing compositional reasoning across APIs and documents. Its tasks are multi-step workflows that require agents to interact with over 8,000 locally hosted APIs and domain-aligned document collections spanning 62 domains. The evaluation framework is execution-centric: agents are assessed on whether they execute coherent workflows and produce correct answers. Four capabilities are tested: API chaining using business intelligence APIs, tool selection using dashboard APIs, multi-hop reasoning, and multi-hop, multi-source reasoning with policy adherence. The article reports the performance of various models on these tasks, highlights error types, and emphasizes the gap between surface-level tool competence and robust end-to-end agent reliability. While modern models can select APIs and execute isolated tool calls, they struggle to incorporate external constraints into their reasoning, which limits reliable real-world deployment.
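To make "API chaining" and "execution-centric" evaluation concrete, here is a minimal Python sketch. It is not VAKRA's actual harness or tool set: the three tools, the `TOOLS` registry, and the `run_chain` and `score` helpers are all hypothetical names invented for illustration. The idea it shows is that a workflow is actually executed step by step, with each tool's output fed to the next, and it only counts as correct if it runs to completion and yields the expected answer.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for locally hosted APIs (NOT VAKRA's real tools).
def get_region_id(name: str) -> int:
    return {"EMEA": 1, "APAC": 2}[name]

def get_sales(region_id: int) -> List[float]:
    return {1: [100.0, 250.0], 2: [75.0]}[region_id]

def total(values: List[float]) -> float:
    return sum(values)

# Hypothetical tool registry the agent selects from.
TOOLS: Dict[str, Callable[[Any], Any]] = {
    "get_region_id": get_region_id,
    "get_sales": get_sales,
    "total": total,
}

def run_chain(steps: List[str], initial: Any) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Execute tool calls in order, feeding each output into the next call."""
    value = initial
    trace: List[Tuple[str, Any]] = []
    for name in steps:
        value = TOOLS[name](value)
        trace.append((name, value))
    return value, trace

def score(steps: List[str], initial: Any, expected: Any) -> bool:
    """Execution-centric check: the chain must run and match the expected answer."""
    try:
        answer, _ = run_chain(steps, initial)
    except Exception:
        # A chain that fails mid-execution counts as incorrect.
        return False
    return answer == expected

if __name__ == "__main__":
    good = ["get_region_id", "get_sales", "total"]
    bad = ["get_sales", "total"]  # skips the ID lookup, so the first call fails
    print(score(good, "EMEA", 350.0))
    print(score(bad, "EMEA", 350.0))
```

In this toy setup, a chain that calls each tool correctly in isolation but skips a required intermediate step still fails, which mirrors the gap the article draws between isolated tool competence and end-to-end reliability.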