
Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published:
Author: Ankita Naik, danish, Ben, Anupama Murthi, and Praveen
Word Count: 3,111
Company Posts That Month: 61
Language: -
Hacker News Points: -
Summary

VAKRA is a benchmark that evaluates the reasoning and action capabilities of AI agents in complex, enterprise-like environments by testing compositional reasoning across APIs and documents. Agents must complete multi-step workflows by interacting with more than 8,000 locally hosted APIs and domain-aligned document collections spanning 62 domains. The evaluation framework uses execution-centric metrics: agents are assessed on whether they execute coherent workflows and produce correct answers. Four capabilities are tested: API chaining using business intelligence APIs, tool selection using dashboard APIs, multi-hop reasoning, and multi-hop, multi-source reasoning with policy adherence. The article details the performance of various models on these tasks, highlights error types, and emphasizes the gap between surface-level tool competence and robust end-to-end agent reliability. While modern models can select APIs and execute isolated tool calls, they struggle to incorporate external constraints into their reasoning, which reliable real-world deployment requires.
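
To make "API chaining" and "execution-centric" scoring concrete, here is a minimal Python sketch. It is not VAKRA's harness: the three toy tools, the `TOOLS` registry, and the `run_chain` and `score` helpers are hypothetical names invented for illustration, and the pass/fail rule (the chain must run and the final answer must match) is an assumption about what an execution-centric check could look like.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical toy tools standing in for locally hosted APIs.
# None of these names or data values come from VAKRA.
def get_customer_id(name: str) -> int:
    return {"acme": 7, "globex": 12}[name.lower()]

def get_open_orders(customer_id: int) -> List[int]:
    return {7: [101, 102, 103], 12: [201]}[customer_id]

def count_items(items: List[int]) -> int:
    return len(items)

TOOLS: Dict[str, Callable[[Any], Any]] = {
    "get_customer_id": get_customer_id,
    "get_open_orders": get_open_orders,
    "count_items": count_items,
}

def run_chain(steps: List[str], first_input: Any) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Run tools in order, feeding each output into the next call.

    Returns the final value and a trace of (tool name, output) pairs.
    """
    value = first_input
    trace: List[Tuple[str, Any]] = []
    for name in steps:
        value = TOOLS[name](value)  # KeyError if the tool does not exist
        trace.append((name, value))
    return value, trace

def score(steps: List[str], first_input: Any, expected: Any) -> bool:
    """Assumed execution-centric check: every call must execute
    and the final answer must equal the expected one."""
    try:
        answer, _ = run_chain(steps, first_input)
    except (KeyError, TypeError):
        return False  # a failed call anywhere fails the whole workflow
    return answer == expected
```

For example, `score(["get_customer_id", "get_open_orders", "count_items"], "Acme", 3)` passes, while the same tools called in the wrong order fail because an intermediate call cannot execute. This mirrors the summary's point that selecting the right tools is not enough if the workflow as a whole does not run to a correct answer.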

Trends Found in this Post
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| RAG | 11 | 941 | 216 | 85 | -48% |
| MCP | 5 | 6,108 | 613 | 170 | +36% |
| AI Agents | 2 | 4,430 | 1,100 | 236 | -3% |
| LLM | 2 | 5,932 | 1,046 | 223 | -2% |