Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

Post Details

Company

HuggingFace

Date Published

April 15, 2026

Authors

Ankita Naik, danish, Ben, Anupama Murthi, and Praveen

Word Count

3,111

Company Posts That Month

61

Source URL

huggingface.co/blog/ibm-research/vakra-benchmark-analysis

Summary

VAKRA is a benchmark for evaluating how well AI agents reason and act in complex, enterprise-like environments, with a focus on compositional reasoning across APIs and documents. Its tasks are multi-step workflows that require agents to interact with over 8,000 locally hosted APIs and domain-aligned document collections across 62 domains. Evaluation is execution-centric: agents are assessed on whether they execute coherent workflows and produce correct answers. Four capabilities are tested: API chaining using business intelligence APIs, tool selection using dashboard APIs, multi-hop reasoning, and multi-hop, multi-source reasoning with policy adherence. The article reports how various models perform on these tasks and highlights the types of errors they make. It emphasizes the gap between surface-level tool competence and robust end-to-end agent reliability: modern models can select APIs and execute isolated tool calls, but they struggle to incorporate external constraints into their reasoning, which reliable real-world deployment requires.

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
RAG	11	941	216	85	-48%
MCP	5	6,108	613	170	+36%
AI Agents	2	4,430	1,100	236	-3%
LLM	2	5,932	1,046	223	-2%