IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using ITBench and MAST
Blog post from HuggingFace
IBM and UC Berkeley collaborated to investigate why agentic systems built on large language models (LLMs) fail on real-world IT automation tasks, combining the ITBench benchmark with the MAST framework. The study applied the Multi-Agent System Failure Taxonomy (MAST) to ITBench execution traces and identified distinct failure patterns across models including Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B.

The analysis found that Gemini-3-Flash tends to fail due to isolated issues such as incorrect verification, while models like GPT-OSS-120B experience cascading failures driven by a combination of reasoning mismatches and loss of context. The research highlights the importance of distinguishing recoverable failures from fatal ones, and suggests that targeted engineering interventions can significantly improve system robustness. For instance, adding external verification and stricter termination controls can mitigate overconfidence and runaway-loop issues, making agents more reliable on IT automation tasks.
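To make the two interventions concrete, here is a minimal sketch of an agent loop that wires in both: an external verifier (so success is judged by an independent check rather than the model's own confidence) and a strict step budget (so a stuck agent terminates instead of looping). All names here (`AgentRunner`, `step`, `verify`, `max_steps`) are hypothetical illustrations, not APIs from ITBench or MAST.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentRunner:
    """Toy agent loop with two interventions discussed in the post:
    an external verifier and a strict termination (step budget) control."""
    step: Callable[[str], str]     # produces the next state from the current one
    verify: Callable[[str], bool]  # external check, independent of the agent's own claims
    max_steps: int = 5             # strict termination control: hard cap on iterations
    trace: List[str] = field(default_factory=list)  # execution trace for later failure analysis

    def run(self, state: str) -> str:
        for _ in range(self.max_steps):
            state = self.step(state)
            self.trace.append(state)
            # Trust the external verifier, not model confidence, to declare success.
            if self.verify(state):
                return "success"
        # Budget exhausted: surface a termination failure instead of looping forever.
        return "failed: step budget exhausted"


# Toy usage: the "agent" appends a token each step; the verifier
# checks for three tokens. Succeeds on the third iteration.
runner = AgentRunner(
    step=lambda s: s + "x",
    verify=lambda s: s.endswith("xxx"),
    max_steps=5,
)
result = runner.run("")
print(result)             # → success
print(len(runner.trace))  # → 3
```

The design point is that both failure classes identified in the study map to a guard in the loop: overconfident self-reported success is caught by `verify`, and non-terminating behavior is caught by `max_steps`, with `trace` retained for post-hoc MAST-style classification.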