
IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using ITBench and MAST

Blog post from HuggingFace

Post Details
Author: Ayhan Sebin, Rohan Arora, and Saurabh Jha
Word Count: 2,253
Summary

IBM and UC Berkeley collaborated to investigate why LLM-based agentic systems fail on real-world IT automation tasks, using the ITBench benchmark and the Multi-Agent System Failure Taxonomy (MAST). Applying MAST to ITBench execution traces, the study identified distinct failure patterns across models including Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. Gemini-3-Flash tended to fail from isolated issues such as incorrect verification, whereas GPT-OSS-120B experienced cascading failures driven by a combination of reasoning mismatches and loss of context. The research highlights the importance of distinguishing recoverable failures from fatal ones, and suggests that targeted engineering interventions can significantly improve system robustness. For instance, external verification and stricter termination controls can mitigate overconfidence and termination issues, enhancing reliability in IT automation tasks.
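The two interventions the summary names, external verification and stricter termination controls, can be sketched as a wrapper around an agent's step function. This is a minimal illustration, not code from the study; the function and parameter names (`run_agent`, `step`, `verify`, `max_steps`) are hypothetical.

```python
# Hedged sketch: external verification plus a hard termination cap.
# An external `verify` callback (not the agent's own self-assessment)
# decides acceptance, and `max_steps` bounds retries so an
# overconfident agent cannot loop indefinitely.

def run_agent(step, verify, max_steps=5):
    """Run `step` until `verify` accepts its output or the cap is hit.

    step(i)   -> candidate answer for iteration i
    verify(a) -> True if the external checker accepts answer `a`
    Returns (answer, status): "verified" on success, "terminated" on cap.
    """
    for i in range(max_steps):
        answer = step(i)
        if verify(answer):  # external check, not the agent's own claim
            return answer, "verified"
    # Strict termination control: stop cleanly instead of retrying forever.
    return None, "terminated"


# Toy agent whose third attempt passes the external check.
result, status = run_agent(step=lambda i: i, verify=lambda a: a == 2)
print(result, status)  # -> 2 verified
```

The key design choice is that success is decided outside the agent: a failure of the verifier is recoverable (the loop retries), while exhausting `max_steps` is treated as a fatal, explicitly reported termination rather than a silent hang.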