IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using ITBench and MAST
Blog post from HuggingFace
IBM and UC Berkeley collaborated to investigate why agentic systems built on large language models (LLMs) fail on real-world IT automation tasks, combining the ITBench benchmark with the MAST framework. The study applied the Multi-Agent System Failure Taxonomy (MAST) to ITBench execution traces and identified distinct failure patterns across models including Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B.

The analysis found that Gemini-3-Flash tends to fail due to isolated issues such as incorrect verification, while models like GPT-OSS-120B experience cascading failures driven by a combination of reasoning mismatches and loss of context. The research highlights the importance of distinguishing recoverable failures from fatal ones, and suggests that targeted engineering interventions can significantly improve system robustness. For instance, adding external verification and stricter termination controls can mitigate overconfidence and runaway-loop issues, making agents more reliable on IT automation tasks.
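To make the two interventions concrete, here is a minimal sketch of an agent loop that wires in both: an external verifier (so success is judged by an independent check rather than the model's own confidence) and a strict step budget (so a stuck agent terminates instead of looping). All names here (`AgentRunner`, `step`, `verify`, `max_steps`) are hypothetical illustrations, not APIs from ITBench or MAST.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentRunner:
    """Toy agent loop with two interventions discussed in the post:
    an external verifier and a strict termination (step budget) control."""
    step: Callable[[str], str]     # produces the next state from the current one
    verify: Callable[[str], bool]  # external check, independent of the agent's own claims
    max_steps: int = 5             # strict termination control: hard cap on iterations
    trace: List[str] = field(default_factory=list)  # execution trace for later failure analysis

    def run(self, state: str) -> str:
        for _ in range(self.max_steps):
            state = self.step(state)
            self.trace.append(state)
            # Trust the external verifier, not model confidence, to declare success.
            if self.verify(state):
                return "success"
        # Budget exhausted: surface a termination failure instead of looping forever.
        return "failed: step budget exhausted"


# Toy usage: the "agent" appends a token each step; the verifier
# checks for three tokens. Succeeds on the third iteration.
runner = AgentRunner(
    step=lambda s: s + "x",
    verify=lambda s: s.endswith("xxx"),
    max_steps=5,
)
result = runner.run("")
print(result)             # → success
print(len(runner.trace))  # → 3
```

The design point is that both failure classes identified in the study map to a guard in the loop: overconfident self-reported success is caught by `verify`, and non-terminating behavior is caught by `max_steps`, with `trace` retained for post-hoc MAST-style classification.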