ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Ayhan Sebin, Saurabh Jha, and Rohan Arora
Word Count: 889
Language: -
Hacker News Points: -
Summary

ITBench-AA is a benchmark developed by Artificial Analysis and IBM to evaluate AI models on agentic enterprise IT tasks, focusing initially on Site Reliability Engineering (SRE). It tests whether a model can diagnose incidents in complex Kubernetes systems by analyzing logs, traces, and infrastructure dependencies to identify the root cause. The tasks draw on IBM's expertise in enterprise IT operations, and every frontier model scored below 50% on them, which underlines how difficult the benchmark is. The top-performing model, Claude Opus 4.7, achieved a 47% success rate, with GPT-5.5 and Qwen3.7 Max close behind. Models are run in the Stirrup reference harness, a sandboxed environment in which they interact with the system through shell commands, and are scored on precision at full recall. Using the same harness for every model keeps the comparison consistent. The results also show that longer investigation trajectories did not necessarily yield higher accuracy, and that cost varied significantly across models.
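
The summary does not spell out how "precision at full recall" is computed, so the following is only a minimal sketch under an assumed, common ranked-retrieval definition: walk the model's ranked list of suspected root causes, stop at the first position where every ground-truth root cause has been named, and report the fraction of entries up to that position that were correct. A list that never covers all ground-truth causes scores 0. The function name, the entity names, and this definition itself are illustrative assumptions, not ITBench-AA's actual scoring code.

```python
from typing import Sequence, Set


def precision_at_full_recall(predicted: Sequence[str], truth: Set[str]) -> float:
    """Precision at the shortest prefix of `predicted` that contains all of `truth`.

    Assumed definition (not taken from the benchmark): returns 0.0 when the
    ranked list never reaches full recall.
    """
    if not truth:
        raise ValueError("truth must contain at least one root cause")

    found = set()
    for rank, item in enumerate(predicted, start=1):
        if item in truth:
            found.add(item)
        # Full recall reached: every true root cause appears in the first `rank` entries.
        if len(found) == len(truth):
            return len(found) / rank
    return 0.0


# Hypothetical incident with two true root causes.
truth = {"checkout-service", "redis-cart"}

# Both causes named, with one wrong guess in between: 2 correct out of 3 entries.
score_noisy = precision_at_full_recall(
    ["checkout-service", "frontend", "redis-cart"], truth
)

# Only one of the two causes named: full recall is never reached.
score_partial = precision_at_full_recall(["checkout-service", "frontend"], truth)
```

Under this definition a model is rewarded for naming every root cause with as few wrong guesses as possible, and guesses listed after full recall is reached do not change the score. A set-based variant that penalizes every extra guess would be an equally plausible reading of the metric.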