ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Post Details

Company

Hugging Face

Date Published

May 27, 2026

Author

Ayhan Sebin, Saurabh Jha, and Rohan Arora

Word Count

889

Company Posts That Month

55

Language

-

Hacker News Points

-

Post removed?

No

Source URL

huggingface.co/blog/ibm-research/itbench-aa

Summary

ITBench-AA is a benchmark developed by Artificial Analysis and IBM to evaluate AI models on agentic enterprise IT tasks, focusing initially on Site Reliability Engineering (SRE). It draws on IBM's expertise in enterprise IT operations and assesses a model's ability to diagnose complex Kubernetes systems by analyzing logs, traces, and infrastructure dependencies to identify the root causes of incidents. All frontier models scored below 50% on these tasks, which underscores how difficult the benchmark is. The top-performing model, Claude Opus 4.7, achieved a 47% success rate, with other models such as GPT-5.5 and Qwen3.7 Max following closely. The methodology uses the Stirrup reference harness, a sandboxed environment in which models interact through shell commands, and scores performance by precision at full recall. Running every model in the same harness keeps the comparison consistent. The results show that longer investigation trajectories did not necessarily yield higher accuracy, and that cost varied significantly among models.
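
The summary does not define "precision at full recall", so the following is only a sketch of one plausible reading, not the benchmark's actual scoring code. It assumes an answer earns credit only when it names every ground-truth root cause, and that the credit is then the fraction of named entities that are correct. The function name and the entity names are hypothetical.

```python
def precision_at_full_recall(predicted, ground_truth):
    """Return precision if every ground-truth root cause was predicted, else 0.0.

    Assumed interpretation of the metric; the benchmark's real scorer may differ.
    """
    predicted_set = set(predicted)
    truth_set = set(ground_truth)

    # Nothing predicted, or nothing to find: no credit under this reading.
    if not predicted_set or not truth_set:
        return 0.0

    # Full recall requires every true root cause to appear in the prediction.
    if not truth_set.issubset(predicted_set):
        return 0.0

    # With full recall, the correct predictions are exactly the truth set.
    return len(truth_set) / len(predicted_set)


# Hypothetical incident with a single true root cause.
truth = ["checkout-db"]
print(precision_at_full_recall(["checkout-db"], truth))
print(precision_at_full_recall(["checkout-db", "cart-service"], truth))
print(precision_at_full_recall(["cart-service"], truth))
```

Under this reading, listing extra suspects lowers the score and omitting a true cause zeroes it, so a model is rewarded for a complete and narrowly targeted diagnosis.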

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Kubernetes	6	1,965	371	106	-15%
AI Guardrails	1	216	116	52	-40%
Observability	1	3,421	707	180	-24%
OpenTelemetry	1	945	122	49	-21%
