
EnterpriseBench: CoreCraft – Measuring AI Agents in Chaotic, Enterprise RL Environments

Blog post from Surge AI

Post Details
Word Count: 4,038
Language: English
Summary

EnterpriseBench is a suite of reinforcement learning environment benchmarks that evaluates AI agents on high-value job functions in realistic enterprise settings. Its initial test environment is a startup called CoreCraft, where agents must navigate complex databases, manage customer interactions, and adhere to company policies, mirroring real-world enterprise operations. State-of-the-art models such as GPT-5.2 and Claude Opus 4.6 solved fewer than 30% of the tasks, often failing because of hallucinations and reasoning errors. Training GLM 4.6 produced gains in executing multi-step workflows and handling constraints. These gains appeared within CoreCraft and also transferred to external benchmarks, which suggests the model acquired generalizable skills. The initiative aims to expand by building environments for other job families, making AI more practically applicable in enterprise contexts.