EnterpriseBench: CoreCraft – Measuring AI Agents in Chaotic, Enterprise RL Environments
Blog post from Surge AI
EnterpriseBench is a suite of reinforcement learning environment benchmarks that evaluates AI agents on high-value job functions in realistic enterprise settings. Its initial test environment is a startup called CoreCraft, which challenges agents with tasks such as navigating complex databases, managing customer interactions, and adhering to company policies, mirroring real-world enterprise operations.

Even state-of-the-art models like GPT-5.2 and Claude Opus 4.6 solved fewer than 30% of the tasks, often faltering on hallucinations and reasoning errors. Training the GLM 4.6 model in the environment produced clear gains in executing multi-step workflows and handling constraints. These gains were not confined to CoreCraft: they transferred to external benchmarks as well, suggesting the model acquired generalizable skills. The initiative plans to expand by building environments for other job families, broadening the practical applicability of AI agents in enterprise contexts.
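The headline numbers above are pass rates: the fraction of tasks an agent completes to the environment's satisfaction. As an illustration only, a minimal harness could look like the sketch below; every name here (the `Task` type, the checker functions, the toy agent) is invented for this example and is not EnterpriseBench's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical task record: an id plus a checker that decides
# whether the agent's final answer satisfies the task.
@dataclass
class Task:
    task_id: str
    check: Callable[[str], bool]

def evaluate(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Return the fraction of tasks the agent solves (pass rate)."""
    if not tasks:
        return 0.0
    solved = sum(1 for t in tasks if t.check(agent(t.task_id)))
    return solved / len(tasks)

# Toy usage: an "agent" that only answers one of two tasks correctly,
# so its pass rate is 0.5.
tasks = [
    Task("lookup-customer", lambda answer: answer == "ok"),
    Task("apply-refund-policy", lambda answer: answer == "refund-approved"),
]
toy_agent = lambda task_id: "ok" if task_id == "lookup-customer" else "unsure"
print(evaluate(toy_agent, tasks))  # prints 0.5
```

A real harness would additionally run each task inside the simulated environment (databases, inboxes, policy documents) and score full multi-step trajectories rather than a single string answer, but the reported metric reduces to this same solved-over-total ratio.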