vCX-Hard: Benchmarking Leading AI Models on Real Contact Center Calls
Blog post from Retell AI
vCX-Hard is a benchmark developed by Retell to evaluate the AI models behind voice agents on real contact center calls, focusing on whether they stay grounded in the facts and execute the correct actions. It scores models on two axes: non-hallucination rate and tool-call correctness rate. The dataset was curated from production call traffic across a range of industries, with a broad distribution so that high-volume organizations do not dominate. A multi-stage evaluation pipeline identifies cases of hallucination and tool-calling failure, and a panel of leading frontier models grades them.

Model choice matters: even current leading models such as GPT-5.5 achieve only around 88% accuracy on the most challenging calls. Reasoning capability has a significant effect on performance, with reasoning-enabled models generally scoring higher at the cost of increased latency. Scores have improved across model generations, but no model has reached a perfect score, so continued development and adaptation remain necessary for reliable deployment in live environments.
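To make the two scoring axes concrete, here is a minimal sketch of how panel verdicts could be turned into a non-hallucination rate and a tool-call correctness rate. The post does not describe how the panel's grades are combined, so the strict-majority rule, the tie handling, and every function name below are assumptions for illustration, not Retell's actual pipeline.

```python
from typing import Dict, Sequence


def majority(votes: Sequence[bool]) -> bool:
    """Return True when a strict majority of judges passed the case.

    Assumption: ties count as a failure. The source does not say how
    disagreements within the grading panel are resolved.
    """
    return 2 * sum(votes) > len(votes)


def pass_rate(panel_votes: Sequence[Sequence[bool]]) -> float:
    """Fraction of cases that the panel passed by majority vote."""
    if not panel_votes:
        raise ValueError("no cases to score")
    passed = sum(majority(votes) for votes in panel_votes)
    return passed / len(panel_votes)


def score_model(
    hallucination_votes: Sequence[Sequence[bool]],
    tool_call_votes: Sequence[Sequence[bool]],
) -> Dict[str, float]:
    """Score one model on both axes.

    Each inner sequence holds one case's verdicts, one per judge.
    True means "no hallucination" on the first axis and
    "tool call correct" on the second.
    """
    return {
        "non_hallucination_rate": pass_rate(hallucination_votes),
        "tool_call_correctness_rate": pass_rate(tool_call_votes),
    }


if __name__ == "__main__":
    # Made-up verdicts from a hypothetical three-judge panel.
    hallucination_votes = [
        [True, True, False],   # passes 2 to 1
        [False, False, True],  # fails 1 to 2
        [True, True, True],    # passes unanimously
    ]
    tool_call_votes = [
        [True, True, True],
        [False, True, False],
    ]
    print(score_model(hallucination_votes, tool_call_votes))
```

The two axes are scored over separate case lists here because a call that tests grounding need not involve a tool call. If every case were graded on both axes, the same two functions would still apply.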