vCX-Hard: Benchmarking Leading AI Models on Real Contact Center Calls

Post Details

Company

Retell AI

Date Published

June 5, 2026

Author

-

Word Count

2,176

Company Posts That Month

44

Language

English

Hacker News Points

-

Post removed?

No

Source URL

www.retellai.com/blog/vcx-hard-benchmark

Summary

vCX-Hard is a benchmark developed by Retell to evaluate the AI models behind voice agents on real contact center calls, focusing on whether a model stays grounded in truth and executes the correct actions. It scores models on two axes: non-hallucination rate and tool-call correctness rate. The dataset was curated from production call traffic across various industries to give a broad distribution and avoid domination by high-volume organizations. A multi-stage evaluation pipeline identifies cases of hallucination and tool-calling failure, and a panel of leading frontier models grades them. Model choice matters: current leading models such as GPT-5.5 achieve only around 88% accuracy on the most challenging calls. Reasoning has a significant effect on performance, with reasoning-enabled models generally scoring higher at the cost of increased latency. Scores have improved across model generations, but no model has reached a perfect score, so continued development and adaptation remain necessary for reliable deployment in live environments.
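The two axes named in the summary can be illustrated with a small scoring sketch. This is not Retell's implementation. The `GradedCase` schema, the majority-vote aggregation across the grading panel, and the assumption that every case is graded on both axes are all hypothetical choices made for illustration, since the summary does not say how panel verdicts are combined.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GradedCase:
    """One benchmark case with per-grader verdicts (hypothetical schema)."""
    hallucination_votes: List[bool]  # True = this grader flagged a hallucination
    tool_call_ok_votes: List[bool]   # True = this grader judged the tool call correct


def majority(votes: List[bool]) -> bool:
    """Strict majority of True votes; ties count as False (an assumption)."""
    return sum(votes) * 2 > len(votes)


def score(cases: List[GradedCase]) -> Dict[str, float]:
    """Compute both rates as fractions of all cases."""
    if not cases:
        raise ValueError("need at least one graded case")
    n = len(cases)
    # A case counts as non-hallucinated when the panel majority did NOT flag it.
    clean = sum(1 for c in cases if not majority(c.hallucination_votes))
    # A case counts as tool-call correct when the panel majority approved it.
    correct = sum(1 for c in cases if majority(c.tool_call_ok_votes))
    return {
        "non_hallucination_rate": clean / n,
        "tool_call_correctness_rate": correct / n,
    }


# Made-up verdicts from a three-grader panel, for illustration only.
example_cases = [
    GradedCase([False, False, False], [True, True, True]),
    GradedCase([True, True, False], [True, True, False]),
    GradedCase([False, True, False], [False, False, True]),
    GradedCase([False, False, False], [False, False, True]),
]
example_scores = score(example_cases)
```

Majority voting is only one way to combine a panel. Requiring unanimity would trade recall of failures for precision, and the actual benchmark may use a different rule.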

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Voice AI	9	3,084	268	57	-11%
LLM	5	6,196	1,155	243	-32%
AI Agents	1	6,005	1,359	264	+22%
