Benchmarking LLMs for Voice Agent Use Cases
Blog post from Daily
A new benchmark evaluates large language models (LLMs) and speech-to-speech models in long, multi-turn conversations of the kind voice agents handle. It measures tool calling, instruction following, and factual grounding, reflecting the rapid adoption of voice agents in complex enterprise settings. The best-performing models now score 100% on the benchmark, but their latency is too high for practical voice agent use. Most production voice agents currently rely on text-mode LLMs such as GPT-4.1 and Gemini 2.5 Flash, although newer models such as AWS Nova 2 Pro also show promising results. The benchmark highlights a capability gap between text-mode and speech-to-speech models, though open-weights models such as Ultravox are closing that gap. The project aims to provide reproducible, open-source tools for evaluating voice AI, with a focus on latency and on robust performance across real-world scenarios. Benchmark design and execution remain challenging, and the initiative invites the voice AI community to contribute to refining and expanding the testing methodology.
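To make the accuracy-versus-latency tension concrete, here is a minimal sketch of how per-turn results might be rolled up into a summary. It is not the benchmark's actual code: the `TurnResult` fields, the `summarize` function, and the 1.0-second latency budget are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TurnResult:
    """Hypothetical per-turn outcome; field names are illustrative only."""
    tool_call_ok: bool      # did the model call the right tool with the right arguments?
    instruction_ok: bool    # did the reply follow the system instructions?
    grounding_ok: bool      # was the reply consistent with the provided facts?
    latency_s: float        # response latency for this turn, in seconds


def summarize(turns, latency_budget_s=1.0):
    """Aggregate per-turn results into pass rates and a latency check.

    latency_budget_s is an assumed threshold, not a figure from the post.
    """
    if not turns:
        raise ValueError("need at least one turn")
    n = len(turns)
    # A turn passes overall only if all three checks pass.
    passed = sum(
        t.tool_call_ok and t.instruction_ok and t.grounding_ok for t in turns
    )
    med = median([t.latency_s for t in turns])
    return {
        "tool_calling": sum(t.tool_call_ok for t in turns) / n,
        "instruction_following": sum(t.instruction_ok for t in turns) / n,
        "factual_grounding": sum(t.grounding_ok for t in turns) / n,
        "pass_rate": passed / n,
        "median_latency_s": med,
        "within_latency_budget": med <= latency_budget_s,
    }
```

Under this sketch, a model can reach a `pass_rate` of 1.0 and still fail `within_latency_budget`, which is the situation the post describes for the top-scoring models.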