
Benchmarking LLMs for Voice Agent Use Cases

Blog post from Daily

Post Details
Company
Date Published
Author
Kwindla Hultman Kramer
Word Count
4,094
Language
English
Summary

A new benchmark evaluates large language models (LLMs) and speech-to-speech models in long, multi-turn conversations for voice agent applications. It assesses tool calling, instruction following, and factual grounding, reflecting the rapid growth of voice agent adoption in complex enterprise settings. The best-performing models now score 100% on the benchmark, but their latency makes them too slow for practical voice agent use. Most production voice agents currently rely on text-mode LLMs such as GPT-4.1 and Gemini 2.5 Flash, although newer models such as AWS Nova 2 Pro also show promising results. The benchmark highlights a capability gap between text-mode and speech-to-speech models, which open-weights models like Ultravox are closing. It aims to provide reproducible, open-source tools for evaluating voice AI, with a focus on improving latency and ensuring robust performance across real-world scenarios. Despite the challenges of benchmark design and execution, the initiative seeks to foster collaboration and innovation in the voice AI community and invites contributions that refine and expand its testing methodology.