Real-time STT latency benchmarks: what "fast enough" means for voice agents
Blog post from AssemblyAI
When evaluating real-time speech-to-text (STT) models for voice agents, latency is a critical factor, but chasing the lowest possible number can be misleading. Speed has to be balanced against accuracy so that the agent does not trade correctness for responsiveness. The target is set by human conversational rhythm, roughly a 200-millisecond gap between turns, and the latency budget has to account for the entire response loop: STT, large language model (LLM) processing, and text-to-speech (TTS) conversion. Universal-3.5 Pro Realtime, for example, offers a competitive word error rate while detecting end of turn in around 300 milliseconds, so that STT does not become the bottleneck in the loop. Evaluations should measure time to complete turn (TTCT) and report P95 latency rather than only the median (P50): P95 is the latency that one in twenty interactions exceeds, so it reflects the delays users actually hit under real-world conditions. A model that balances speed and accuracy lets a voice agent deliver a more reliable and user-friendly experience.
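To make the P50 versus P95 distinction concrete, here is a minimal sketch that computes both from a set of per-turn TTCT measurements and adds them to the rest of the response loop. The sample values, the LLM and TTS timings, and the helper names `percentile` and `loop_latency_ms` are all hypothetical and chosen only for illustration; they are not measurements of any real model. The percentile uses the simple nearest-rank definition, which is one of several common conventions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def loop_latency_ms(stt_ms, llm_ms, tts_ms):
    """Total response-loop latency as the sum of its three stages."""
    return stt_ms + llm_ms + tts_ms


# Hypothetical TTCT samples in milliseconds for 20 turns (illustrative only).
ttct_ms = [280, 290, 295, 300, 300, 305, 310, 310, 315, 320,
           320, 325, 330, 335, 340, 350, 360, 400, 900, 1500]

# Hypothetical fixed costs for the other two stages of the loop.
LLM_MS = 400
TTS_MS = 150

if __name__ == "__main__":
    p50 = percentile(ttct_ms, 50)
    p95 = percentile(ttct_ms, 95)
    print(f"STT TTCT  P50={p50} ms  P95={p95} ms")
    print(f"Full loop P50={loop_latency_ms(p50, LLM_MS, TTS_MS)} ms  "
          f"P95={loop_latency_ms(p95, LLM_MS, TTS_MS)} ms")
```

With these made-up samples the median is 320 ms while P95 is 900 ms: a benchmark that reports only P50 would hide the slow turns that one in twenty interactions run into.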