Real-time STT latency benchmarks: what "fast enough" means for voice agents

Post Details

Company

AssemblyAI

Date Published

June 24, 2026

Author

Kelsey Foster

Word Count

2,204

Company Posts That Month

28

Language

English

Hacker News Points

-

Source URL

www.assemblyai.com/blog/real-time-stt-latency-benchmarks-voice-agents

Summary

When evaluating real-time speech-to-text (STT) models for voice agents, latency is a critical factor, but chasing the lowest possible latency alone can be misleading: a voice agent should not sacrifice correctness for speed. The target latency should match human conversational rhythm, roughly a 200-millisecond gap between turns, and it must cover the entire response loop: STT, large language model (LLM) processing, and text-to-speech (TTS) conversion. Universal-3.5 Pro Realtime, for example, offers a competitive word error rate while detecting end of turn in around 300 milliseconds, so that STT does not become the bottleneck in the response loop. Evaluations should measure time to complete turn (TTCT) and report P95 latency rather than only the median (P50), because P95 reflects the real-world conditions in which roughly one in twenty interactions is delayed. A model that balances speed and accuracy lets voice agents deliver a more reliable and user-friendly experience.
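
To make the P50 versus P95 distinction concrete, here is a minimal Python sketch. It assumes a nearest-rank percentile definition, and the `ttct_ms` values are hypothetical illustrative numbers, not measurements from the post or from any model. The `percentile` helper is also an assumption introduced for this example.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample such that at least pct percent of
    all samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical time-to-complete-turn samples in milliseconds
# (illustrative only, not benchmark data).
ttct_ms = [280, 290, 295, 300, 300, 305, 310, 310, 315, 320,
           320, 325, 330, 335, 340, 345, 350, 360, 850, 900]

p50 = percentile(ttct_ms, 50)  # the typical turn
p95 = percentile(ttct_ms, 95)  # the slow tail that users still notice

print(f"P50 = {p50} ms, P95 = {p95} ms")
```

In this made-up sample the median looks comfortable while the P95 value is more than twice as large, which is the kind of gap a median-only benchmark would hide.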

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Real-time	38	5,457	1,338	238	-5%
Voice AI	17	2,232	214	48	-36%
LLM	13	5,172	1,006	220	-43%