
Benchmarking STT for Voice Agents

Blog post from Daily

Post Details
Company
Date Published
Author
Mark Backman
Word Count
3,090
Language
English
Summary

A new benchmark evaluates Speech-to-Text (STT) providers on transcription latency and semantic accuracy for real-time voice agents. It measures how quickly and how accurately a voice agent can turn spoken input into text for a language model, and argues that accuracy should be judged by whether the transcript conveys the user's intent rather than by word-for-word fidelity. The benchmark ran a range of STT services on real-world audio samples, highlighted the trade-offs between speed and accuracy, and introduced Semantic Word Error Rate (Semantic WER) as a better measure of transcription quality for voice AI applications. Latency varied significantly among services, while the overall accuracy of STT providers has improved dramatically; three services (Deepgram, Soniox, and Speechmatics) stood out for balancing speed and accuracy. The post emphasizes P95 latency, the value that 95% of transcriptions complete within, which reflects the slow tail of the user experience rather than the average. It also discusses finalization support and turn detection as considerations for optimizing voice agent interactions. The benchmark tool is available as an open-source utility so developers can test and improve their STT configurations.
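To make the two metrics in the summary concrete, here is a minimal sketch in Python. It is not the benchmark's code: the function names, the sample latencies, and the example sentences are all made up for illustration. The first function computes a nearest-rank percentile, showing how P95 exposes a slow tail that the mean hides. The second computes classic word-level WER (not Semantic WER), to show why a plain word count can misjudge errors.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Multiply before dividing so integer inputs avoid float rounding surprises.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def word_error_rate(reference, hypothesis):
    """Classic WER: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the words of ref seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical latencies (ms): most transcripts are fast, a few are slow.
latencies_ms = [250] * 18 + [1200] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
print(f"mean={mean_ms} ms, p95={p95_ms} ms")

# Hypothetical transcripts: the first error is harmless, the second changes intent.
harmless = word_error_rate("turn on the kitchen lights", "turn on the kitchen light")
harmful = word_error_rate("i want to book a flight", "i want to cook a flight")
print(f"harmless WER={harmless:.3f}, harmful WER={harmful:.3f}")
```

With these made-up numbers the mean latency looks comfortable while P95 shows the delay that one in twenty turns would hit. Plain WER also scores the harmless "lights"/"light" slip as worse than the meaning-changing "book"/"cook" error, which is the kind of mismatch a semantic variant of WER is meant to address.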