
How speech recognition errors compound in production voice agents

Blog post from AssemblyAI

Post Details
Company: AssemblyAI
Date Published:
Author: Devon Malloy
Word Count: 2,798
Language: English
Hacker News Points: -
Summary

Standard benchmarks for speech recognition report word error rate (WER), which often fails to capture what matters for production voice agents: entity accuracy, meaning whether specific values such as names, account numbers, and medication names are transcribed exactly. When a voice agent mishears one of these values, downstream systems act on the wrong information, and the error compounds across conversation turns. The post therefore argues for evaluating speech-to-text (STT) models on missed entity rate rather than WER, because WER does not show whether the exact values an agent needs were captured. Voice agent builders rank STT accuracy as the most important factor, above latency and cost, because transcript quality directly affects the reliability of downstream processes. AssemblyAI's Universal-3 Pro Streaming model addresses this with capabilities such as domain promptability and keyterms boosting, which let the model adapt to specific vocabularies and contexts, improving entity accuracy and reducing errors that could disrupt service in high-stakes settings such as healthcare and finance.
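
To make the gap between the two metrics concrete, here is a minimal Python sketch. It is not AssemblyAI's evaluation code: the function names, the example utterance, and the definition of missed entity rate (the share of reference entities that do not appear verbatim in the transcript) are assumptions made for illustration, since the summary does not define the metric precisely.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Classic Levenshtein dynamic programme, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def missed_entity_rate(entities: List[str], hypothesis: str) -> float:
    """Assumed definition: fraction of reference entities that are absent
    (as an exact, whole-word match) from the transcript."""
    if not entities:
        raise ValueError("entities must not be empty")
    padded = " " + " ".join(hypothesis.lower().split()) + " "
    missed = [e for e in entities
              if " " + " ".join(e.lower().split()) + " " not in padded]
    return len(missed) / len(entities)


# Hypothetical example: one wrong word out of eight, but it is the drug name.
reference = "please refill my metoprolol prescription for account 48213"
hypothesis = "please refill my metformin prescription for account 48213"
entities = ["metoprolol", "48213"]

wer = word_error_rate(reference, hypothesis)
mer = missed_entity_rate(entities, hypothesis)
print(f"WER: {wer:.3f}")                 # WER: 0.125
print(f"Missed entity rate: {mer:.3f}")  # Missed entity rate: 0.500
```

In this made-up case the transcript looks strong by WER (12.5%), yet half of the values a downstream system would act on are wrong, which is the discrepancy the summary describes.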