
How speech models fail where it matters the most and what to do about it

Blog post from Together AI

Post Details
Company: Together AI
Date Published: -
Author: Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
Word Count: 983
Language: English
Hacker News Points: -
Summary

Automatic speech recognition (ASR) systems have reached near-human parity on general benchmarks, yet they struggle to transcribe short, high-stakes utterances such as street names, particularly when spoken by people whose primary language is not English. Evaluations on the SF Streets and US Streets datasets show that models from OpenAI, Deepgram, Google, and Microsoft average a 39% error rate on street names, with an 18-point accuracy gap between primary-English and non-primary-English speakers. This gap has practical costs, such as longer taxi rides and higher fares. To address it, the researchers developed a synthetic data generation technique based on cross-lingual style transfer that improves ASR performance by up to 60% with minimal data, demonstrating that model robustness can be improved without extensive data collection. The SF Streets and US Streets datasets are being released to encourage further research into ASR reliability in diverse linguistic environments.
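The street-name error rate discussed above is the kind of metric typically computed as word error rate (WER) over reference/hypothesis pairs. As a rough illustration only (the function and example transcripts below are hypothetical, not drawn from the SF Streets or US Streets datasets), such an evaluation might look like:

```python
# Hypothetical sketch: word error rate (WER) for short utterances such as
# street names. Example pairs are illustrative, not from the post's datasets.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # edits[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    edits = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        edits[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        edits[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = edits[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = edits[i - 1][j] + 1
            insertion = edits[i][j - 1] + 1
            edits[i][j] = min(substitution, deletion, insertion)
    return edits[len(ref)][len(hyp)] / max(len(ref), 1)

# Average WER over a tiny illustrative evaluation set.
pairs = [
    ("Valencia Street", "Valencia Street"),  # exact match -> WER 0.0
    ("Guerrero Street", "Carrera Street"),   # one word substituted -> WER 0.5
]
avg_wer = sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)
```

Short utterances make this metric unforgiving: with only one or two words per reference, a single misrecognized street name yields a 50-100% error on that utterance, which is consistent with the high error rates the post reports for this setting.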