
How speech models fail where it matters the most and what to do about it

Blog post from Together AI

Post Details
Company: Together AI
Date Published: -
Author: Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
Word Count: 983
Language: English
Hacker News Points: -
Summary

Automatic speech recognition (ASR) systems have reached near-human parity on general benchmarks, yet they struggle to transcribe short, high-stakes utterances such as street names, particularly when spoken by people whose primary language is not English. Evaluations on the SF Streets and US Streets datasets show that models from OpenAI, Deepgram, Google, and Microsoft average a 39% error rate on street names, with an 18-point accuracy gap between primary-English and non-primary-English speakers. This gap has practical costs, such as longer taxi rides and higher fares. To address it, the researchers developed a synthetic data generation technique based on cross-lingual style transfer that improves ASR performance by up to 60% with minimal data, demonstrating that model robustness can be improved without extensive data collection. The SF Streets and US Streets datasets are being released to encourage further research into ASR reliability in diverse linguistic environments.
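The street-name error rate discussed above is the kind of metric typically computed as word error rate (WER) over reference/hypothesis pairs. As a rough illustration only (the function and example transcripts below are hypothetical, not drawn from the SF Streets or US Streets datasets), such an evaluation might look like:

```python
# Hypothetical sketch: word error rate (WER) for short utterances such as
# street names. Example pairs are illustrative, not from the post's datasets.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # edits[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    edits = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        edits[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        edits[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = edits[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = edits[i - 1][j] + 1
            insertion = edits[i][j - 1] + 1
            edits[i][j] = min(substitution, deletion, insertion)
    return edits[len(ref)][len(hyp)] / max(len(ref), 1)

# Average WER over a tiny illustrative evaluation set.
pairs = [
    ("Valencia Street", "Valencia Street"),  # exact match -> WER 0.0
    ("Guerrero Street", "Carrera Street"),   # one word substituted -> WER 0.5
]
avg_wer = sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)
```

Short utterances make this metric unforgiving: with only one or two words per reference, a single misrecognized street name yields a 50-100% error on that utterance, which is consistent with the high error rates the post reports for this setting.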