Factors affecting the accuracy of speech-to-text transcripts
Blog post from Gladia
Speech-to-text (STT) accuracy in production often falls short because of the gap between controlled studio conditions and the complex, multilingual, overlapping speech of real users. Four main factors drive this gap: audio quality, speaker traits, gaps in domain vocabulary, and the diversity of the model's training data. Word Error Rate (WER) is a key metric for assessing transcription quality, but it does not fully capture production risk, which also depends on semantic accuracy and on Diarization Error Rate (DER). Solaria-1, a benchmarked model, shows significantly lower WER and DER than alternatives, which underscores the importance of evaluating under real-world conditions. Models are challenged by input audio issues such as sample rate and codec choice, by speaker diversity including accents and code-switching, and by domain-specific vocabulary gaps. Custom vocabulary injection and more diverse training data can mitigate these challenges. Evaluating an STT system therefore requires building a golden dataset that reflects actual conditions of use, so that measured performance matches what users will experience, particularly in contact centers and other conversational environments.
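
To make the WER metric concrete, here is a minimal sketch of how it could be computed over a small golden dataset. It assumes the standard definition of WER (word-level substitutions, deletions, and insertions divided by the number of reference words) and uses lowercasing plus whitespace splitting as a deliberately simple normalization. The function names and the sample transcripts are hypothetical and chosen only for illustration; they do not come from any vendor's tooling or benchmark.

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (edit_distance, reference_word_count) at the word level.

    Assumption: normalization is only lowercasing and whitespace splitting.
    Real evaluations usually also normalize punctuation, numbers, etc.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        errors, words = word_errors(reference, hypothesis)
        total_errors += errors
        total_words += words
    if total_words == 0:
        raise ValueError("golden dataset contains no reference words")
    return total_errors / total_words


# Hypothetical golden dataset: (human reference, model output) pairs.
golden = [
    ("the cat sat on the mat", "the cat sat on mat"),        # one deletion
    ("please refund my order", "please refund the order"),   # one substitution
]

print(corpus_wer(golden))  # 0.2
```

Both utterances above contribute exactly one error each, yet dropping "the" is harmless while a substituted word in a customer request may not be. That equal weighting is one reason WER alone does not capture semantic accuracy, and it says nothing about who spoke when, which is what DER measures.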