Transcribing heavy accents: why ASR struggles, and how model scale helps
Blog post from AssemblyAI
Automatic Speech Recognition (ASR) systems struggle to transcribe heavy accents accurately because of limits in training data and model capacity, not because accented speakers are unclear. When a model has seen too little training data for a given accent, it defaults to the more common pronunciations it already knows. Traditional fixes, such as accent-specific models or pronunciation dictionaries, have proven ineffective: they require knowing the accent in advance and do not account for the diversity of real-world pronunciation. Scaling helps where those fixes do not. Larger models such as Universal-3 Pro, which uses an LLM-based decoder, combine more parameters with more diverse training data, so they can keep several candidate pronunciations in play and use linguistic context to resolve ambiguities. The result, demonstrated by a lower Word Error Rate (WER) on the CommonVoice dataset, is more accurate transcription of varied accents without pre-selecting an accent type. Techniques like keyterms and general prompting improve accuracy further by anchoring the model on predictable vocabulary and supplying useful context, making ASR systems more robust to the accent variation that occurs naturally in global audio.
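For readers unfamiliar with the metric, WER is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not the evaluation code behind the CommonVoice result. The function name `word_error_rate`, the lowercase-and-split normalization, and the example sentences are all assumptions made for illustration; real benchmarks usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and splitting
    on whitespace. Punctuation is left untouched.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (word missing from hypothesis)
                curr[j - 1] + 1,                  # insertion (extra word in hypothesis)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one substitution ("sat" -> "sit") and one
    # deletion (the second "the") against a six-word reference.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")  # WER: 0.333
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is worth keeping in mind when comparing scores across datasets.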