Named Entity Recognition for Voice: Extracting Structure from Transcripts
Blog post from Deepgram
Named Entity Recognition (NER) on voice transcripts is harder than NER on written text because Automatic Speech Recognition (ASR) output often lacks capitalization and punctuation, the formatting cues that NER models rely on. Pipeline approaches, LLM-based extraction, and joint audio-to-entity models have all advanced, but ASR error propagation still degrades accuracy, particularly for domain-specific entities. Improving NER therefore starts with the transcript: better Speech-to-Text (STT) accuracy, smart formatting, and keyterm prompting. For real-time applications, the ASR-then-NER pipeline remains the most feasible architecture, while batch processing benefits from LLM-based approaches, which are more accurate on rare entities. It is also critical to measure entity-level Word Error Rate (WER) rather than only aggregate WER, because transcription errors concentrated in entity spans hurt NER performance disproportionately.
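To make the difference between aggregate and entity-level WER concrete, here is a minimal sketch in plain Python. The post does not define entity-level WER precisely, so this example assumes a simplified version: align the reference and hypothesis with word-level edit distance, then count the share of reference words inside entity spans that the alignment did not match exactly. The function names (`align_errors`, `aggregate_wer`, `entity_error_rate`), the sample sentence, and the hand-labeled entity spans are all hypothetical and for illustration only.

```python
from typing import List, Tuple


def align_errors(ref: List[str], hyp: List[str]) -> Tuple[int, List[bool]]:
    """Levenshtein-align ref to hyp at the word level.

    Returns (edit_distance, ref_ok), where ref_ok[i] is True when
    reference word i was matched exactly in the chosen alignment.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back through the table to see which reference words survived.
    ref_ok = [False] * n
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if ref[i - 1] == hyp[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            ref_ok[i - 1] = cost == 0
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1  # reference word deleted
        else:
            j -= 1  # extra hypothesis word inserted
    return dist[n][m], ref_ok


def aggregate_wer(ref: List[str], hyp: List[str]) -> float:
    """Standard WER: edit distance divided by reference length."""
    distance, _ = align_errors(ref, hyp)
    return distance / len(ref)


def entity_error_rate(
    ref: List[str], hyp: List[str], entity_spans: List[Tuple[int, int]]
) -> float:
    """Share of reference words inside entity spans that were not matched.

    entity_spans holds (start, end) word indices into ref, end exclusive.
    This is an assumed simplification of entity-level WER.
    """
    _, ref_ok = align_errors(ref, hyp)
    entity_idx = [i for start, end in entity_spans for i in range(start, end)]
    missed = sum(1 for i in entity_idx if not ref_ok[i])
    return missed / len(entity_idx)


# Hypothetical example: one drug name is mis-transcribed, everything else is right.
ref = "please refill my metoprolol prescription at the walgreens on fifth street".split()
hyp = "please refill my metro pearl prescription at the walgreens on fifth street".split()
spans = [(3, 4), (7, 8)]  # hand-labeled: "metoprolol", "walgreens"

print(f"aggregate WER:     {aggregate_wer(ref, hyp):.2f}")             # 0.18
print(f"entity error rate: {entity_error_rate(ref, hyp, spans):.2f}")  # 0.50
```

In this toy case the transcript looks fine by aggregate WER (2 edits over 11 words), yet half of the entity words are wrong, which is the pattern the summary describes: a small number of errors that land on entity spans can dominate NER quality.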