Company
Date Published
Author
Julia Strout
Word count
1699
Language
English
Hacker News points
None

Summary

Deepgram's exploration of keyterm boosting in speech-to-text (STT) models covers several strategies for improving transcription accuracy, particularly for proper nouns that are unlikely to appear in training data. In moving from its Nova-3 model to the Flux model, the company optimized for real-time streaming by reducing computational cost while maintaining accuracy. Keyterm boosting means supplying the system with words it should expect, so that those words are more likely to be transcribed correctly. Methods range from modifying the model's outputs to using a separate re-ranker model or text-based post-processing. Flux goes further by making keyterms part of training, so the model learns optimal boosting behavior itself, which eases the burden on customers and improves efficiency. This evolution addresses challenges such as permutation invariance and computational cost, and it achieves a significant reduction in per-token decoding cost and memory usage while keeping performance comparable to Nova-3.
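
To make the simplest of these strategies concrete, here is a minimal sketch of text-based post-processing: words in a finished transcript that closely resemble a supplied keyterm are replaced by that keyterm. It is an illustration only and is not Deepgram's implementation. The function name `boost_keyterms`, the `cutoff` threshold, and the use of Python's standard `difflib` for fuzzy matching are all assumptions made for this example.

```python
import difflib


def boost_keyterms(transcript: str, keyterms: list[str], cutoff: float = 0.8) -> str:
    """Replace words that closely resemble a keyterm with that keyterm.

    Hypothetical helper for illustration: matching is per word,
    case-insensitive, and ignores punctuation handling entirely.
    """
    # Map lowercase forms back to the caller's preferred spelling.
    by_lowercase = {term.lower(): term for term in keyterms}
    candidates = list(by_lowercase)

    corrected = []
    for word in transcript.split():
        # get_close_matches returns at most n candidates whose similarity
        # ratio to the word is at least `cutoff`, best match first.
        match = difflib.get_close_matches(word.lower(), candidates, n=1, cutoff=cutoff)
        corrected.append(by_lowercase[match[0]] if match else word)
    return " ".join(corrected)


if __name__ == "__main__":
    print(boost_keyterms("please call deepgrum about flux", ["Deepgram", "Flux"]))
```

A post-processor like this only sees text, so it cannot use acoustic evidence and depends on a hand-tuned threshold. That kind of tuning is the customer burden the summary says Flux reduces by learning boosting behavior during training.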