Multilingual speech-to-text on your laptop: NVIDIA's Nemotron 3.5 ASR
Blog post from LiveKit
NVIDIA's Nemotron 3.5 ASR is a 600-million-parameter streaming speech recognition model that transcribes 40 language-locales and is small enough to run on a laptop. A language-ID prompt directs decoding, so a single set of weights handles languages such as English, Spanish, and Japanese. End-of-utterance latency is under 100 ms, which lets the transcript keep pace with the speaker. This post covers how to run the model in NeMo, behind an OpenAI-compatible server, and in LiveKit voice agents, with a focus on real-time processing and local execution on hardware such as CPUs and Apple Silicon, which removes the cloud dependency and reduces cost. The main demo is a multilingual teleprompter: because the model streams with low latency, the script scrolls in sync with the user's voice, and a matching algorithm keeps the scroll position accurate and responsive. Among multilingual streaming models, it is notable for combining this speed with local execution.
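To make the teleprompter idea concrete, here is a minimal sketch of one way a script follower could work. It is not the post's actual matching algorithm, which this summary does not describe. The `ScriptFollower` class, its `window` and `tail` parameters, and the "at least two matching words" rule are all assumptions chosen for illustration. The approach is to take the last few words of each partial transcript, slide them over a short look-ahead window of the script, and move the cursor to the position where the most words line up.

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase the text, drop punctuation, and split it into words."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.findall(r"\w+", text)


class ScriptFollower:
    """Hypothetical helper that tracks how far into a script the speaker has read."""

    def __init__(self, script: str, window: int = 12, tail: int = 4):
        self.words = normalize(script)
        self.window = window  # how many script words ahead of the cursor to search
        self.tail = tail      # how many trailing transcript words to match
        self.cursor = 0       # index of the next script word we expect to hear

    def update(self, transcript: str) -> int:
        """Advance the cursor from the latest partial transcript and return it."""
        spoken = normalize(transcript)[-self.tail:]
        if not spoken:
            return self.cursor

        best_end, best_score = self.cursor, 0
        limit = min(len(self.words), self.cursor + self.window)

        # Try every candidate end position inside the look-ahead window.
        for end in range(self.cursor + 1, limit + 1):
            candidate = self.words[max(0, end - len(spoken)):end]
            # Right-align both sequences and count the positions that agree,
            # so one misrecognized word does not break the whole match.
            score = sum(
                a == b for a, b in zip(reversed(candidate), reversed(spoken))
            )
            if score > best_score:
                best_end, best_score = end, score

        # Assumed threshold: need two matching words (one if only one was spoken),
        # which keeps the cursor still during off-script speech.
        if best_score >= min(2, len(spoken)):
            self.cursor = best_end
        return self.cursor
```

With the script "Hello everyone, welcome to the show. Today we talk about speech recognition.", feeding the partial transcript "hello everyone" moves the cursor to 2, and a later partial containing the misrecognition "shoe" for "show" still advances it because the surrounding words agree. Splitting on `\w+` assumes space-delimited text. A language written without spaces, such as Japanese, would need character-level or tokenizer-based matching instead.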