Building real-time multilingual ASR with code-switching
Blog post from Gladia
Bruno Hays, Lead ML Speech Engineer at Gladia, developed a novel approach to real-time multilingual automatic speech recognition (ASR) with code-switching: instead of relying on one large multilingual model, a lightweight, modular ensemble routes audio between small, specialized models. The system is fully open source and combines three components: Voice Activity Detection (VAD) to identify speech boundaries, Streaming Zipformer models for ASR, and Language Identification (LID) to detect language switches. Its Asynchronous Rollback Pipeline reduces language lag by transcribing audio immediately with the currently active ASR engine, monitoring for language changes, and revising the transcription when a change is detected. On inter-utterance code-switching (the language changes between utterances), the approach outperforms larger models, reaching a 13% Word Error Rate (WER). On intra-utterance switching (the language changes within an utterance), it falls behind cloud APIs, though it still performs better than some local models. The results suggest that small, specialized models combined with intelligent routing could offer a more efficient path to local, on-device multilingual ASR.
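To make the rollback idea concrete, here is a minimal Python sketch of the control flow, not the author's implementation. Everything in it is a hypothetical stand-in: `make_engine` fakes a per-language ASR model, `toy_lid` fakes language identification with a majority vote, and each "chunk" is a `(language, word)` pair rather than real audio. LID latency is modelled by `lid_min_chunks` (the number of chunks LID needs before it can answer) instead of real concurrency, and utterance boundaries are passed in directly where a VAD would normally supply them.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical toy representation: (true_language, spoken_word).
Chunk = Tuple[str, str]


def make_engine(lang: str) -> Callable[[Chunk], str]:
    """Toy ASR engine: correct on its own language, '<?>' on any other."""
    def transcribe(chunk: Chunk) -> str:
        true_lang, word = chunk
        return word if true_lang == lang else "<?>"
    return transcribe


def toy_lid(chunks: List[Chunk]) -> str:
    """Toy LID: majority language over the buffered chunks."""
    return Counter(lang for lang, _ in chunks).most_common(1)[0][0]


@dataclass
class RollbackPipeline:
    engines: Dict[str, Callable[[Chunk], str]]
    active: str                      # language of the engine currently in use
    lid_min_chunks: int = 2          # chunks LID needs before it can decide
    committed: List[str] = field(default_factory=list)
    buffer: List[Chunk] = field(default_factory=list)
    provisional: List[str] = field(default_factory=list)
    rollbacks: int = 0

    def push(self, chunk: Chunk) -> None:
        # 1. Transcribe right away with whatever engine is active.
        self.buffer.append(chunk)
        self.provisional.append(self.engines[self.active](chunk))
        # 2. Once LID has enough audio, check for a language change.
        if len(self.buffer) >= self.lid_min_chunks:
            detected = toy_lid(self.buffer)
            if detected != self.active:
                # 3. Roll back: redo the buffered utterance with the new engine.
                self.active = detected
                self.provisional = [self.engines[detected](c) for c in self.buffer]
                self.rollbacks += 1

    def end_utterance(self) -> None:
        """Called at a speech boundary (a VAD would signal this)."""
        self.committed.extend(self.provisional)
        self.buffer = []
        self.provisional = []


def transcribe_stream(utterances: List[List[Chunk]], start_lang: str) -> RollbackPipeline:
    pipeline = RollbackPipeline(
        engines={"en": make_engine("en"), "fr": make_engine("fr")},
        active=start_lang,
    )
    for utterance in utterances:
        for chunk in utterance:
            pipeline.push(chunk)
        pipeline.end_utterance()
    return pipeline


if __name__ == "__main__":
    demo = [
        [("en", "hello"), ("en", "world")],
        [("fr", "bonjour"), ("fr", "le"), ("fr", "monde")],
    ]
    result = transcribe_stream(demo, start_lang="en")
    print(" ".join(result.committed), "| rollbacks:", result.rollbacks)
```

In this toy run the second utterance starts out transcribed by the English engine, and the provisional text is replaced once the fake LID has seen enough chunks to report French. A production pipeline would run LID concurrently with decoding and operate on audio frames; the sketch only illustrates the order of operations.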