Multilingual Speech-to-Text: Achieving Native-Level Accuracy in 60+ Languages
Blog post from Agora
In this conversation, Klemen Simonic, Founder and CEO of Soniox, describes a shift in how speech AI is built: aiming for native-level performance in more than 60 languages rather than incremental gains in English. Where traditional models prioritize English, Soniox applies self-supervised learning to vast amounts of audio data to train a single universal model that understands many languages fluently. This addresses the "Global Entity" problem: the model learns a concept in one language and applies it in the others.

The approach is contrasted with OpenAI's Whisper. Soniox emphasizes low-latency, streaming ASR (automatic speech recognition) and minimizing hallucinations, both of which are critical in fields such as medicine and law. Its real-time translation model significantly reduces latency, which keeps conversation flowing naturally and matters in global business, accessibility, and healthcare.

The discussion also touches on the future of AI. Klemen envisions self-evolving systems that move toward Artificial General Intelligence and understand context rather than merely transcribing speech, and he stresses the importance of consistent performance across diverse real-world scenarios.