
Review - JUST: Joint Unsupervised and Supervised Training For Multilingual ASR

Blog post from AssemblyAI

Post Details
Company: AssemblyAI
Date Published: -
Author: Luka Chkhetiani
Word Count: 717
Language: English
Hacker News Points: -
Summary

The paper "JUST: Joint Unsupervised and Supervised Training for Multilingual ASR" presents a novel Wav2Vec2-inspired pre-training technique for multilingual automatic speech recognition (ASR). JUST uses a five-stage modeling architecture with three stage-level loss functions that combine unsupervised and supervised objectives. The approach achieves a 32% performance improvement over the first-stage Wav2Vec2 XLSR network in low-resource-language ASR settings. Key findings include the use of contrastive, MLM (Masked Language Modelling), and RNN-T losses for joint pre-training on audio-text pairs across multiple languages, which extracts more useful information, generalizes better, and yields more robust contextualized token prediction. JUST outperforms Wav2Vec2 while using only the MLS dataset for pre-training, demonstrating its effectiveness in multilingual ASR tasks with smaller data requirements.
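The joint objective described above, combining unsupervised (contrastive, MLM) and supervised (RNN-T) losses, can be sketched roughly as a weighted sum. The function below is an illustrative assumption, not the paper's actual formulation: the loss values and weighting coefficients are hypothetical placeholders standing in for the real per-stage loss computations.

```python
def joint_loss(contrastive_loss: float,
               mlm_loss: float,
               rnnt_loss: float,
               w_contrastive: float = 1.0,
               w_mlm: float = 1.0,
               w_rnnt: float = 1.0) -> float:
    """Weighted sum of unsupervised (contrastive, MLM) and supervised
    (RNN-T) losses. The weights here are hypothetical defaults, not
    values taken from the JUST paper."""
    return (w_contrastive * contrastive_loss
            + w_mlm * mlm_loss
            + w_rnnt * rnnt_loss)

# Example with made-up per-stage loss values and equal weighting:
total = joint_loss(contrastive_loss=2.5, mlm_loss=1.8, rnnt_loss=3.1)
```

Training on such a combined objective is what lets the model learn from both unlabeled audio (via the contrastive and MLM terms) and paired audio-text data (via the RNN-T term) in a single pre-training run.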