Borealis: open data, code, and weights recipe for training an Audio LLM
Blog post from Hugging Face
Borealis is an open-source audio-language model developed by VikhrModels for Russian and English. It aims to extend audio understanding beyond transcription: summarizing lengthy recordings, answering questions about their content, and interpreting tone and emotion.

In terms of architecture, Borealis combines a frozen Whisper large-v3 encoder, a 4x downsampler, an adapter that bridges the audio and text sides, and a Qwen3-4B language model fine-tuned with LoRA, for a total of approximately 5 billion parameters.

Training drew on multiple datasets, and two findings stand out. First, native-language data matters more than multilingual data: the model shows strong cross-lingual transfer, but native audio data still performs better. Second, text data plays a nuanced role, and too much of it can degrade performance. Noise and complex acoustic environments remain a challenge, and the post highlights the need for separate tuning on noisy audio.

The post also offers practical insights into serving the model and integrating it with transformer models. It emphasizes the importance of pretraining and lists the remaining limitations: audio longer than 30 seconds, heavy noise, and offline-only (non-streaming) inference.
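
To make the role of the 4x downsampler in the architecture concrete, here is a minimal sketch in plain Python. The summary does not say how Borealis implements downsampling, so frame stacking (concatenating every four consecutive encoder frames into one longer vector) is an assumption chosen for illustration; the function names are hypothetical. The figure of 1500 encoder frames per 30-second window is Whisper's standard encoder output length.

```python
# Sketch only: frame stacking is an ASSUMED downsampling scheme, not
# necessarily the one Borealis uses. All names here are hypothetical.

WHISPER_WINDOW_SECONDS = 30    # Whisper processes audio in 30 s windows
WHISPER_ENCODER_FRAMES = 1500  # encoder output positions per 30 s window
DOWNSAMPLE_FACTOR = 4          # the four-times downsampler from the text


def stack_frames(frames, factor=DOWNSAMPLE_FACTOR):
    """Concatenate each group of `factor` consecutive frames into one vector.

    Trailing frames that do not fill a complete group are dropped.
    """
    usable = len(frames) - len(frames) % factor
    stacked = []
    for start in range(0, usable, factor):
        merged = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


def audio_tokens_per_window(encoder_frames=WHISPER_ENCODER_FRAMES,
                            factor=DOWNSAMPLE_FACTOR):
    """Number of audio positions handed to the language model per window."""
    return encoder_frames // factor


if __name__ == "__main__":
    # Toy input: 9 frames of dimension 2 (real encoder frames are much wider).
    toy_frames = [[float(i), float(i) + 0.5] for i in range(9)]
    stacked = stack_frames(toy_frames)
    print(len(stacked), len(stacked[0]))  # 2 groups, each of width 8
    print(audio_tokens_per_window())      # 375 positions per 30 s window
```

Under this assumption, a 30-second window shrinks from 1500 encoder frames to 375 positions before reaching the language model, which keeps the audio prefix short relative to the text context. In a real model a learned projection (the adapter) would then map each stacked vector to the language model's embedding size; that step is omitted here.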