Borealis — open data, code, weights recipe for training Audio LLM

Blog post from HuggingFace

Post Details
Author: Wortega
Word Count: 2,303
Summary

Borealis is an open-source audio-language model developed by VikhrModels. It handles both Russian and English and aims to take audio understanding beyond transcription: it is designed to summarize lengthy recordings, answer questions about their content, and interpret tone and emotion.

Architecturally, Borealis uses a frozen Whisper large-v3 encoder for audio and Qwen3-4B, fine-tuned with LoRA, as the language model backbone. The two are connected through an adapter and a four-times downsampler, for a total of approximately 5 billion parameters.

Training drew on multiple datasets, and the post stresses two findings. First, the model shows strong cross-lingual transfer, but native-language audio data still performs better than multilingual data. Second, the role of text data is nuanced: excessive text data can degrade performance. Borealis also addresses challenges such as noise and complex acoustic environments, and the post highlights the need for separate tuning for noisy audio.

Finally, the post offers practical insights into serving and integrating with transformer models, emphasizes the importance of pretraining, and names the remaining challenges: audio longer than 30 seconds, heavy noise, and offline-only operation without streaming.
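To make the four-times downsampler described in the summary more concrete, here is a minimal sketch in plain Python. It rests on several assumptions: the summary does not say how Borealis downsamples, so frame stacking (concatenating consecutive encoder frames) is used here only as one common option; the names `stack_frames` and `audio_tokens` are hypothetical; and splitting longer audio into independent 30-second windows is a naive illustration, not the method from the post. The one fixed fact is that Whisper's encoder produces 1500 frames for each padded 30-second window.

```python
import math
from typing import List

WHISPER_WINDOW_SECONDS = 30       # Whisper pads or trims input to 30 s
WHISPER_FRAMES_PER_WINDOW = 1500  # encoder output length per window
DOWNSAMPLE_FACTOR = 4             # "four-times downsampler" from the summary


def stack_frames(frames: List[List[float]],
                 factor: int = DOWNSAMPLE_FACTOR) -> List[List[float]]:
    """Concatenate every `factor` consecutive frames into one wider frame.

    Assumption: frame stacking stands in for whatever downsampling
    Borealis actually uses. A trailing group shorter than `factor`
    is dropped.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [value for frame in frames[i:i + factor] for value in frame]
        for i in range(0, usable, factor)
    ]


def audio_tokens(duration_seconds: float,
                 factor: int = DOWNSAMPLE_FACTOR) -> int:
    """Audio positions passed to the language model.

    Assumption: audio longer than one window is cut into independent
    30 s windows, each encoded and downsampled separately.
    """
    windows = max(1, math.ceil(duration_seconds / WHISPER_WINDOW_SECONDS))
    return windows * (WHISPER_FRAMES_PER_WINDOW // factor)


if __name__ == "__main__":
    # Toy input: 8 frames of width 2 become 2 frames of width 8.
    toy = [[float(i), float(i) + 0.5] for i in range(8)]
    stacked = stack_frames(toy)
    print(len(stacked), len(stacked[0]))
    print(audio_tokens(30))
    print(audio_tokens(45))
```

Under these assumptions a single 30-second window costs 375 audio positions in the language model's context instead of 1500, and a 45-second clip needs two windows. In a real adapter the stacked frames would still need a learned projection to the language model's hidden size, which this sketch leaves out.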