Introducing Cohere-transcribe: state-of-the-art speech recognition
Blog post from Hugging Face
Cohere-transcribe-03-2026 is a newly launched 2-billion-parameter speech recognition model from CohereLabs. It is designed to deliver state-of-the-art accuracy across 14 enterprise-critical languages and is open-sourced on Hugging Face under an Apache 2.0 license. In English, the model outperforms existing proprietary and open-source competitors, taking the top spot on the Hugging Face Open ASR Leaderboard, and it shows comparable or superior performance in the other 13 languages. Built on an encoder-decoder transformer with cross-attention, the model dedicates over 90% of its parameters to the encoder, which keeps autoregressive inference compute minimal. Cohere-transcribe was trained on 0.5 million hours of curated audio and transcripts, supplemented with synthetic data, and uses a multilingual tokenizer with byte fallback to handle varied language inputs. A collaboration with vLLM supports efficient, scalable production deployment, achieving up to twice the throughput of similar models. Despite these strengths, the model is not specifically trained for code-switched audio, and it may require a noise gate or voice activity detection to avoid errors from non-speech sounds. Cohere-transcribe is a significant step in Cohere's efforts to enhance audio experiences on its North enterprise platform, and the model is available for experimentation via Hugging Face and Cohere's API.
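To make the noise-gate recommendation concrete, here is a minimal sketch of an energy-based gate that silences low-level frames before audio is passed to a transcription model. It is an illustration only and not Cohere's own preprocessing: the function names, the 400-sample frame size (25 ms at an assumed 16 kHz sample rate), and the -40 dBFS threshold are all assumptions chosen for the example. A trained voice activity detector would likely be more robust than a fixed energy threshold.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def noise_gate(
    samples: Sequence[float],
    frame_size: int = 400,        # assumption: 25 ms at 16 kHz
    threshold_db: float = -40.0,  # assumption: tune per recording setup
) -> List[float]:
    """Zero out every frame whose RMS level is below threshold_db (dBFS)."""
    threshold = 10.0 ** (threshold_db / 20.0)  # convert dBFS to linear amplitude
    gated: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if frame_rms(frame) >= threshold:
            gated.extend(frame)                # loud enough: keep unchanged
        else:
            gated.extend([0.0] * len(frame))   # too quiet: replace with silence
    return gated


# Hypothetical input: one frame of faint hiss followed by one frame of a louder tone.
hiss = [0.001] * 400
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(400)]
cleaned = noise_gate(hiss + tone)
```

In this example the hiss frame falls below the threshold and is replaced with silence, while the tone frame passes through unchanged. A hard gate like this can clip quiet speech onsets, so in practice the threshold would need tuning against real recordings.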