Introducing Cohere-transcribe: state-of-the-art speech recognition

Blog post from HuggingFace

Post Details
Author: Julian Mack, Ekagra Ranjan, Walter Beller-Morales, Bharat Venkitesh, and Pierre Richemond
Word Count: 1,485
Summary

Cohere-transcribe-03-2026 is a newly launched 2-billion-parameter speech recognition model from CohereLabs. It targets state-of-the-art accuracy across 14 enterprise-critical languages and is open-sourced on Hugging Face under an Apache 2.0 license. In English it outperforms existing proprietary and open-source competitors, taking the top spot on the Hugging Face Open ASR Leaderboard, and it shows comparable or superior performance in the other 13 languages.

The model uses an encoder-decoder transformer architecture with cross-attention and places over 90% of its parameters in the encoder. Because the encoder processes the audio once while the decoder runs once per generated token, this split keeps autoregressive inference compute to a minimum. Cohere-transcribe was trained on 0.5 million hours of curated audio and transcripts, supplemented with synthetic data, and uses a multilingual tokenizer with byte fallback to handle varied language inputs. A collaboration with vLLM supports efficient, scalable production deployment, with up to twice the throughput of similar models.

The model has limitations: it is not specifically trained for code-switched audio, and it may need a noise gate or voice activity detection (VAD) in front of it to avoid errors caused by non-speech sounds. Cohere-transcribe is a significant step in Cohere's efforts to improve audio experiences on its North enterprise platform, and it is available for experimentation via Hugging Face and Cohere's API.
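To illustrate the noise gate or voice activity detection step mentioned in the summary, here is a minimal sketch of an energy-based gate in plain Python. It is not Cohere's method, and the post does not say which detector to use. The function names (`frame_rms`, `speech_segments`), the 30 ms frame size, the RMS threshold of 0.02, and the minimum run of 3 frames are all assumptions chosen for illustration. Production systems would more likely use a trained VAD.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each full, non-overlapping frame."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms


def speech_segments(samples, sample_rate=16000, frame_ms=30,
                    threshold=0.02, min_frames=3):
    """Return (start_sec, end_sec) spans whose frame RMS stays at or above threshold.

    All default values are illustrative assumptions, not tuned settings.
    Samples are expected as floats in the range [-1.0, 1.0].
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_rms(samples, frame_len)
    segments = []
    run_start = None
    # A trailing 0.0 sentinel closes any run that reaches the end of the audio.
    for i, energy in enumerate(energies + [0.0]):
        if energy >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # Keep only runs long enough to be plausible speech, not a click.
            if i - run_start >= min_frames:
                segments.append((run_start * frame_len / sample_rate,
                                 i * frame_len / sample_rate))
            run_start = None
    return segments
```

Only the returned spans would then be passed to the transcription model, so silence and short clicks never reach it. A fixed energy threshold cannot tell loud non-speech such as music or keyboard noise from speech, which is one reason a trained VAD is usually the better choice.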