Best open source speech-to-text (STT) model in 2026 (with benchmarks)
Blog post from Northflank
In 2026, the leading open-source speech-to-text (STT) models include Canary Qwen 2.5B, IBM Granite Speech 3.3 8B, Whisper Large V3, Whisper Large V3 Turbo, Parakeet TDT, and Moonshine. Each excels in a different area: accuracy, multilingual support, real-time processing, or edge deployment. The models are compared on word error rate (WER), real-time factor (RTF), latency, supported languages, and model size. Compared with commercial services, open-source models offer flexibility and cost advantages.

Canary Qwen 2.5B stands out for English accuracy, IBM Granite Speech 3.3 8B for enterprise-grade applications, and Whisper Large V3 for multilingual coverage. Parakeet TDT is optimized for ultra-low-latency streaming, while Moonshine is designed for mobile and edge devices.

Deploying these models effectively on a platform like Northflank means weighing model size and VRAM usage against what the application needs in speed, accuracy, and deployment environment. The choice between open-source and commercial STT usually comes down to cost, data privacy, customization needs, and deployment scale.
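The two headline metrics named above, WER and RTF, are simple to compute yourself when comparing models on your own audio. The sketch below is a minimal illustration, not the benchmark code behind any published numbers: it takes WER as word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and RTF as processing time divided by audio duration. The function names are made up for this example, and the text normalization (lowercasing and whitespace splitting) is a simplifying assumption; real evaluations usually normalize punctuation and numerals as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one row at a time.
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (ref word missing)
                curr[j - 1] + 1,                  # insertion (extra hyp word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 errors over 6 words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    # Hypothetical timing: 3 seconds to transcribe 60 seconds of audio.
    rtf = real_time_factor(3.0, 60.0)
    print(f"WER: {wer:.3f}  RTF: {rtf:.3f}")
```

Lower is better for both metrics as defined here. Some benchmarks report the inverse of RTF (audio duration divided by processing time, where higher is better), so check which convention a published figure uses before comparing models.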