Automatic Speech Recognition (ASR): How Speech-to-Text Models Work—and Which One to Use
Blog post from Gladia
Automatic speech recognition (ASR), also known as speech-to-text (STT) technology, is a rapidly advancing field crucial for applications like voice assistants, transcription tools, and real-time communication. In late 2025, Bruno Hays from Gladia analyzed various ASR architectures to guide model selection, focusing on modern architectures such as encoder-decoder, CTC, encoder-transducer, and speech large language models (LLMs). These architectures come with distinct tradeoffs in speed, accuracy, and data requirements, making each suited to different use cases. For instance, encoder-decoder models like Whisper are robust to noisy data, while models like Wav2Vec2 excel at fine-tuning. The Kyutai-STT model introduces delayed streams modeling for real-time interaction, and NVIDIA's Nemotron-Speech-Streaming-En-0.6B uses a Cache-Aware FastConformer encoder for minimal latency.

Selecting the appropriate ASR model means weighing factors like word error rate, end goals, input audio type, and performance requirements, as there is no one-size-fits-all solution.
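To make the CTC tradeoff concrete, here is a minimal sketch of greedy CTC decoding, the simplest way a CTC model's per-frame label predictions are turned into text: collapse consecutive repeats, then drop the blank token. The blank symbol and the example input are hypothetical; production systems typically use beam search with a language model instead.

```python
# Greedy CTC decoding sketch. "_" stands in for the CTC blank token
# (the actual blank symbol is model-specific; this is an assumption).
BLANK = "_"

def ctc_greedy_decode(frame_labels):
    """Collapse adjacent repeated labels, then remove blanks."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev:          # keep only the first of each run
            collapsed.append(label)
        prev = label
    return "".join(l for l in collapsed if l != BLANK)

# Per-frame predictions for a short utterance (hypothetical):
print(ctc_greedy_decode(list("hh_e_ll_llo_")))  # prints "hello"
```

Note how the blank token lets the model emit the same character twice in a row ("ll" in "hello") without it being collapsed away, which is the key trick that makes frame-by-frame CTC prediction fast but also why it lags encoder-decoder models on accuracy.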