In pursuit of a better streaming Speech-to-Text (STT) system, the authors developed custom kernels to balance low latency and high concurrency without compromising accuracy. Traditional methods like "fast batching" were found to add latency and limit the audio context available to the model, prompting a shift toward a "true streaming" approach. True streaming, however, means serving batches of asynchronous, incomplete streams whose caches grow at different rates, and existing torch kernels such as scaled dot product attention handle these batches inefficiently. To overcome this, the team wrote custom CUDA kernels that fuse cache access with the attention computation itself, eliminating redundant indexing and the intermediate memory traffic of gathering and padding each stream's cache. This resulted in a significant reduction in compute time and an 80% improvement in concurrency over the native torch path, making the system more suitable for real-time applications.
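To make the fusion concrete, below is a minimal CUDA sketch of the idea under stated assumptions, not the authors' actual kernel (which likely fuses more operations and supports lower precision). It implements a decode-style attention step: each thread block serves one (stream, head) pair, reads keys and values directly out of that stream's cache slot, and honors that stream's own valid length, so asynchronous streams with differing context sizes run in a single launch with no gather/pad/concat staging pass. The name `fused_cache_attention`, the tensor layouts, and the sizes `HEAD_DIM`, `TILE`, and `cache_len` are all illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <float.h>

// Illustrative sizes; a production kernel would templatize these.
#define HEAD_DIM 64
#define TILE 64  // threads per block: one cached position per thread per tile

// Fused decode-step attention over a per-stream KV cache.
// Grid: (batch, num_heads). Block: TILE threads. Assumes cache_len[b] >= 1.
__global__ void fused_cache_attention(
    const float* __restrict__ q,         // [B, H, HEAD_DIM] query for the newest frame
    const float* __restrict__ k_cache,   // [B, H, l_max, HEAD_DIM]
    const float* __restrict__ v_cache,   // [B, H, l_max, HEAD_DIM]
    const int*   __restrict__ cache_len, // [B] valid cached positions per stream
    float*       __restrict__ out,       // [B, H, HEAD_DIM]
    int num_heads, int l_max)
{
    const int b = blockIdx.x, h = blockIdx.y, tid = threadIdx.x;
    const int len = cache_len[b];        // this stream's own context length
    const float scale = rsqrtf((float)HEAD_DIM);

    const size_t bh = (size_t)b * num_heads + h;
    const float* kb = k_cache + bh * l_max * HEAD_DIM;
    const float* vb = v_cache + bh * l_max * HEAD_DIM;

    __shared__ float q_s[HEAD_DIM];      // query, staged once
    __shared__ float w_s[TILE];          // softmax weights for the current tile
    __shared__ float red[TILE];          // reduction scratch
    q_s[tid] = q[bh * HEAD_DIM + tid];
    __syncthreads();

    // Pass 1: max raw score, for a numerically stable softmax.
    float local_max = -FLT_MAX;
    for (int t = tid; t < len; t += TILE) {
        float s = 0.f;
        for (int d = 0; d < HEAD_DIM; ++d) s += q_s[d] * kb[(size_t)t * HEAD_DIM + d];
        local_max = fmaxf(local_max, s * scale);
    }
    red[tid] = local_max;
    __syncthreads();
    for (int stride = TILE / 2; stride > 0; stride >>= 1) {
        if (tid < stride) red[tid] = fmaxf(red[tid], red[tid + stride]);
        __syncthreads();
    }
    const float m = red[0];
    __syncthreads();

    // Pass 2: per-tile exp weights, softmax denominator, and the weighted V sum.
    float denom_part = 0.f, acc = 0.f;   // each thread owns output dim `tid`
    for (int tile = 0; tile < len; tile += TILE) {
        const int t = tile + tid;
        float w = 0.f;
        if (t < len) {
            float s = 0.f;
            for (int d = 0; d < HEAD_DIM; ++d) s += q_s[d] * kb[(size_t)t * HEAD_DIM + d];
            w = __expf(s * scale - m);
        }
        w_s[tid] = w;
        denom_part += w;
        __syncthreads();
        const int valid = min(TILE, len - tile);
        for (int j = 0; j < valid; ++j)
            acc += w_s[j] * vb[(size_t)(tile + j) * HEAD_DIM + tid];
        __syncthreads();                 // keep w_s intact until all reads finish
    }

    // Block-wide sum of the denominator, then the normalized output.
    red[tid] = denom_part;
    __syncthreads();
    for (int stride = TILE / 2; stride > 0; stride >>= 1) {
        if (tid < stride) red[tid] += red[tid + stride];
        __syncthreads();
    }
    out[bh * HEAD_DIM + tid] = acc / red[0];
}
```

A host-side launch would look like `fused_cache_attention<<<dim3(batch, num_heads), TILE>>>(...)`. The design point the sketch illustrates is the one the paragraph describes: because each block indexes the cache by its own `cache_len[b]`, there is no per-step gather of each stream's keys and values into a padded batch before calling a stock attention kernel; the indexing and the attention math happen in one fused pass.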