In pursuit of a better streaming Speech-to-Text (STT) system, the authors developed custom kernels to balance low latency and high concurrency without compromising accuracy. Traditional methods like "fast batching" were found to add latency and limit the audio context available to the model, prompting a shift toward a "true streaming" approach. True streaming, however, means serving batches of asynchronous, incomplete streams whose caches grow at different rates, and existing torch kernels such as scaled dot product attention handle these batches inefficiently. To overcome this, the team wrote custom CUDA kernels that fuse cache access with the attention computation itself, eliminating redundant indexing and the intermediate memory traffic of gathering and padding each stream's cache. This resulted in a significant reduction in compute time and an 80% improvement in concurrency over the native torch path, making the system more suitable for real-time applications.
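To make the fusion concrete, below is a minimal CUDA sketch of the idea under stated assumptions, not the authors' actual kernel (which likely fuses more operations and supports lower precision). It implements a decode-style attention step: each thread block serves one (stream, head) pair, reads keys and values directly out of that stream's cache slot, and honors that stream's own valid length, so asynchronous streams with differing context sizes run in a single launch with no gather/pad/concat staging pass. The name `fused_cache_attention`, the tensor layouts, and the sizes `HEAD_DIM`, `TILE`, and `cache_len` are all illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <float.h>

// Illustrative sizes; a production kernel would templatize these.
#define HEAD_DIM 64
#define TILE 64  // threads per block: one cached position per thread per tile

// Fused decode-step attention over a per-stream KV cache.
// Grid: (batch, num_heads). Block: TILE threads. Assumes cache_len[b] >= 1.
__global__ void fused_cache_attention(
    const float* __restrict__ q,         // [B, H, HEAD_DIM] query for the newest frame
    const float* __restrict__ k_cache,   // [B, H, l_max, HEAD_DIM]
    const float* __restrict__ v_cache,   // [B, H, l_max, HEAD_DIM]
    const int*   __restrict__ cache_len, // [B] valid cached positions per stream
    float*       __restrict__ out,       // [B, H, HEAD_DIM]
    int num_heads, int l_max)
{
    const int b = blockIdx.x, h = blockIdx.y, tid = threadIdx.x;
    const int len = cache_len[b];        // this stream's own context length
    const float scale = rsqrtf((float)HEAD_DIM);

    const size_t bh = (size_t)b * num_heads + h;
    const float* kb = k_cache + bh * l_max * HEAD_DIM;
    const float* vb = v_cache + bh * l_max * HEAD_DIM;

    __shared__ float q_s[HEAD_DIM];      // query, staged once
    __shared__ float w_s[TILE];          // softmax weights for the current tile
    __shared__ float red[TILE];          // reduction scratch
    q_s[tid] = q[bh * HEAD_DIM + tid];
    __syncthreads();

    // Pass 1: max raw score, for a numerically stable softmax.
    float local_max = -FLT_MAX;
    for (int t = tid; t < len; t += TILE) {
        float s = 0.f;
        for (int d = 0; d < HEAD_DIM; ++d) s += q_s[d] * kb[(size_t)t * HEAD_DIM + d];
        local_max = fmaxf(local_max, s * scale);
    }
    red[tid] = local_max;
    __syncthreads();
    for (int stride = TILE / 2; stride > 0; stride >>= 1) {
        if (tid < stride) red[tid] = fmaxf(red[tid], red[tid + stride]);
        __syncthreads();
    }
    const float m = red[0];
    __syncthreads();

    // Pass 2: per-tile exp weights, softmax denominator, and the weighted V sum.
    float denom_part = 0.f, acc = 0.f;   // each thread owns output dim `tid`
    for (int tile = 0; tile < len; tile += TILE) {
        const int t = tile + tid;
        float w = 0.f;
        if (t < len) {
            float s = 0.f;
            for (int d = 0; d < HEAD_DIM; ++d) s += q_s[d] * kb[(size_t)t * HEAD_DIM + d];
            w = __expf(s * scale - m);
        }
        w_s[tid] = w;
        denom_part += w;
        __syncthreads();
        const int valid = min(TILE, len - tile);
        for (int j = 0; j < valid; ++j)
            acc += w_s[j] * vb[(size_t)(tile + j) * HEAD_DIM + tid];
        __syncthreads();                 // keep w_s intact until all reads finish
    }

    // Block-wide sum of the denominator, then the normalized output.
    red[tid] = denom_part;
    __syncthreads();
    for (int stride = TILE / 2; stride > 0; stride >>= 1) {
        if (tid < stride) red[tid] += red[tid + stride];
        __syncthreads();
    }
    out[bh * HEAD_DIM + tid] = acc / red[0];
}
```

A host-side launch would look like `fused_cache_attention<<<dim3(batch, num_heads), TILE>>>(...)`. The design point the sketch illustrates is the one the paragraph describes: because each block indexes the cache by its own `cache_len[b]`, there is no per-step gather of each stream's keys and values into a padded batch before calling a stock attention kernel; the indexing and the attention math happen in one fused pass.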