Scaling Real-Time Voice Agents with Cache-Aware Streaming ASR
Blog post from Hugging Face
NVIDIA's Nemotron Speech ASR introduces a groundbreaking cache-aware streaming architecture that significantly enhances the efficiency and scalability of real-time Automatic Speech Recognition (ASR) systems. By leveraging FastConformer architecture and 8x downsampling, it processes only new audio "deltas," maximizing GPU throughput and minimizing redundant computations that traditionally plagued buffered inference models, which often led to latency drift and computational inefficiency. This innovative approach allows the model to maintain stable latency and high concurrency, supporting up to 560 concurrent streams on the NVIDIA H100, with dynamic, runtime-configurable latency modes. Nemotron Speech ASR's integration into real-world applications, such as those by Daily and Modal, demonstrates its ability to sustain low-latency, high-speed, and accurate speech recognition, setting a new standard for real-time voice agents that do not compromise on speed, accuracy, or scalability.