How We Took Aura-2’s TTFB from <200ms to 90ms: Engineering Real-Time Voice AI at Scale

Post Details

Company

Deepgram

Date Published

Nov. 14, 2025

Author

Adam Sypniewski

Word Count

1,455

Language

English

Hacker News Points

-

Source URL

deepgram.com/learn/engineering-real-time-low-latency-voice-ai-at-scale

Summary

Deepgram improved the latency of Aura-2's real-time text-to-speech (TTS) system by reengineering the runtime for parallelism and orchestration rather than adding hardware. The result is consistent sub-200ms latency, around 90ms under steady-state conditions. The work targeted two problems, time to first byte (TTFB) and concurrency, and kept each GPU fully utilized without bottlenecks through workload partitioning and dynamic orchestration. By isolating prompt processing from audio synthesis and applying GPU scheduling and memory management techniques, Aura-2 supports high concurrency and keeps latency low under increased load, without escalating cost or complexity. These gains rest on a systems foundation built in Rust, which allows fine-grained orchestration and efficiency. The post's conclusion is that careful runtime engineering can outperform the traditional approach of simply adding more hardware.
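To make the idea of isolating prompt processing from audio synthesis concrete, here is a minimal Rust sketch of a two-stage pipeline connected by channels. It is an illustration under stated assumptions, not Deepgram's implementation: the names `Prepared`, `AudioChunk`, `process_prompt`, `synthesize`, and `run_pipeline` are hypothetical, both stages are CPU stand-ins rather than GPU work, and no scheduling or memory management is modeled.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical output of the prompt-processing stage.
struct Prepared {
    id: usize,
    tokens: Vec<u32>,
}

/// Hypothetical audio chunk emitted by the synthesis stage.
#[derive(Debug, PartialEq)]
pub struct AudioChunk {
    pub request_id: usize,
    pub seq: usize,
    pub samples: Vec<i16>,
}

/// Stand-in for prompt processing: maps each input byte to a fake token.
fn process_prompt(id: usize, text: &str) -> Prepared {
    Prepared {
        id,
        tokens: text.bytes().map(u32::from).collect(),
    }
}

/// Stand-in for synthesis: emits one chunk per `chunk_tokens` tokens.
fn synthesize(p: &Prepared, chunk_tokens: usize) -> Vec<AudioChunk> {
    p.tokens
        .chunks(chunk_tokens)
        .enumerate()
        .map(|(seq, toks)| AudioChunk {
            request_id: p.id,
            seq,
            // Tokens come from bytes (0..=255), so the cast cannot overflow.
            samples: toks.iter().map(|&t| t as i16).collect(),
        })
        .collect()
}

/// Runs both stages on separate threads so that prompt processing for the
/// next request can overlap with synthesis of the current one.
pub fn run_pipeline(texts: Vec<String>, chunk_tokens: usize) -> Vec<AudioChunk> {
    assert!(chunk_tokens > 0, "chunk_tokens must be positive");
    let (prep_tx, prep_rx) = mpsc::channel::<Prepared>();
    let (audio_tx, audio_rx) = mpsc::channel::<AudioChunk>();

    // Stage 1: prompt processing only.
    let prompt_stage = thread::spawn(move || {
        for (id, text) in texts.iter().enumerate() {
            if prep_tx.send(process_prompt(id, text)).is_err() {
                break; // downstream stage has gone away
            }
        }
        // `prep_tx` is dropped here, which closes the channel.
    });

    // Stage 2: synthesis only; chunks are forwarded as soon as they exist.
    let synth_stage = thread::spawn(move || {
        for prepared in prep_rx {
            for chunk in synthesize(&prepared, chunk_tokens) {
                if audio_tx.send(chunk).is_err() {
                    return;
                }
            }
        }
    });

    // The caller drains chunks until the synthesis stage closes its channel.
    let out: Vec<AudioChunk> = audio_rx.iter().collect();
    prompt_stage.join().expect("prompt stage panicked");
    synth_stage.join().expect("synthesis stage panicked");
    out
}
```

Calling `run_pipeline(vec!["hello".to_string()], 2)` from a `main` function returns the chunks in request order. In a real streaming system the caller would forward the first chunk to the client as soon as it arrives instead of collecting everything, since that first chunk is what determines TTFB.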