How Together AI built the world’s fastest speech-to-text stack

Blog post from Together AI

Post Details
Company: Together AI
Word Count: 1,646
Language: English
Summary

Artificial Analysis highlights the complexities of serving Automatic Speech Recognition (ASR) systems, focusing on why audio is harder to process than text. Text inputs are compact and ready for inference, whereas audio requires extensive preprocessing before it reaches the GPU, which makes ASR a full-path systems problem. The piece examines NVIDIA’s Parakeet-TDT 0.6B v3 and OpenAI’s Whisper Large v3 models, emphasizing the need for efficient GPU execution, CPU preprocessing, and memory management. Key optimizations include running the encoder with TensorRT, using conditional CUDA graphs to streamline decoder operations, and reducing CPU-path overhead with shared memory and evented I/O. The piece also stresses controlling both median and tail latency in voice systems, because ASR latency sets the earliest bound on user-visible response time. Parakeet v3, trained on a large multilingual corpus, expands language support over its predecessor and improves model efficiency.
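The distinction between median and tail latency can be made concrete with a short calculation. The sketch below is illustrative only: the function names and the latency values are hypothetical and are not taken from the post or from any benchmark. It uses the nearest-rank percentile definition to show how a handful of slow requests leaves the median untouched while dominating the tail.

```python
def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, which avoids floating-point rounding.
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Return the median (p50) and tail (p99) latency in milliseconds."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical per-request ASR latencies in milliseconds (made up for
# illustration): 98 fast requests and 2 slow outliers.
latencies_ms = [80] * 98 + [900, 1200]
summary = latency_summary(latencies_ms)
print(summary)
```

With these made-up numbers the median stays at the fast-path value while the 99th percentile lands on one of the outliers, which is why a voice system tuned only for the median can still feel slow to some users.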