
Real-time Speech to Text latency guide: Under 200 ms

Blog post from ElevenLabs

Post Details
Company:
Date Published:
Author: -
Word Count: 4,472
Company Posts That Month: 39
Language: English
Hacker News Points: -
Summary

Real-time speech-to-text (STT) transcribes spoken words into text almost instantaneously; the Scribe v2 Realtime model returns partial transcriptions in approximately 150 milliseconds. Low latency depends largely on architecture: the choice of transport (WebSocket for simplicity, WebRTC for real-time media handling) and how audio chunking and endpointing are managed. The article explains why distinguishing provisional partial transcriptions from committed final ones improves the user experience, and it highlights the role of Voice Activity Detection (VAD) and manual commit controls in finalizing segments. It also stresses using appropriate audio formats, such as PCM, and small chunk sizes to reduce latency. Optimizing these elements of the pipeline yields faster, more reliable transcriptions for applications such as voice agents and live captioning.
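
Two of the ideas in the summary, small PCM chunks and the split between partial and committed transcriptions, can be illustrated with a minimal client-side sketch. It is not taken from the post and does not call any ElevenLabs API: the 16 kHz sample rate, the 16-bit mono format, the event names "partial" and "committed", and the helper names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

SAMPLE_RATE_HZ = 16_000  # assumed sample rate; the post does not state one
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM


def chunk_size_bytes(chunk_ms: int) -> int:
    """Bytes of 16-bit mono PCM that cover chunk_ms milliseconds of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


def split_pcm(pcm: bytes, chunk_ms: int) -> Iterator[bytes]:
    """Yield fixed-duration chunks; the final chunk may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


@dataclass
class TranscriptView:
    """Keeps committed text stable while partial text is overwritten."""
    committed: List[str] = field(default_factory=list)
    partial: str = ""

    def on_event(self, kind: str, text: str) -> None:
        # "partial" and "committed" are hypothetical event names.
        if kind == "partial":
            # Provisional text: replace whatever was shown before.
            self.partial = text
        elif kind == "committed":
            # Final text for the segment: store it and clear the partial.
            self.committed.append(text)
            self.partial = ""
        else:
            raise ValueError(f"unknown event kind: {kind}")

    def render(self) -> str:
        """Committed segments followed by the current partial, if any."""
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

Under these assumptions a 20 ms chunk is 640 bytes, so audio reaches the transcriber in small increments instead of waiting for a large buffer to fill. The view overwrites partial text on every update and appends only when a segment is committed, which keeps already-finalized text from flickering in a caption or agent UI.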

Trends Found in this Post
Trend                 Post Mentions  Total Month Mentions  Posts  Companies  MoM
Real-time             44             5,457                 1,338  238        -5%
Voice AI              5              2,232                 214    48         -36%
LLM                   2              5,172                 1,006  220        -43%
AI Model Fine-tuning  1              694                   169    62         +13%