Real-time Speech to Text latency guide: Under 200 ms

Post Details

Company

ElevenLabs

Date Published

June 25, 2026

Author

-

Word Count

4,472

Company Posts That Month

39

Language

English

Hacker News Points

-

Source URL

elevenlabs.io/blog/real-time-speech-to-text-under-200ms

Summary

Real-time speech-to-text (STT) transcribes speech into text as it is spoken; the Scribe v2 Realtime model returns partial transcriptions in approximately 150 milliseconds. Low latency depends largely on architecture: the choice of transport (WebSocket for simplicity, WebRTC for real-time media handling) and how audio chunking and endpointing are managed. The article distinguishes provisional partial transcriptions from committed final ones, a distinction that shapes the user experience, and describes how Voice Activity Detection (VAD) and manual commit controls finalize segments. It also recommends appropriate audio formats, such as PCM, and small chunk sizes to reduce latency. Optimizing each stage of the pipeline yields faster, more reliable transcription for applications such as voice agents and live captioning.
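
The chunk-size point in the summary can be made concrete with a little arithmetic. The sketch below is only an illustration: the 16 kHz sample rate, 16-bit mono PCM format, and 50 ms chunk duration are assumed values, not figures from the article, and the helper names `chunk_size_bytes` and `iter_chunks` are hypothetical. It shows how chunk duration maps to bytes per message and how a raw PCM buffer would be split before being sent over whichever transport is chosen.

```python
from typing import Iterator


def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM covering chunk_ms of audio (assumed 16-bit mono by default)."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes; the last may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# Assumed example: one second of silent 16 kHz, 16-bit mono PCM, cut into 50 ms chunks.
one_second = bytes(16000 * 2)
size = chunk_size_bytes(16000, 50)
chunks = list(iter_chunks(one_second, size))
print(len(chunks), size)  # 20 1600
```

Smaller chunks mean the first audio reaches the recognizer sooner, at the cost of more messages per second; under these assumed settings a 50 ms chunk adds at most 50 ms of buffering delay before each send.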

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Real-time	44	5,457	1,338	238	-5%
Voice AI	5	2,232	214	48	-36%
LLM	2	5,172	1,006	220	-43%
AI Model Fine-tuning	1	694	169	62	+13%