Real-time Speech to Text latency guide: Under 200 ms
Blog post from ElevenLabs
Real-time speech-to-text (STT) transcribes spoken words into text almost instantaneously; the Scribe v2 Realtime model returns partial transcriptions in approximately 150 milliseconds. Low latency depends largely on architecture: the choice of transport (WebSocket for simplicity, WebRTC for real-time media handling) and how audio chunking and end-pointing are managed. The article explains why provisional partial transcriptions should be distinguished from committed final ones to improve user experience, and it highlights the role of Voice Activity Detection (VAD) and manual commit controls in finalizing segments. It also stresses using an appropriate audio format, such as PCM, and small chunk sizes to reduce latency. By optimizing each stage of the pipeline, developers can make real-time STT faster and more reliable, which is crucial for applications such as voice agents and live captioning.
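To make the point about PCM and small chunk sizes concrete, here is a minimal sketch of how a client might size and split raw audio before streaming it. The format (16 kHz, 16-bit, mono) and the 50 ms chunk duration are illustrative assumptions, not values taken from the article, and the helper names `chunk_bytes` and `iter_chunks` are hypothetical; no ElevenLabs API is called.

```python
from typing import Iterator


def chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Return the number of raw PCM bytes in one chunk of `chunk_ms` milliseconds.

    Defaults assume 16-bit mono audio (an assumption for this sketch).
    """
    frames = sample_rate_hz * chunk_ms // 1000  # audio frames per chunk
    return frames * bytes_per_sample * channels


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `pcm`, each at most `size` bytes long."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# Assumed format: 16 kHz, 16-bit, mono PCM, sent in 50 ms chunks.
size = chunk_bytes(16_000, 50)
one_second_of_silence = b"\x00" * (16_000 * 2)
chunks = list(iter_chunks(one_second_of_silence, size))
print(size, len(chunks))
```

Smaller chunks mean the server receives audio sooner, so each chunk adds less buffering delay before a partial transcription can be produced; the trade-off is more messages per second on the transport.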
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| Real-time | 44 | 5,457 | 1,338 | 238 | -5% |
| Voice AI | 5 | 2,232 | 214 | 48 | -36% |
| LLM | 2 | 5,172 | 1,006 | 220 | -43% |
| AI Model Fine-tuning | 1 | 694 | 169 | 62 | +13% |