Company
Date Published
Author
-
Word count
1387
Language
English
Hacker News points
None

Summary

Low latency in speech-to-text (STT) systems is crucial for voice experiences such as interactive agents and live captioning, where responses must be both fast and accurate. Gladia's approach to measuring latency distinguishes between the time to first partial token (TTFB) and the time to final result, and reports sub-300 ms partial and ~700 ms final latency on 3-second utterances. Latency is measured at multiple milestones, such as from audio capture start to the first partial hypothesis, while controlling factors like frame size, endpointing thresholds, and network jitter, each of which trades latency against stability and accuracy. Real-Time Factor (RTF), the ratio of processing time to audio duration, is used to assess throughput and plan capacity: a system must hold RTF below 1 to keep up with live audio without falling behind. Benchmarks should keep conditions consistent across runs and report P50, P95, and P99 for each milestone rather than a single blended value, so that tail latency and differences between scenarios stay visible.
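As a rough illustration of the reporting described in the summary, the sketch below computes P50, P95, and P99 separately for each milestone and derives an RTF value. It is not Gladia's tooling: the function names, the milestone labels, the nearest-rank percentile method, and all sample numbers are assumptions made up for this example.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples_by_milestone):
    """Report P50/P95/P99 per milestone instead of one blended number."""
    return {
        name: {f"p{p}": percentile(vals, p) for p in (50, 95, 99)}
        for name, vals in samples_by_milestone.items()
    }


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 keeps up with live audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical latency samples in milliseconds, one list per milestone.
samples = {
    "first_partial_ms": [210, 240, 255, 270, 290, 310, 260, 230, 250, 480],
    "final_ms": [650, 700, 690, 720, 710, 680, 705, 695, 730, 1100],
}

report = summarize(samples)
for milestone, stats in report.items():
    print(milestone, stats)

# Hypothetical: 1.5 s of compute spent on a 3-second utterance.
rtf = real_time_factor(1.5, 3.0)
print("RTF:", rtf)
```

With these made-up samples, a single slow request (480 ms, 1100 ms) dominates P95 and P99 while leaving P50 untouched, which is the kind of tail behaviour a single averaged figure would hide.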