How to measure latency in speech-to-text (TTFB, Partials, Finals, RTF): A deep dive

Post Details

Company

Gladia

Date Published

Sept. 30, 2025

Author

-

Word Count

1,387

Language

English

Hacker News Points

-

Source URL

www.gladia.io/blog/measuring-latency-in-stt

Summary

Latency in speech-to-text (STT) systems is central to voice experiences such as interactive agents and live captioning, which depend on fast, accurate responses. Gladia's approach separates the time to the first partial transcript (TTFB, time to first byte) from the time to the final result, and reports sub-300 ms partial latency and roughly 700 ms final latency on 3-second utterances. Measurements are taken at multiple milestones, for example from audio capture start to the first partial hypothesis, while controlling factors such as frame size, endpointing thresholds, and network jitter, which trade latency against stability and accuracy. Real-Time Factor (RTF) is used to assess throughput and plan capacity, so that systems can keep up with live audio without falling behind. When benchmarking, keep conditions consistent and report P50, P95, and P99 for each milestone rather than a single blended value, so that latency differences across scenarios remain visible.
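
To illustrate the per-milestone reporting and the RTF metric described in the summary, here is a minimal Python sketch. It assumes the common definition of RTF as processing time divided by audio duration (values below 1 mean the system keeps up with live audio) and uses the nearest-rank method for percentiles. The function names, milestone labels, and sample latencies are all hypothetical and are not taken from Gladia's post or API.

```python
import math


def real_time_factor(processing_seconds, audio_seconds):
    """RTF under the common definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms_by_milestone, pcts=(50, 95, 99)):
    """Report each milestone separately instead of one blended number."""
    return {
        milestone: {f"P{p}": percentile(values, p) for p in pcts}
        for milestone, values in latencies_ms_by_milestone.items()
    }


# Hypothetical measurements in milliseconds, for illustration only.
measurements = {
    "first_partial": [210, 250, 240, 290, 900],
    "final": [650, 700, 690, 720, 1500],
}

report = summarize(measurements)
for milestone, stats in report.items():
    print(milestone, stats)

# Hypothetical example: 1.5 s of processing for a 3 s utterance.
print("RTF:", real_time_factor(1.5, 3.0))
```

In this made-up data a single slow run dominates P95 and P99 while P50 stays low, which is the kind of tail behavior a single blended average would hide.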