How we solved latency at Vapi
Blog post from Vapi
Latency significantly disrupts conversational flow in real-time voice applications: once a response takes longer than about 1200ms, the pause is long enough to break the conversation. The largest contributor is usually Large Language Model (LLM) processing, and OpenAI's GPT-4o in particular shows unstable response times that vary by time of day and by region.

The first attempt at a fix was brute force: send every request to all deployments at once and take the fastest response. This worked, but multiplying the cost of every request proved far too expensive. The next attempt was a polling system that checked each deployment's speed every 10 minutes and routed traffic to the fastest one. That was cheaper, but it could not react to sudden latency spikes that occurred between polls.

The final solution uses live data from production requests to update routing in real time. Traffic is segmented so that most requests go to the deployment currently known to be fastest (exploit), while a small share probes the other deployments to discover when one of them becomes faster (explore).

Even with live routing, occasional hangs still occurred, so a recovery mechanism was added: requests that exceed a latency threshold, set dynamically for each deployment, are canceled and rerouted. This approach successfully reduced latency, but it highlighted the complex infrastructure work required to make a model like GPT-4o usable in real-time applications.
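The explore/exploit routing described above can be sketched as a simple bandit-style router. This is an illustrative sketch, not Vapi's actual implementation: the class name, the epsilon-greedy split, and the exponentially weighted moving average are all assumptions about how such a router could be built.

```python
import random


class LatencyRouter:
    """Route requests to the fastest deployment while reserving a
    slice of traffic to keep re-measuring the others (explore/exploit).

    Hypothetical sketch: deployment names, explore_rate, and the EWMA
    smoothing are illustrative choices, not a known production design.
    """

    def __init__(self, deployments, explore_rate=0.1, alpha=0.2):
        self.explore_rate = explore_rate  # fraction of traffic used to probe
        self.alpha = alpha                # EWMA smoothing factor
        # Start every deployment at 0.0 (optimistic) so each one is
        # considered fastest until real measurements arrive.
        self.latency_ms = {d: 0.0 for d in deployments}

    def pick(self):
        # Explore: occasionally send a request to a random deployment
        # so a recovered deployment can be rediscovered.
        if random.random() < self.explore_rate:
            return random.choice(list(self.latency_ms))
        # Exploit: otherwise use the lowest estimated latency.
        return min(self.latency_ms, key=self.latency_ms.get)

    def record(self, deployment, observed_ms):
        # Exponentially weighted moving average keeps the estimate
        # responsive to live data without overreacting to one sample.
        prev = self.latency_ms[deployment]
        self.latency_ms[deployment] = (1 - self.alpha) * prev + self.alpha * observed_ms
```

Feeding every completed request's observed latency back through `record` is what makes the routing "live": a spike on one deployment raises its estimate within a few requests, and traffic shifts away without waiting for a polling interval.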
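The recovery mechanism for hung requests can be sketched with per-deployment timeouts and fallback. Again this is a hedged sketch: the `call` coroutine, the `thresholds_ms` table, and the function name are assumed interfaces standing in for whatever the real system uses.

```python
import asyncio


async def complete_with_recovery(prompt, deployments, call, thresholds_ms):
    """Try deployments in order of expected speed; if one exceeds its
    own latency threshold, cancel the request and reroute to the next.

    `call(deployment, prompt)` is an assumed awaitable that performs
    the LLM request; `thresholds_ms` maps each deployment to its
    dynamically set latency threshold in milliseconds.
    """
    last_error = None
    for deployment in deployments:
        timeout_s = thresholds_ms[deployment] / 1000
        try:
            # asyncio.wait_for cancels the in-flight request on timeout,
            # so a hung deployment cannot stall the conversation.
            return await asyncio.wait_for(call(deployment, prompt), timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc  # this deployment was too slow; try the next
    raise RuntimeError("all deployments exceeded their latency thresholds") from last_error
```

Because the thresholds are per-deployment, a region that is normally fast gets a tight budget while a slower fallback gets a looser one, which matches the idea of setting thresholds dynamically rather than using one global timeout.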