How we solved latency at Vapi
Blog post from Vapi
Latency significantly disrupts conversational flow in real-time voice applications: once a response takes longer than about 1200ms, the pause is long enough to break the conversation. The largest contributor is usually Large Language Model (LLM) processing, and OpenAI's GPT-4o in particular shows unstable response times that vary by time of day and by region.

The first attempt at a fix was brute force: send every request to all deployments at once and take the fastest response. This worked, but multiplying the cost of every request proved far too expensive. The next attempt was a polling system that checked each deployment's speed every 10 minutes and routed traffic to the fastest one. That was cheaper, but it could not react to sudden latency spikes that occurred between polls.

The final solution uses live data from production requests to update routing in real time. Traffic is segmented so that most requests go to the deployment currently known to be fastest (exploit), while a small share probes the other deployments to discover when one of them becomes faster (explore).

Even with live routing, occasional hangs still occurred, so a recovery mechanism was added: requests that exceed a latency threshold, set dynamically for each deployment, are canceled and rerouted. This approach successfully reduced latency, but it highlighted the complex infrastructure work required to make a model like GPT-4o usable in real-time applications.
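The explore/exploit routing described above can be sketched as a simple bandit-style router. This is an illustrative sketch, not Vapi's actual implementation: the class name, the epsilon-greedy split, and the exponentially weighted moving average are all assumptions about how such a router could be built.

```python
import random


class LatencyRouter:
    """Route requests to the fastest deployment while reserving a
    slice of traffic to keep re-measuring the others (explore/exploit).

    Hypothetical sketch: deployment names, explore_rate, and the EWMA
    smoothing are illustrative choices, not a known production design.
    """

    def __init__(self, deployments, explore_rate=0.1, alpha=0.2):
        self.explore_rate = explore_rate  # fraction of traffic used to probe
        self.alpha = alpha                # EWMA smoothing factor
        # Start every deployment at 0.0 (optimistic) so each one is
        # considered fastest until real measurements arrive.
        self.latency_ms = {d: 0.0 for d in deployments}

    def pick(self):
        # Explore: occasionally send a request to a random deployment
        # so a recovered deployment can be rediscovered.
        if random.random() < self.explore_rate:
            return random.choice(list(self.latency_ms))
        # Exploit: otherwise use the lowest estimated latency.
        return min(self.latency_ms, key=self.latency_ms.get)

    def record(self, deployment, observed_ms):
        # Exponentially weighted moving average keeps the estimate
        # responsive to live data without overreacting to one sample.
        prev = self.latency_ms[deployment]
        self.latency_ms[deployment] = (1 - self.alpha) * prev + self.alpha * observed_ms
```

Feeding every completed request's observed latency back through `record` is what makes the routing "live": a spike on one deployment raises its estimate within a few requests, and traffic shifts away without waiting for a polling interval.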
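The recovery mechanism for hung requests can be sketched with per-deployment timeouts and fallback. Again this is a hedged sketch: the `call` coroutine, the `thresholds_ms` table, and the function name are assumed interfaces standing in for whatever the real system uses.

```python
import asyncio


async def complete_with_recovery(prompt, deployments, call, thresholds_ms):
    """Try deployments in order of expected speed; if one exceeds its
    own latency threshold, cancel the request and reroute to the next.

    `call(deployment, prompt)` is an assumed awaitable that performs
    the LLM request; `thresholds_ms` maps each deployment to its
    dynamically set latency threshold in milliseconds.
    """
    last_error = None
    for deployment in deployments:
        timeout_s = thresholds_ms[deployment] / 1000
        try:
            # asyncio.wait_for cancels the in-flight request on timeout,
            # so a hung deployment cannot stall the conversation.
            return await asyncio.wait_for(call(deployment, prompt), timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc  # this deployment was too slow; try the next
    raise RuntimeError("all deployments exceeded their latency thresholds") from last_error
```

Because the thresholds are per-deployment, a region that is normally fast gets a tight budget while a slower fallback gets a looser one, which matches the idea of setting thresholds dynamically rather than using one global timeout.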