
How We Built Vapi's Voice AI Pipeline: Part 1

Blog post from Vapi

Post Details
Company: Vapi
Author: Abhishek Sharma
Word Count: 871
Language: English
Hacker News Points: -
Summary

Voice AI systems have traditionally been hindered by what is known as the Batch Processing Cascade: Speech-to-Text, a Large Language Model, and Text-to-Speech run as sequential steps, and the resulting latency makes conversations feel robotic and disjointed. To address this, Vapi developed an approach that processes audio as real-time streams, enabling a more natural, continuous conversational flow.

The streaming architecture runs three parallel streams: the Audio Input Stream, which processes audio in 20ms chunks; the Transcription Stream, which emits partial transcription results; and the Response Generation Stream, which generates responses incrementally and adapts to user input dynamically. The complexity lies in coordinating these streams: handling pauses, interruptions, and other real-world audio challenges requires intelligent decisions based only on partial information. This method represents a significant shift from the traditional architecture, aimed at making voice AI interactions more responsive and fluid.
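To make the coordination concrete, here is a minimal sketch of how three such streams might be wired together with asyncio queues. Everything in it is hypothetical: the function names, the word-per-chunk "transcription", and the drafting logic are stand-ins for real STT and LLM calls, not Vapi's implementation. The point is only the shape: each stage consumes upstream output as it arrives, and the response stage starts working from partial transcripts rather than waiting for the full utterance.

```python
import asyncio

CHUNK_MS = 20  # audio arrives in 20 ms chunks, per the architecture above


async def audio_input_stream(chunks, audio_q):
    # Stream 1: push fixed-size audio chunks downstream as they "arrive".
    for chunk in chunks:
        await audio_q.put(chunk)
    await audio_q.put(None)  # end-of-stream sentinel


async def transcription_stream(audio_q, partial_q):
    # Stream 2: emit a refined partial transcript after every chunk.
    # Here each chunk is just a word; a real STT model would decode audio.
    words = []
    while (chunk := await audio_q.get()) is not None:
        words.append(chunk)
        await partial_q.put(" ".join(words))
    await partial_q.put(None)


async def response_generation_stream(partial_q, responses):
    # Stream 3: start drafting a reply from each partial transcript.
    # Each newer partial supersedes the draft based on the older one,
    # which is the same mechanism that lets a user interruption
    # invalidate an in-flight response.
    while (partial := await partial_q.get()) is not None:
        responses.append(f"draft reply to: {partial!r}")


async def pipeline(chunks):
    audio_q, partial_q = asyncio.Queue(), asyncio.Queue()
    responses = []
    # All three stages run concurrently, not as a batch cascade.
    await asyncio.gather(
        audio_input_stream(chunks, audio_q),
        transcription_stream(audio_q, partial_q),
        response_generation_stream(partial_q, responses),
    )
    return responses


responses = asyncio.run(pipeline(["turn", "off", "the", "lights"]))
```

Note the contrast with the batch cascade: here the response stage has already produced a draft after the first 20 ms chunk, instead of waiting for the whole utterance to be transcribed before the LLM sees any text.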