Sequential Pipeline Architecture for Voice Agents
Blog post from LiveKit
The sequential pipeline is the core architecture behind modern voice agents. Audio passes through a series of specialized stages: Voice Activity Detection (VAD), Speech-to-Text (STT), a Large Language Model (LLM), Text-to-Speech (TTS), and audio transport. Each stage transforms its input and hands the result to the next, which keeps the design modular and lets each stage be tested independently. Streaming at every stage reduces latency, which is crucial for natural, conversational interaction. The sequential pipeline is the default because of its control and transparency; Speech-to-Speech (S2S) models are an alternative that lowers latency at the cost of less granular control. Because the design is modular, components can be swapped and tools integrated, which makes the agent more capable and adaptable. LiveKit's framework supports this architecture with easy setup and latency optimizations for robust voice agent deployments, and it leaves developers free to explore other configurations and advanced multi-agent patterns built on the same foundation.
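
To make the stage-by-stage flow concrete, here is a minimal sketch in plain Python. It is not LiveKit's API: every function name (`vad`, `stt`, `llm`, `tts`, `transport`, `run_pipeline`) is a hypothetical stand-in, audio frames are faked as strings, and the "LLM" simply echoes the transcript. The only thing it aims to show is the shape of the architecture: each stage is a generator that consumes the previous stage's stream, and any stage can be swapped for another with the same interface.

```python
from typing import Callable, Iterable, Iterator, List

# All stages below are hypothetical toy stand-ins, not real models or LiveKit APIs.
# A "frame" is faked as a string: either a spoken word or the marker "<silence>".

def vad(frames: Iterable[str]) -> Iterator[str]:
    """Voice Activity Detection: forward only frames that contain speech."""
    for frame in frames:
        if frame != "<silence>":
            yield frame

def stt(speech_frames: Iterable[str]) -> Iterator[str]:
    """Speech-to-Text: emit one lowercase word per speech frame."""
    for frame in speech_frames:
        yield frame.lower()

def llm(words: Iterable[str]) -> Iterator[str]:
    """LLM: read the whole user turn, then stream the reply token by token."""
    prompt = " ".join(words)
    reply = f"you said: {prompt}"
    for token in reply.split():
        yield token

def tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """Text-to-Speech: emit one fake audio chunk per token as it arrives."""
    for token in tokens:
        yield token.encode("utf-8")

def transport(chunks: Iterable[bytes]) -> List[bytes]:
    """Audio transport: collect the chunks that would be sent to the user."""
    return [chunk for chunk in chunks]

def run_pipeline(
    frames: Iterable[str],
    tts_stage: Callable[[Iterable[str]], Iterator[bytes]] = tts,
) -> List[bytes]:
    """Chain the stages; each one pulls lazily from the stage before it."""
    return transport(tts_stage(llm(stt(vad(frames)))))

def shouting_tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """A drop-in replacement for tts with the same interface."""
    for token in tokens:
        yield token.upper().encode("utf-8")

mic_input = ["Hello", "<silence>", "Agent", "<silence>"]
default_audio = run_pipeline(mic_input)
swapped_audio = run_pipeline(mic_input, tts_stage=shouting_tts)
```

Because the stages are chained generators, `tts` starts producing chunks as soon as `llm` yields its first token rather than waiting for the full reply, which is the latency benefit of streaming in miniature. Swapping `tts` for `shouting_tts` changes one argument and leaves the other stages untouched, which mirrors the component swapping the modular design allows.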