Sequential Pipeline Architecture for Voice Agents

Blog post from LiveKit

Post Details
Company: LiveKit
Author: Jesse Hall
Word Count: 3,075
Language: English
Summary

The sequential pipeline is the core architecture behind modern voice agents. It processes audio through a series of specialized stages: Voice Activity Detection (VAD), Speech-to-Text (STT), Large Language Model (LLM), Text-to-Speech (TTS), and Audio Transport. Each stage transforms its input and passes the result to the next, which keeps the stages modular and lets each one be tested independently. Streaming at every stage reduces latency, which is crucial for natural, conversational interactions. The sequential pipeline is the default choice because of its control and transparency; Speech-to-Speech (S2S) models are an alternative that offers lower latency but less granular control. Because the design is modular, components can be swapped and tools integrated without reworking the rest of the pipeline. LiveKit's framework supports this architecture with easy setup and optimizations aimed at low-latency, robust voice agent deployments, and it leaves developers free to explore other configurations and advanced multi-agent patterns built on this foundation.
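
To make the stage-by-stage flow concrete, here is a minimal sketch of a streaming sequential pipeline built from Python generators. It is not LiveKit's API: every function name (`vad`, `stt`, `llm`, `tts`, `transport`, `run_pipeline`, `shouting_tts`), the `Frame` type, and the fake audio format are hypothetical stand-ins chosen for illustration. Each stage consumes an iterator and yields results as they become available, so downstream stages can start before upstream ones finish.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical frame type: (payload, is_speech). Real audio frames would be PCM bytes.
Frame = tuple[str, bool]
Stage = Callable[[Iterable], Iterator]


def vad(frames: Iterable[Frame]) -> Iterator[str]:
    """Voice Activity Detection stand-in: forward only frames flagged as speech."""
    for payload, is_speech in frames:
        if is_speech:
            yield payload


def stt(speech: Iterable[str]) -> Iterator[str]:
    """Speech-to-Text stand-in: each speech frame already carries one word."""
    for chunk in speech:
        yield chunk.lower()


def llm(words: Iterable[str]) -> Iterator[str]:
    """LLM stand-in: collect the user's turn, then stream a reply token by token."""
    prompt = " ".join(words)
    for token in f"you said: {prompt}".split():
        yield token


def tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """Text-to-Speech stand-in: emit one fake audio chunk per incoming token."""
    for token in tokens:
        yield f"<audio:{token}>".encode()


def transport(audio: Iterable[bytes]) -> Iterator[bytes]:
    """Audio transport stand-in: forward chunks to the client unchanged."""
    yield from audio


def run_pipeline(frames: Iterable[Frame], stages: List[Stage]) -> Iterator:
    """Chain the stages so each one consumes the previous stage's output stream."""
    stream: Iterable = frames
    for stage in stages:
        stream = stage(stream)
    return iter(stream)


def shouting_tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """An alternative TTS stand-in, used to show that one stage can be swapped."""
    for token in tokens:
        yield f"<AUDIO:{token.upper()}>".encode()


frames = [("Hello", True), ("", False), ("World", True)]
default_stages = [vad, stt, llm, tts, transport]
swapped_stages = [vad, stt, llm, shouting_tts, transport]

default_output = list(run_pipeline(frames, default_stages))
swapped_output = list(run_pipeline(frames, swapped_stages))
```

Swapping `tts` for `shouting_tts` changes only one entry in the stage list, which mirrors the component swapping described above. Because each stage is a plain function over an iterator, it can also be tested on its own, for example by calling `list(vad(frames))` without running the rest of the pipeline.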