A Playground for testing Voice AI Agents
Blog post from Agora
Building a voice AI agent involves more than just integrating a large language model (LLM); it requires a complete real-time audio pipeline. Critical components include WebRTC for real-time audio streaming, automatic speech recognition (ASR) to transcribe speech, logic to manage conversation flow, text-to-speech (TTS) to synthesize responses, and mechanisms to handle interruptions and maintain conversation context.

Agora's Conversational AI Engine streamlines this orchestration by managing RTC audio streaming, coordinating the ASR-LLM-TTS pipeline, and handling voice activity detection (VAD) and interruptions. Users can configure different ASR, TTS, and LLM providers and experiment with their settings through a browser-based interface called the Convo AI Playground. The tool lets users test configurations and tune parameters such as VAD without writing any audio streaming code, providing a complete control center for managing conversational AI agents.

The system is designed with modular components to make debugging and feature additions straightforward, and the separation of the audio pipeline from the model layer means the LLM or TTS provider can be swapped without touching the core infrastructure.
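To illustrate the modularity described above, here is a minimal sketch of a swappable ASR-LLM-TTS pipeline. This is a hypothetical illustration, not Agora's actual API: the class and method names (`VoicePipeline`, `transcribe`, `respond`, `synthesize`) are invented for this example, and the stub components stand in for real providers.

```python
# Hypothetical sketch of a modular voice-agent pipeline (NOT Agora's API).
# Each stage implements a small interface, so an LLM or TTS provider can be
# replaced without changing the audio-streaming or orchestration layer.
from dataclasses import dataclass
from typing import Protocol


class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def respond(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class StubASR:
    def transcribe(self, audio: bytes) -> str:
        # Stand-in for a real ASR provider; real code would call a speech API.
        return audio.decode("utf-8")


class StubLLM:
    def respond(self, text: str) -> str:
        # Stand-in for a real LLM call.
        return f"You said: {text}"


class StubTTS:
    def synthesize(self, text: str) -> bytes:
        # Stand-in for real audio synthesis.
        return text.encode("utf-8")


@dataclass
class VoicePipeline:
    asr: ASR
    llm: LLM
    tts: TTS

    def handle_turn(self, audio_in: bytes) -> bytes:
        """One conversational turn: speech in -> transcript -> reply -> speech out."""
        transcript = self.asr.transcribe(audio_in)
        reply = self.llm.respond(transcript)
        return self.tts.synthesize(reply)


# Because each stage only depends on its interface, swapping StubTTS for a
# different provider requires no changes to the pipeline itself.
pipeline = VoicePipeline(StubASR(), StubLLM(), StubTTS())
audio_out = pipeline.handle_turn(b"hello")
print(audio_out)  # b'You said: hello'
```

This is the same separation the Playground exposes through configuration: the orchestration layer stays fixed while the provider behind each interface is chosen at setup time.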