Company
Date Published
Author
Ben Shababo
Word count
2683
Language
English
Hacker News points
None

Summary

In a detailed walkthrough of building a low-latency voice AI chatbot, the article shows how Modal, the open-source Pipecat framework, and open models combine to deliver near real-time conversation. The chatbot chains AI models for speech-to-text, language processing, and text-to-speech, coordinated by Pipecat, which provides modular pipeline processors and stateful conversation management. Running on Modal's infrastructure gives the system autoscaling and resource management, making efficient use of CPUs and GPUs to keep costs low without sacrificing performance. The system reaches voice-to-voice latencies of around one second by using Pipecat's SmallWebRTCTransport for peer-to-peer connections, Modal Tunnels to cut network latency, and a deliberate choice of models: Parakeet for STT, Qwen3 as the LLM, and Kokoro for TTS. The article also covers keeping services geographically close to one another to minimize latency, and discusses the challenges of maintaining performance across distributed components along with the solutions adopted. Finally, it describes using speaker diarization and audio analysis to refine voice-to-voice latency measurements, with the full implementation available in a GitHub repository.
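
As a rough orientation for the hosting pattern the summary describes, the sketch below shows how a single Modal function might own one conversation's Pipecat pipeline and expose its signaling port through a Modal Tunnel. This is a hypothetical minimal sketch, not the article's code: the app name, port, image contents, and the stand-in sleep are illustrative assumptions, and the Pipecat pipeline itself (SmallWebRTCTransport plus the Parakeet, Qwen3, and Kokoro services) is only indicated in comments; the full implementation lives in the article's GitHub repository.

```python
# Hypothetical hosting sketch: one Modal function per conversation, with a
# Modal Tunnel exposing the WebRTC signaling port from inside the container.
import time

import modal

app = modal.App("voice-bot-hosting-sketch")  # illustrative name

# The bot container needs Pipecat plus whatever transport/model-client extras
# the real implementation uses; this single pin is an assumption.
image = modal.Image.debian_slim(python_version="3.11").pip_install("pipecat-ai")


@app.function(image=image, timeout=10 * 60)
def run_session():
    # modal.forward() opens a Modal Tunnel: a public TLS endpoint that routes
    # directly into this container, which keeps the signaling hop short.
    with modal.forward(8080) as tunnel:
        print(f"Signaling URL for the browser client: {tunnel.url}")

        # In the real bot, this is where the Pipecat pieces get assembled,
        # roughly along the lines of:
        #   pipeline = Pipeline([transport.input(), stt, llm, tts, transport.output()])
        #   await PipelineRunner().run(PipelineTask(pipeline))
        # with SmallWebRTCTransport as the transport and stt/llm/tts pointing
        # at the Modal-hosted Parakeet, Qwen3, and Kokoro services.
        time.sleep(60)  # stand-in for the lifetime of one conversation


@app.local_entrypoint()
def main():
    # `modal run` this file to spin up one bot session container.
    run_session.remote()
```

Keeping this coordinating function and the model services close together on Modal is what the article leans on to minimize the network hops between distributed components.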