Namo-Turn-Detection-v1: Semantic Turn Detection for AI Voice Agents
Blog post from Video SDK
NAMO Turn Detector v1 (NAMO-v1) is an open-source, ONNX-optimized model that improves real-time voice systems by predicting conversational boundaries through semantic understanding rather than silence alone. It addresses the limitations of existing Voice Activity Detection (VAD) and Automatic Speech Recognition (ASR) endpointing, which often cause premature cut-offs or awkwardly long pauses.

NAMO-v1 runs in under 19 ms for specialized single-language models and under 29 ms for the multilingual model, with up to 97.3% accuracy, making it a practical replacement for VAD-based endpointing. It operates across 23 languages without per-language tuning, using natural language understanding to analyze the context of speech and distinguish complete utterances from incomplete ones.

The result is quicker, more natural responses in voice AI systems: fewer interruptions, consistent behavior across languages and markets, and a lightweight, production-ready footprint for enterprise applications.
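To make the idea concrete, here is a minimal sketch of how a semantic completeness score can replace a fixed silence timeout in an agent's endpointing loop. The `stub_completeness` function below is a hypothetical stand-in heuristic, not the NAMO-v1 model; a real system would run the ONNX classifier on the live transcript instead. All function names and timing values are illustrative assumptions, not part of the NAMO-v1 API.

```python
def stub_completeness(transcript: str) -> float:
    """Hypothetical stand-in for an 'utterance complete' probability.

    A real deployment would run the NAMO-v1 ONNX model here; this
    heuristic only exists so the example is self-contained.
    """
    text = transcript.strip()
    if not text:
        return 0.0
    if text.endswith(("and", "but", "so", "because", ",")):
        return 0.1   # trailing conjunction: the speaker is likely mid-thought
    if text.endswith(("?", ".", "!")):
        return 0.95  # sentence-final punctuation: likely a complete turn
    return 0.5       # ambiguous: neither clearly complete nor incomplete


def endpoint_delay_ms(transcript: str,
                      complete_threshold: float = 0.8,
                      fast_ms: int = 200,
                      slow_ms: int = 1200) -> int:
    """Choose how long to wait after silence before the agent replies.

    A semantically complete utterance gets a short grace period (fast,
    natural turn-taking); an incomplete one gets a long grace period
    (so the user is not cut off mid-sentence).
    """
    score = stub_completeness(transcript)
    return fast_ms if score >= complete_threshold else slow_ms


if __name__ == "__main__":
    print(endpoint_delay_ms("What is the weather today?"))   # complete: short wait
    print(endpoint_delay_ms("I want to book a flight and"))  # incomplete: long wait
```

This is the core behavioral difference from plain VAD: the wait time adapts to what was said, not just to how long the audio has been silent.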