Company
Date Published
Author
Marcus
Word count
1971
Language
English
Hacker News points
None

Summary

Smart Turn v2 is an updated version of the open-source voice activity detection (VAD) model that improves conversation dynamics by detecting when a speaker has finished talking, using both semantic and vocal cues. This version supports 14 languages, is more than six times smaller than its predecessor, and runs inference three times faster. It takes native audio as input, so it can use intonation and pace rather than relying on a transcription alone, and it is trained on a blend of human and synthetic data. Available through the Pipecat framework, the model supports both local and cloud-based inference. It reaches around 99% accuracy on human-provided datasets and performs well across a range of hardware setups. The update aims to improve user interaction with AI agents by keeping the agent from interrupting a speaker mid-turn, and it recognizes filler words that transcription models often overlook. Development involved extensive experimentation with different architectures and training on multilingual datasets, and work is ongoing to refine dataset quality and further improve accuracy.