Naively Training a Wake Word Model from Scratch
Blog post from Deepgram
The DG Labs team built a wake word detector without the extensive human voice data collection that traditional methods require, generating the training audio instead with synthetic voices from services such as Deepgram and ElevenLabs. The diversity and quality of modern text-to-speech (TTS) technology let the team assemble a robust training dataset at minimal cost and effort. The target wake word was "Zaphod," and the project revealed a problem with false positives, because the word is phonetically similar to common English sounds. The team addressed this with strategic phonetic engineering, a high ratio of negative to positive examples, and extensive data augmentation, which significantly improved the model's accuracy and brought the false accept rate below 10%. The work took a fraction of the time and cost of traditional methods, and it underscores the value of synthetic data and the importance of negative training examples for reliable wake word detection.
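To make the two quantitative ideas in this summary concrete, here is a minimal Python sketch of how a negative-heavy training manifest might be assembled and how a false accept rate might be computed. It is not the DG Labs code: the function names, the file names, and the default of 10 negatives per positive are all assumptions chosen for illustration, since the post summary does not state the ratio the team used.

```python
import random


def build_manifest(positives, negatives, neg_per_pos=10, seed=0):
    """Return shuffled (clip, label) pairs, label 1 = wake word, 0 = anything else.

    neg_per_pos is a hypothetical knob; the source only says the ratio was "high".
    """
    rng = random.Random(seed)
    # Cap the request at the number of negatives actually available.
    wanted = min(len(negatives), neg_per_pos * len(positives))
    sampled = rng.sample(negatives, wanted)
    manifest = [(clip, 1) for clip in positives] + [(clip, 0) for clip in sampled]
    rng.shuffle(manifest)
    return manifest


def false_accept_rate(labels, predictions):
    """Fraction of negative clips that the model wrongly accepted as the wake word."""
    on_negatives = [pred for label, pred in zip(labels, predictions) if label == 0]
    if not on_negatives:
        raise ValueError("need at least one negative example")
    return sum(on_negatives) / len(on_negatives)


# Hypothetical file names standing in for synthetic TTS clips.
positive_clips = ["zaphod_voice_a.wav", "zaphod_voice_b.wav"]
negative_clips = [f"negative_{i:03d}.wav" for i in range(50)]

manifest = build_manifest(positive_clips, negative_clips, neg_per_pos=10)
print(len(manifest), "clips,", sum(label for _, label in manifest), "positive")

# Toy evaluation: one of four negatives triggers the detector.
print(false_accept_rate([0, 0, 0, 0, 1], [1, 0, 0, 0, 1]))
```

In practice the negative pool would presumably include phrases that sound close to the wake word, since those are the cases the summary identifies as the source of false positives.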