How We Built Adaptive Background Speech Filtering at Vapi
Blog post from Vapi
Background speech interference remains a hard problem for denoisers: they are designed to preserve human speech, so they struggle to tell a primary speaker apart from background media such as a TV. An initial attempt to solve this by training an AI model ran into problems with latency, context loss, and cost. Instead, the team took a signal-analysis approach that identifies the distinctive acoustic characteristics of broadcast audio, such as consistent volume levels and sustained energy patterns. The result is Fourier Denoising, an adaptive system that adjusts to the acoustic profile of the environment using techniques like rolling window analysis and a dynamic offset. It automatically switches to more aggressive filtering settings when media patterns are detected and reverts when they cease. Tested in a range of environments, it significantly reduced background interference, most notably in home and call center settings, while maintaining low latency. It is effective in these scenarios but less suitable for dynamic environments and for headphone users. Fourier Denoising is an experimental feature: its parameters can be tuned, and there is room for further enhancement.
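To make the idea of a rolling window with a dynamic offset more concrete, here is a minimal sketch in Python. It is not Vapi's implementation: the class name `AdaptiveGate`, every parameter name, and every default value are assumptions chosen for illustration. It also works on time-domain frame energy only and performs no Fourier analysis. The sketch keeps a rolling window of recent frame levels, treats a full window with a steady level as a sign of background media, and raises the gate offset while that pattern holds.

```python
import math
import statistics
from collections import deque


def frame_db(samples):
    """Return the RMS level of one audio frame in dBFS (samples in [-1, 1])."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-10))  # clamp to avoid log10(0)


class AdaptiveGate:
    """Hypothetical gate: rolling baseline plus an offset that adapts to media."""

    def __init__(self, window_frames=50, default_offset_db=6.0,
                 aggressive_offset_db=12.0, max_spread_db=3.0,
                 silence_floor_db=-60.0):
        # All defaults are illustrative assumptions, not values from the post.
        self.levels = deque(maxlen=window_frames)  # rolling window of dB levels
        self.default_offset_db = default_offset_db
        self.aggressive_offset_db = aggressive_offset_db
        self.max_spread_db = max_spread_db
        self.silence_floor_db = silence_floor_db

    def media_detected(self):
        """True when the full window shows sustained, consistent energy."""
        if len(self.levels) < self.levels.maxlen:
            return False
        baseline = statistics.median(self.levels)
        spread = statistics.pstdev(self.levels)
        return baseline > self.silence_floor_db and spread < self.max_spread_db

    def process(self, samples):
        """Return True if the frame should pass, False if it should be filtered."""
        level = frame_db(samples)
        if self.levels:
            baseline = statistics.median(self.levels)
            if self.media_detected():
                offset = self.aggressive_offset_db
            else:
                offset = self.default_offset_db
            keep = level > baseline + offset
        else:
            keep = True  # no history yet, so let the first frame through
        self.levels.append(level)
        return keep
```

In this sketch, a frame only passes when it is louder than the recent baseline by the current offset. Once the window fills with steady, TV-like levels the offset grows, and when louder or more varied audio enters the window the spread rises and the gate falls back to the default offset.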