A New Framework for Evaluating Voice Agents (EVA)
Blog post from Hugging Face
EVA is a framework for evaluating conversational voice agents on both task accuracy and user experience in multi-turn spoken interactions. Unlike existing evaluation approaches, which treat accuracy and conversational experience separately, EVA assesses both together and reports two primary scores: EVA-A for accuracy and EVA-X for experience. It uses a bot-to-bot audio architecture to simulate realistic conversations and scores agents with a suite of metrics that combines deterministic, code-based checks with LLM-as-Judge methods. Results obtained with EVA show a consistent tradeoff between task completion and user experience, which underlines the need to evaluate voice agents holistically. EVA also surfaces common failure modes, such as named entity transcription errors and difficulty with multi-step workflows. The framework is currently released with a dataset of airline scenarios. The authors plan to extend it to more domains and conditions, with the goal of improving voice agent capabilities while addressing inherent limitations such as bias in LLM-as-Judge models and domain-specific constraints.
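
To make the two-score idea concrete, here is a minimal sketch of how a deterministic, code-based metric and a two-score summary could fit together. It is an illustration only, not EVA's implementation: the summary above does not say how EVA-A and EVA-X are computed, so the unweighted averaging, the `ConversationResult` type, the metric names, and the helper functions are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean
import re


@dataclass
class ConversationResult:
    """Metric values in [0, 1] for one simulated conversation.

    Both field names and the metric keys stored in them are hypothetical.
    """
    accuracy_metrics: dict[str, float]    # e.g. task completion, entity accuracy
    experience_metrics: dict[str, float]  # e.g. conciseness, turn-taking


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def entity_transcribed(expected: str, transcript: str) -> float:
    """Deterministic check: 1.0 if the expected entity appears in the transcript.

    Spaces and punctuation are ignored, so a code spelled out letter by
    letter ("Z K 3 F 9 A") still matches "ZK3F9A".
    """
    return 1.0 if normalize(expected) in normalize(transcript) else 0.0


def aggregate(results: list[ConversationResult]) -> tuple[float, float]:
    """Average each metric group per conversation, then across conversations.

    Unweighted averaging is an assumption of this sketch, not EVA's formula.
    """
    accuracy = mean(mean(r.accuracy_metrics.values()) for r in results)
    experience = mean(mean(r.experience_metrics.values()) for r in results)
    return accuracy, experience


# Usage: one conversation with a misheard confirmation code, one without.
heard_ok = entity_transcribed("ZK3F9A", "Your code is Z K 3 F 9 A, correct?")
heard_bad = entity_transcribed("ZK3F9A", "Your code is Z K 3 S 9 A, correct?")
results = [
    ConversationResult({"task": 1.0, "entity": heard_bad}, {"concise": 0.5}),
    ConversationResult({"task": 1.0, "entity": heard_ok}, {"concise": 1.0}),
]
accuracy_score, experience_score = aggregate(results)
```

Keeping the two groups separate through aggregation is what lets a tradeoff between accuracy and experience show up at all; a single blended score would hide it. The substring check is deliberately crude and can match across word boundaries, so a real metric would need stricter alignment between the expected entity and the transcript.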