
A New Framework for Evaluating Voice Agents (EVA)

Blog post from HuggingFace

Post Details
Company: HuggingFace
Author: Tara Bogavelli, Gabrielle Gauthier Melancon, Katrina Stankiewicz, Nifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, Hari Subramani, and Fanny Riols
Word Count: 2,147
Summary

EVA is a framework for evaluating conversational voice agents on both task accuracy and user experience in multi-turn spoken interactions. Unlike existing evaluation approaches, which treat accuracy and conversational experience separately, EVA measures both together and reports two primary scores: EVA-A for accuracy and EVA-X for experience. The framework uses a bot-to-bot audio architecture to simulate realistic conversations and scores agents with a suite of metrics that combines deterministic, code-based checks with LLM-as-Judge methods. The results show a consistent tradeoff between task completion and user experience, which supports evaluating the two jointly. They also surface common failure modes, such as named entity transcription errors and difficulty with multi-step workflows. EVA is currently released with a dataset of airline scenarios. The authors plan to extend it to more domains and conditions, and they acknowledge limitations such as bias in LLM-as-Judge models and domain-specific constraints.
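
To make the idea of a deterministic, code-based metric concrete, here is a minimal Python sketch. It is not EVA's implementation: the function names, the `ScenarioResult` fields, the sample confirmation code, and the use of plain averages are all assumptions made for illustration. It shows one way a named entity check could be written, and how accuracy and experience signals could be reported as two separate numbers rather than blended into one.

```python
import re
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def entity_preserved(expected: str, transcript: str) -> bool:
    """Hypothetical deterministic check: the normalized entity must
    appear somewhere in the normalized transcript."""
    return normalize(expected) in normalize(transcript)


@dataclass
class ScenarioResult:
    # Both fields are illustrative assumptions, not EVA's schema.
    task_completed: bool       # accuracy-side signal
    experience_rating: float   # experience-side signal in [0, 1], e.g. from a judge model


def summarize(results: list[ScenarioResult]) -> tuple[float, float]:
    """Return (accuracy, experience) as two separate means.
    Simple averaging is an assumption made for this sketch."""
    if not results:
        raise ValueError("no results to summarize")
    accuracy = sum(r.task_completed for r in results) / len(results)
    experience = sum(r.experience_rating for r in results) / len(results)
    return accuracy, experience
```

For example, `entity_preserved("ZK39Q", "Your confirmation code is Z-K-3-9-Q.")` passes, while a transcript containing `ZK38Q` fails. Because normalization removes spaces and punctuation, the check tolerates spelled-out codes but can also match across word boundaries, so a production metric would likely need stricter matching.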