Voice AI agents are transforming customer support, sales, and automated assistance by offering a more natural interface than text chatbots, but they raise evaluation challenges that text agents do not: speech quality, conversational flow, and latency. A voice agent must also cope with background noise, varied accents, and real-time interruptions, and its behavior emerges from a pipeline of components (speech-to-text, natural language understanding, decision logic, response generation, and text-to-speech), each of which should be evaluated both in isolation and end to end.

A practical evaluation suite measures speech recognition accuracy, intent classification accuracy, response quality, latency, task completion, and user satisfaction, combining offline tests on curated datasets with online measurement on live traffic to verify robustness across languages and accents.

Offline results are only a starting point. Production traffic routinely surfaces failure modes that curated test sets miss, such as unfamiliar accents, noisy environments, and unexpected interruptions, so continuous monitoring is essential. Feeding those findings back into the evaluation datasets and scorers lets a voice agent be tuned iteratively for efficiency, accuracy, and user satisfaction across contexts and languages.
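To make the offline side concrete, here is a minimal sketch in plain Python of scorers for three of the metrics above: word error rate for the speech-to-text stage, intent classification accuracy for the NLU stage, and p95 end-to-end latency. All names and data shapes are illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One labeled example in an offline evaluation set (hypothetical schema)."""
    reference_transcript: str   # ground-truth transcript of the audio
    predicted_transcript: str   # what the speech-to-text stage produced
    expected_intent: str        # labeled intent for the utterance
    predicted_intent: str       # what the NLU stage classified
    latency_ms: float           # measured end-to-end response latency

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / max(len(ref), 1)

def p95(values: list[float]) -> float:
    """95th percentile by nearest rank; adequate for small eval sets."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

def score(cases: list[EvalCase]) -> dict[str, float]:
    """Aggregate per-component metrics across the evaluation set."""
    return {
        "mean_wer": sum(word_error_rate(c.reference_transcript,
                                        c.predicted_transcript)
                        for c in cases) / len(cases),
        "intent_accuracy": sum(c.expected_intent == c.predicted_intent
                               for c in cases) / len(cases),
        "p95_latency_ms": p95([c.latency_ms for c in cases]),
    }
```

Running `score` on a curated dataset before each release gives a regression baseline, and the same scorers can be reused on transcripts sampled from production traffic.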
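On the online side, continuous monitoring can start as simply as computing the same metrics over a rolling window of recent production calls and alerting on regressions. The sketch below assumes hypothetical defaults for window size and thresholds; they would need tuning per deployment.

```python
from collections import deque

class RollingMonitor:
    """Tracks recent production calls and flags metric regressions.

    The window size and alert thresholds are illustrative assumptions,
    not recommendations; tune them against your own traffic.
    """
    def __init__(self, window: int = 500,
                 max_p95_latency_ms: float = 1500.0,
                 min_completion_rate: float = 0.85):
        self.latencies = deque(maxlen=window)
        self.completions = deque(maxlen=window)
        self.max_p95_latency_ms = max_p95_latency_ms
        self.min_completion_rate = min_completion_rate

    def record(self, latency_ms: float, task_completed: bool) -> list[str]:
        """Log one finished call; return any alerts the window now triggers."""
        self.latencies.append(latency_ms)
        self.completions.append(task_completed)
        alerts = []
        ordered = sorted(self.latencies)
        p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
        if p95 > self.max_p95_latency_ms:
            alerts.append(f"p95 latency {p95:.0f} ms exceeds budget")
        rate = sum(self.completions) / len(self.completions)
        if rate < self.min_completion_rate:
            alerts.append(f"task completion rate {rate:.0%} below floor")
        return alerts
```

Calls that trigger an alert, or a random sample of all calls, can then be transcribed, labeled, and folded back into the offline dataset, closing the feedback loop described above.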