Evaluating voice AI systems is difficult because traditional metrics like Word Error Rate (WER) often fail to capture the nuances of human communication, such as tone, pacing, and context. This misalignment can lead teams to select models that perform well on paper but fall short in real interactions. To address this, custom evaluation frameworks tailored to specific use cases are recommended, focusing on metrics like entity accuracy in customer support or verbatim accuracy in medical dictation. Additionally, incorporating subjective "vibe evaluations," where testers gauge the naturalness and emotional tone of interactions, can surface issues that quantitative metrics miss. A comprehensive evaluation process therefore balances traditional metrics, custom metrics aligned with product goals, and qualitative feedback to ensure voice AI systems meet user expectations and improve the overall experience.
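To make the contrast concrete, here is a minimal sketch of how a custom framework might pair classic WER with a use-case-specific entity-accuracy check. The function names, the `evaluate` helper, and the normalization choices are illustrative assumptions, not a reference implementation; a real framework would add punctuation handling, fuzzy entity matching, and per-turn aggregation.

```python
from dataclasses import dataclass


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Classic WER: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over word tokens via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[-1][-1] / max(len(ref), 1)


def entity_accuracy(expected_entities: list[str], hypothesis: str) -> float:
    """Fraction of critical entities (order IDs, drug names, dates, ...)
    found verbatim in the transcript, regardless of surrounding word errors."""
    if not expected_entities:
        return 1.0
    hyp = hypothesis.lower()
    hits = sum(1 for entity in expected_entities if entity.lower() in hyp)
    return hits / len(expected_entities)


@dataclass
class EvalResult:
    wer: float
    entity_acc: float


def evaluate(reference: str, hypothesis: str, entities: list[str]) -> EvalResult:
    return EvalResult(
        wer=word_error_rate(reference, hypothesis),
        entity_acc=entity_accuracy(entities, hypothesis),
    )


if __name__ == "__main__":
    ref = "your order 4821 ships on March 3rd"
    hyp = "you're order 4821 ships on march third"
    result = evaluate(ref, hyp, entities=["4821", "March 3rd"])
    # WER penalizes "you're"/"your" and "third"/"3rd" equally, but for a
    # support use case only the missed date rendering actually matters.
    print(f"WER={result.wer:.2f}  entity_accuracy={result.entity_acc:.2f}")
```

In this toy example the hypothesis scores a modest WER (two word substitutions out of seven), yet the entity check drops to 0.5 because "March 3rd" was not preserved verbatim, which is exactly the kind of gap a WER-only comparison would hide.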