Using GPT-4.1 as a case study, the author highlights the limitations of traditional vanity metrics, such as pass rate, for evaluating large language models (LLMs). They introduce "clarity metrics," which focus on actionable intelligence and growth unlocks rather than shallow numbers. The author's evaluation framework, "The Funnel," decomposes each evaluation into a series of cascading steps, each with its own pass/fail criteria. This approach allows for a more nuanced understanding of the system and enables targeted improvements. By analyzing the funnel, the author was able to identify specific issues, such as column hallucinations in GPT-4.1, and develop prompt fixes that improved overall performance without increasing the number of passes with incorrect data. Flux charts and orbit charts provide additional insight into how evaluations move through the funnel, enabling more precise experimentation and optimization.
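To make the funnel idea concrete, here is a minimal sketch of a cascading pass/fail evaluation under assumed details: the stage names, check functions, and the `run_funnel`/`summarize` helpers are illustrative placeholders, not the author's actual harness.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FunnelResult:
    # Stages the example cleared, and the first stage it failed (if any).
    passed_stages: list[str] = field(default_factory=list)
    failed_stage: Optional[str] = None


def run_funnel(example: dict, checks: list[tuple[str, Callable[[dict], bool]]]) -> FunnelResult:
    """Run one evaluation example through cascading pass/fail stages.

    The funnel stops at the first failing stage, so every failure is
    attributed to exactly one stage instead of disappearing into a
    single overall pass rate.
    """
    result = FunnelResult()
    for stage_name, check in checks:
        if check(example):
            result.passed_stages.append(stage_name)
        else:
            result.failed_stage = stage_name
            break
    return result


def summarize(results: list[FunnelResult]) -> Counter:
    """Count failures per stage -- a 'clarity metric' view of the run."""
    return Counter(r.failed_stage or "passed_all" for r in results)


# Hypothetical stages for a query-generation style eval; the real checks
# would be whatever the harness verifies (e.g. hallucinated columns).
checks = [
    ("generated_query", lambda ex: ex.get("sql") is not None),
    ("no_hallucinated_columns",
     lambda ex: set(ex.get("used_columns", [])) <= set(ex.get("schema_columns", []))),
    ("correct_result", lambda ex: ex.get("result") == ex.get("expected")),
]
```

Comparing the per-stage counts from `summarize` before and after a prompt change is the kind of stage-to-stage movement the flux and orbit charts are described as visualizing.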