The article describes a method for quickly evaluating and ranking AI models on production data, focusing on customer service conversations. It introduces Eval Protocol, an open-source toolkit for building an internal model leaderboard in minutes without requiring ground-truth labels. The process deconstructs logged conversations into test cases, generates new responses with challenger models, and uses a large language model (LLM) as an impartial judge to run pairwise comparisons against the original production responses. The method is validated by correlating its rankings with the Tau Bench Airline benchmark, showing that it correctly identifies the best- and worst-performing models. By leveraging production data, this approach offers a fast, reliable, and cost-effective way to choose the right AI model for a specific use case.
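
To make the workflow concrete, here is a minimal sketch of the three steps the summary names: splitting a logged conversation into a test case, generating a challenger response, and asking an LLM judge for a pairwise verdict. This is not Eval Protocol's actual API; the helper names (`build_test_case`, `judge_pair`, `score_challenger`), the judge prompt, and the model names are illustrative assumptions, and only the OpenAI chat-completions calls reflect a real client library.

```python
# Hypothetical sketch of a production-data pairwise evaluation loop.
# Helper names and prompt wording are assumptions, not Eval Protocol's API.

from openai import OpenAI

client = OpenAI()


def build_test_case(conversation: list[dict]) -> tuple[list[dict], str]:
    """Split a logged conversation into (prompt_messages, production_reply).

    The final assistant turn becomes the incumbent's answer; everything
    before it is replayed to the challenger model.
    """
    assert conversation[-1]["role"] == "assistant"
    return conversation[:-1], conversation[-1]["content"]


def judge_pair(prompt_messages, reply_a, reply_b, judge_model="gpt-4o"):
    """Ask an LLM judge which reply better serves the customer; returns 'A' or 'B'.

    Running a second call with A and B swapped (not shown) is a common
    way to control for position bias in pairwise judging.
    """
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in prompt_messages)
    verdict = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": (
                "You are an impartial judge of customer-service replies.\n"
                f"Conversation so far:\n{transcript}\n\n"
                f"Reply A:\n{reply_a}\n\nReply B:\n{reply_b}\n\n"
                "Answer with exactly one letter, A or B, for the better reply."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip()[:1]


def score_challenger(conversations, challenger_model):
    """Fraction of test cases where the challenger beats the production reply."""
    wins = 0
    for conversation in conversations:
        prompt_messages, incumbent_reply = build_test_case(conversation)
        challenger_reply = client.chat.completions.create(
            model=challenger_model, messages=prompt_messages
        ).choices[0].message.content
        # Incumbent is shown as "A", challenger as "B"; a "B" verdict is a win.
        if judge_pair(prompt_messages, incumbent_reply, challenger_reply) == "B":
            wins += 1
    return wins / len(conversations)
```

Ranking several challenger models by their `score_challenger` win rate against the production model yields the kind of internal leaderboard the article describes, with no ground-truth labels required.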