The article describes a method for quickly evaluating and ranking AI models on production data, focusing on customer service conversations. It introduces Eval Protocol, an open-source toolkit for building an internal model leaderboard in minutes without requiring ground-truth labels. The process deconstructs logged conversations into test cases, generates new responses with challenger models, and uses a large language model (LLM) as an impartial judge to run pairwise comparisons against the original production responses. The method is validated by correlating its rankings with the Tau Bench Airline benchmark, showing that it correctly identifies the best- and worst-performing models. By leveraging production data, this approach offers a fast, reliable, and cost-effective way to choose the right AI model for a specific use case.
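
To make the workflow concrete, here is a minimal sketch of the three steps the summary names: splitting a logged conversation into a test case, generating a challenger response, and asking an LLM judge for a pairwise verdict. This is not Eval Protocol's actual API; the helper names (`build_test_case`, `judge_pair`, `score_challenger`), the judge prompt, and the model names are illustrative assumptions, and only the OpenAI chat-completions calls reflect a real client library.

```python
# Hypothetical sketch of a production-data pairwise evaluation loop.
# Helper names and prompt wording are assumptions, not Eval Protocol's API.

from openai import OpenAI

client = OpenAI()


def build_test_case(conversation: list[dict]) -> tuple[list[dict], str]:
    """Split a logged conversation into (prompt_messages, production_reply).

    The final assistant turn becomes the incumbent's answer; everything
    before it is replayed to the challenger model.
    """
    assert conversation[-1]["role"] == "assistant"
    return conversation[:-1], conversation[-1]["content"]


def judge_pair(prompt_messages, reply_a, reply_b, judge_model="gpt-4o"):
    """Ask an LLM judge which reply better serves the customer; returns 'A' or 'B'.

    Running a second call with A and B swapped (not shown) is a common
    way to control for position bias in pairwise judging.
    """
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in prompt_messages)
    verdict = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": (
                "You are an impartial judge of customer-service replies.\n"
                f"Conversation so far:\n{transcript}\n\n"
                f"Reply A:\n{reply_a}\n\nReply B:\n{reply_b}\n\n"
                "Answer with exactly one letter, A or B, for the better reply."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip()[:1]


def score_challenger(conversations, challenger_model):
    """Fraction of test cases where the challenger beats the production reply."""
    wins = 0
    for conversation in conversations:
        prompt_messages, incumbent_reply = build_test_case(conversation)
        challenger_reply = client.chat.completions.create(
            model=challenger_model, messages=prompt_messages
        ).choices[0].message.content
        # Incumbent is shown as "A", challenger as "B"; a "B" verdict is a win.
        if judge_pair(prompt_messages, incumbent_reply, challenger_reply) == "B":
            wins += 1
    return wins / len(conversations)
```

Ranking several challenger models by their `score_challenger` win rate against the production model yields the kind of internal leaderboard the article describes, with no ground-truth labels required.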