Company: Cohere
Date Published:
Author: Research
Word count: 2989
Language: English
Hacker News points: None

Summary

The document discusses the challenges and limitations of using the Elo rating system to evaluate large language models (LLMs), emphasizing its rating volatility, its dependence on the order in which comparisons are processed, and the non-transitivity that arises when model quality is inherently multidimensional. Elo, designed for chess, runs into difficulty in LLM evaluation because language tasks are subjective, culturally dependent, and multidimensional. The piece argues that traditional Elo-based leaderboards are insufficient for ranking LLMs, as averaging performance across diverse tasks obscures nuanced model capabilities. To address these issues, Cohere Labs proposes a more robust evaluation system that combines offline pseudo-pairwise comparisons with the Bradley-Terry model, yielding stable and interpretable rankings. The text also critiques open evaluation platforms for their susceptibility to strategic manipulation and suggests improvements for transparency and fairness in leaderboard design, such as integrating additional metrics and ensuring balanced representation across languages and tasks. The article concludes by inviting engagement through Cohere's Catalyst Research Grants and emphasizing the importance of a collaborative approach to developing better LLM evaluation systems.
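The core technical contrast in the article is between Elo's sequential, order-dependent updates and an offline Bradley-Terry fit over aggregate win counts. The sketch below illustrates that contrast only; it is not Cohere's implementation, and the function names, the K-factor of 32, the starting rating of 1000, the toy match data, and the use of the standard MM (Zermelo) iteration are all illustrative assumptions. It runs the same set of pairwise outcomes through Elo in two different orders, then fits Bradley-Terry strengths from the aggregated counts, which cannot depend on order.

```python
import math
from collections import defaultdict

def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo update for one comparison (score_a: 1 win, 0 loss, 0.5 tie)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

def run_elo(matches, k=32):
    """Apply Elo updates sequentially; the final ratings depend on match order."""
    ratings = defaultdict(lambda: 1000.0)  # assumed starting rating
    for a, b, score_a in matches:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a, k)
    return dict(ratings)

def fit_bradley_terry(matches, iters=200):
    """Fit Bradley-Terry strengths offline via the MM iteration.
    Only aggregate wins and pair counts are used, so the fit is order-independent."""
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # games played per unordered pair
    models = set()
    for a, b, score_a in matches:
        models.update((a, b))
        wins[a] += score_a
        wins[b] += 1.0 - score_a
        games[frozenset((a, b))] += 1.0
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(games[frozenset((i, j))] / (p[i] + p[j])
                        for j in models
                        if j != i and games[frozenset((i, j))] > 0)
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        norm = sum(new_p.values())
        p = {m: v / norm for m, v in new_p.items()}
    # Map strengths onto an Elo-like scale (400 * log10 of the strength ratio).
    return {m: 400 * math.log10(v / min(p.values())) for m, v in p.items()}

# Same comparisons in two orders: Elo disagrees with itself, Bradley-Terry does not.
matches = [("A", "B", 1), ("B", "C", 1), ("C", "A", 1),
           ("A", "B", 1), ("A", "C", 1), ("B", "C", 0)]
print(run_elo(matches))
print(run_elo(list(reversed(matches))))
print(fit_bradley_terry(matches))
print(fit_bradley_terry(list(reversed(matches))))
```

The 400·log10 rescaling at the end is purely cosmetic, to make the fitted strengths readable as Elo-style point gaps; the substantive difference is that the Bradley-Terry fit consumes aggregate counts, so shuffling the comparison order cannot change the ranking, whereas the sequential Elo trace can.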