Company
Date Published
Author
Sri Chavali
Word count
1935
Language
English
Hacker News points
None

Summary

In this exploration of evaluation methods for large language models (LLMs), the study compares binary and score-based evaluations. Numeric scoring offers granular detail but suffers from instability: scores often collapse into broad bands or plateaus, particularly for spelling and structural errors. The 2025 tests, run with newer models such as GPT-5-nano, showed some improvement in consistency over previous years but confirmed that numeric scales still lack the reliability needed for nuanced judgments. Binary and multi-categorical rubrics, such as letter grades, produce more stable and reproducible results that align better with human annotations, though they sacrifice some sensitivity to finer distinctions. The research underscores a trade-off between stability and resolution in LLM evaluations: binary and categorical approaches are more consistent, while numeric scores require tightly controlled conditions to be effective.
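
The stability-versus-resolution trade-off can be made concrete with a small harness that reruns a judge on the same answer and measures agreement across trials. The following is a minimal Python sketch, not the study's own code: the judge functions (binary_judge, letter_grade_judge, numeric_judge) are hypothetical deterministic stubs standing in for real LLM API calls, and modal agreement is one simple stability measure, not necessarily the metric the study used.

import statistics
from collections import Counter

# Stand-ins for LLM judge calls (hypothetical stubs; swap in real API calls).
def binary_judge(answer: str) -> str:
    """Pass/fail verdict on a candidate answer."""
    return "pass" if "paris" in answer.lower() else "fail"

def letter_grade_judge(answer: str) -> str:
    """Multi-categorical verdict (a letter grade) on a candidate answer."""
    return "A" if "paris" in answer.lower() else "D"

def numeric_judge(answer: str) -> int:
    """1-10 quality score for a candidate answer."""
    return 8 if "paris" in answer.lower() else 3

def modal_agreement(labels):
    """Fraction of repeated judgments matching the most common label:
    a simple stability measure that works for any rubric."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

answer = "The capital of France is Paris."
trials = 10  # a real judge is sampled repeatedly to expose run-to-run variance

for name, judge in [("binary", binary_judge),
                    ("letter", letter_grade_judge),
                    ("numeric", numeric_judge)]:
    runs = [judge(answer) for _ in range(trials)]
    print(f"{name:7s} stability: {modal_agreement(runs):.2f}")

# For numeric scores, spread matters too: a wide standard deviation across
# reruns is exactly the instability the study describes.
numeric_runs = [numeric_judge(answer) for _ in range(trials)]
print("numeric score std-dev:", statistics.pstdev(numeric_runs))

With real (stochastic) judge calls, the binary and letter-grade rubrics would typically show higher modal agreement than the 1-10 scale, at the cost of the finer distinctions a numeric score can express.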