Company
Date Published
Author
Sri Chavali
Word count
1935
Language
English
Hacker News points
None

Summary

In this exploration of evaluation methods for large language models (LLMs), the study compares binary and score-based evaluations. Numeric scoring offers granular detail but suffers from instability: scores often collapse into broad bands or plateaus, particularly for spelling and structural errors. The 2025 tests, run with newer models such as GPT-5-nano, showed some improvement in consistency over previous years but confirmed that numeric scales still lack the reliability needed for nuanced judgments. Binary and multi-categorical rubrics, such as letter grades, produce more stable and reproducible results that align better with human annotations, though they sacrifice some sensitivity to finer distinctions. The research underscores a trade-off between stability and resolution in LLM evaluations: binary and categorical approaches are more consistent, while numeric scores require tightly controlled conditions to be effective.
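
The stability-versus-resolution trade-off can be made concrete with a small harness that reruns a judge on the same answer and measures agreement across trials. The following is a minimal Python sketch, not the study's own code: the judge functions (binary_judge, letter_grade_judge, numeric_judge) are hypothetical deterministic stubs standing in for real LLM API calls, and modal agreement is one simple stability measure, not necessarily the metric the study used.

import statistics
from collections import Counter

# Stand-ins for LLM judge calls (hypothetical stubs; swap in real API calls).
def binary_judge(answer: str) -> str:
    """Pass/fail verdict on a candidate answer."""
    return "pass" if "paris" in answer.lower() else "fail"

def letter_grade_judge(answer: str) -> str:
    """Multi-categorical verdict (a letter grade) on a candidate answer."""
    return "A" if "paris" in answer.lower() else "D"

def numeric_judge(answer: str) -> int:
    """1-10 quality score for a candidate answer."""
    return 8 if "paris" in answer.lower() else 3

def modal_agreement(labels):
    """Fraction of repeated judgments matching the most common label:
    a simple stability measure that works for any rubric."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

answer = "The capital of France is Paris."
trials = 10  # a real judge is sampled repeatedly to expose run-to-run variance

for name, judge in [("binary", binary_judge),
                    ("letter", letter_grade_judge),
                    ("numeric", numeric_judge)]:
    runs = [judge(answer) for _ in range(trials)]
    print(f"{name:7s} stability: {modal_agreement(runs):.2f}")

# For numeric scores, spread matters too: a wide standard deviation across
# reruns is exactly the instability the study describes.
numeric_runs = [numeric_judge(answer) for _ in range(trials)]
print("numeric score std-dev:", statistics.pstdev(numeric_runs))

With real (stochastic) judge calls, the binary and letter-grade rubrics would typically show higher modal agreement than the 1-10 scale, at the cost of the finer distinctions a numeric score can express.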