Company
Date Published
Author
Sri Chavali, Elizabeth Hutton, Aparna Dhinakaran
Word count
1364
Language
English
Hacker News points
None

Summary

When large language models (LLMs) are used as evaluators, two design choices strongly influence the quality and transparency of their judgments: whether to require explanations and whether to use chain-of-thought (CoT) prompting. Explanations improve alignment with human judgments by reducing variance, exposing the factors behind each decision, and producing reusable data for retraining or improving models; whether the explanation comes before or after the label has little effect on accuracy, though it does affect how clearly the reasoning reads. CoT prompting, while widely adopted, shows mixed effectiveness: it helps most on tasks that require multi-step reasoning but adds complexity and cost on simpler ones. Modern reasoning models, which deliberate internally, often outperform base models and make explicit CoT prompting less necessary, though they trade this for higher latency and cost. The recommendation is therefore to include explanations in the evaluator's output so that decisions can be audited and the evaluation setup refined, with careful attention to prompt design, score definitions, and bias mitigation to keep evaluations reliable.
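To make the explanation-plus-label pattern concrete, below is a minimal sketch of an LLM-as-judge call that asks for a label first and an explanation after it. It assumes the OpenAI Python SDK; the model name, prompt wording, task, and score definitions are all illustrative, not the article's exact setup.

```python
# A minimal LLM-as-judge sketch: request a label plus an explanation so the
# judgment can be audited later. Assumes the OpenAI Python SDK; the model,
# prompt wording, and score definitions are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating whether a response is relevant to a question.

Question: {question}
Response: {response}

Score definitions:
- "relevant": the response directly addresses the question.
- "irrelevant": the response does not address the question.

First output exactly one label ("relevant" or "irrelevant") on its own line,
then a brief explanation of the decision factors behind your label."""


def judge(question: str, response: str) -> tuple[str, str]:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        temperature=0,        # reduce variance across repeated judgments
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    question=question, response=response
                ),
            }
        ],
    )
    text = completion.choices[0].message.content.strip()
    # Label on the first line, explanation on the lines after it.
    label, _, explanation = text.partition("\n")
    return label.strip(), explanation.strip()


label, explanation = judge(
    "What is chain-of-thought prompting?",
    "It asks the model to reason step by step before answering.",
)
print(label)        # e.g. "relevant"
print(explanation)  # auditable reasoning to log alongside the label
```

Putting the label first keeps parsing simple while still capturing the explanation; per the summary, the ordering has little effect on accuracy, so swapping the two is mainly a question of how clearly the reasoning reads and how you want to parse the output.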