All that glitters: When “gold-like” answers mask functional failures on coding agent benchmarks
Blog post from AI21 Labs
While experimenting with the LLM Judge component of the Maestro framework, researchers found that the judge model strongly preferred solutions resembling the benchmark's "gold" answers (minimal, clean, and tightly scoped) over solutions that were functionally correct but less polished. The bias surfaced when evaluating coding agents on the SWE-bench benchmark and was initially suspected to stem from data contamination, that is, the judge having seen SWE-bench's reference patches during training. It persisted, however, on a newer dataset the model could not have seen, pointing to a learned stylistic preference rather than memorization.

The study shows that an LLM judge can reward stylistic traits reminiscent of gold solutions even when those traits diverge from the actual success criterion, functional correctness. To counter this, the researchers wrote more detailed prompt guidelines that explicitly rank correctness and completeness above minimality and style, and these guidelines mitigated the bias. The investigation underscores the importance of weighing both functional and non-functional qualities when evaluating coding agents, and it suggests that strong benchmark scores may reflect a learned preference for certain stylistic characteristics rather than genuine understanding.
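The post does not publish the exact prompt, but the fix described amounts to giving the judge an explicit priority ordering that demotes minimality and style to tiebreakers. Below is a minimal sketch of that idea; the rubric wording and the `build_judge_prompt` helper are illustrative assumptions, not the actual Maestro prompt or API.

```python
# Hypothetical sketch of a rubric-first judge prompt. The rubric text and
# function name are assumptions for illustration, not the Maestro internals.

JUDGE_RUBRIC = """\
You are evaluating a candidate patch for a software issue.
Score it on the following criteria, in strict priority order:

1. Functional correctness: does the patch resolve the issue as described?
   This criterion dominates all others.
2. Completeness: does it handle the edge cases the issue implies?
3. Safety: does it avoid breaking unrelated behavior?
4. Style and minimality: consider ONLY as a tiebreaker between patches
   that are equally correct and complete. Never prefer a shorter or
   cleaner patch over a more correct one.

Return a JSON object: {"correct": bool, "complete": bool, "rationale": str}
"""

def build_judge_prompt(issue: str, candidate_patch: str) -> str:
    """Assemble the full prompt sent to the LLM judge."""
    return (
        f"{JUDGE_RUBRIC}\n"
        f"--- Issue ---\n{issue}\n\n"
        f"--- Candidate patch ---\n{candidate_patch}\n"
    )

if __name__ == "__main__":
    # Example: a correct but unglamorous guard clause that a gold-biased
    # judge might penalize for not being "minimal and clean".
    print(build_judge_prompt(
        issue="Division by zero when the input list is empty.",
        candidate_patch="if not xs: return 0.0  # guard before dividing",
    ))
```

The key design choice, matching the post's finding, is that the rubric states the ranking outright rather than listing criteria as peers, so the judge has no room to trade correctness away for polish.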