All that glitters: When “gold-like” answers mask functional failures on coding agent benchmarks
Blog post from AI21 Labs
While experimenting with the LLM Judge component of the Maestro framework, researchers found that the judge model strongly preferred solutions resembling the benchmark's "gold" answers (minimal, clean, and tightly scoped) over solutions that were functionally correct but less polished. The bias surfaced when evaluating coding agents on the SWE-bench benchmark and was initially suspected to stem from data contamination, that is, the judge having seen SWE-bench's reference patches during training. It persisted, however, on a newer dataset the model could not have seen, pointing to a learned stylistic preference rather than memorization.

The study shows that an LLM judge can reward stylistic traits reminiscent of gold solutions even when those traits diverge from the actual success criterion, functional correctness. To counter this, the researchers wrote more detailed prompt guidelines that explicitly rank correctness and completeness above minimality and style, and these guidelines mitigated the bias. The investigation underscores the importance of weighing both functional and non-functional qualities when evaluating coding agents, and it suggests that strong benchmark scores may reflect a learned preference for certain stylistic characteristics rather than genuine understanding.
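The post does not publish the exact prompt, but the fix described amounts to giving the judge an explicit priority ordering that demotes minimality and style to tiebreakers. Below is a minimal sketch of that idea; the rubric wording and the `build_judge_prompt` helper are illustrative assumptions, not the actual Maestro prompt or API.

```python
# Hypothetical sketch of a rubric-first judge prompt. The rubric text and
# function name are assumptions for illustration, not the Maestro internals.

JUDGE_RUBRIC = """\
You are evaluating a candidate patch for a software issue.
Score it on the following criteria, in strict priority order:

1. Functional correctness: does the patch resolve the issue as described?
   This criterion dominates all others.
2. Completeness: does it handle the edge cases the issue implies?
3. Safety: does it avoid breaking unrelated behavior?
4. Style and minimality: consider ONLY as a tiebreaker between patches
   that are equally correct and complete. Never prefer a shorter or
   cleaner patch over a more correct one.

Return a JSON object: {"correct": bool, "complete": bool, "rationale": str}
"""

def build_judge_prompt(issue: str, candidate_patch: str) -> str:
    """Assemble the full prompt sent to the LLM judge."""
    return (
        f"{JUDGE_RUBRIC}\n"
        f"--- Issue ---\n{issue}\n\n"
        f"--- Candidate patch ---\n{candidate_patch}\n"
    )

if __name__ == "__main__":
    # Example: a correct but unglamorous guard clause that a gold-biased
    # judge might penalize for not being "minimal and clean".
    print(build_judge_prompt(
        issue="Division by zero when the input list is empty.",
        candidate_patch="if not xs: return 0.0  # guard before dividing",
    ))
```

The key design choice, matching the post's finding, is that the rubric states the ranking outright rather than listing criteria as peers, so the judge has no room to trade correctness away for polish.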