
All that glitters: When “gold-like” answers mask functional failures on coding agent benchmarks

Blog post from AI21 Labs

Post Details
Company
AI21 Labs
Author
Oded Avraham, Algorithm Developer
Word Count
3,464
Language
English
Summary

In an experiment with the LLM Judge component of the Maestro framework, researchers discovered that the model strongly preferred solutions resembling "gold" answers (minimal, clean, and focused) over solutions that were functionally correct but less polished. The bias surfaced when evaluating coding agents on the SWE-bench benchmark; it was initially suspected to stem from data contamination, but it persisted even on a newer, uncontaminated dataset. The study showed that an LLM judge can favor stylistic traits reminiscent of "gold" solutions even when those traits are misaligned with the actual success criterion of functional correctness. To counter this, the researchers developed more detailed prompt guidelines that explicitly prioritize correctness and completeness over minimality and style, which successfully mitigated the bias. The investigation underscores the importance of weighing both functional and non-functional qualities when evaluating coding agents, and suggests that high benchmark performance may reflect a learned preference for certain stylistic characteristics rather than true understanding.
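To make the mitigation concrete, here is a minimal sketch of what judge prompt guidelines that rank functional correctness above stylistic minimality might look like. All names, weights, and wording below are illustrative assumptions, not AI21's actual Maestro implementation or the prompt from the post.

```python
# Hypothetical judge prompt guidelines that weight functional correctness
# above stylistic minimality, in the spirit of the fix described in the post.
# Weights and criteria are illustrative assumptions, not AI21's actual prompt.

JUDGE_GUIDELINES = """\
Evaluate the candidate patch against the issue description.
Score each criterion, in this priority order:
1. Correctness: does the patch resolve the reported bug? (weight 0.5)
2. Completeness: are all affected code paths handled? (weight 0.3)
3. Safety: does the patch avoid breaking existing behavior? (weight 0.15)
4. Style/minimality: is the change clean and focused? (weight 0.05)
Do NOT reward resemblance to a reference ("gold") patch; judge only
whether the patch satisfies the issue's functional requirements.
"""


def build_judge_prompt(issue: str, patch: str) -> str:
    """Assemble the full prompt sent to the LLM judge (hypothetical helper)."""
    return (
        JUDGE_GUIDELINES
        + "\n--- Issue ---\n" + issue
        + "\n--- Candidate patch ---\n" + patch
    )
```

The key design choice, per the post's finding, is stating the priority order explicitly: without it, the judge may silently substitute "looks like a gold patch" for "actually fixes the bug".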