
SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations

Blog post from Surge AI

Post Details
Company: Surge AI
Date Published: -
Author: Logan Ritchie
Word Count: 3,790
Language: English
Hacker News Points: -
Summary

This case study examines how three advanced coding models (Gemini 2.5 Pro, Claude Sonnet 4, and GPT-5) tackle a single software engineering problem from SWE-bench, a benchmark that evaluates coding agents by having them fix real GitHub issues using only shell commands. The study traces how each model handles missing information and errors: Gemini 2.5 Pro spirals into a loop of compounding hallucinations, Claude Sonnet 4 recovers after initial missteps, and GPT-5 completes the task without hallucinating. The central takeaway is that recognizing missing information and verifying assumptions is what separates robust reasoning from brittle performance; a model that fails to do either compounds its errors into flawed solutions. The findings point to both the challenges and the opportunities in building more reliable autonomous coding systems that can manage uncertainty and adapt to real-world complexity.
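The agent setup the summary describes, a model that iteratively issues shell commands against a repository until it believes the issue is fixed, can be sketched in a few lines. This is a hypothetical minimal loop for illustration only, not Surge AI's or SWE-bench's actual harness; the `propose_command` callback stands in for the model call, and the scripted stub below replaces a real LLM.

```python
import subprocess

def run_agent(propose_command, max_steps=10):
    """Minimal SWE-bench-style loop: the agent sees the transcript of
    prior commands and their output, then proposes the next shell
    command, until it signals completion by returning None."""
    transcript = []
    for _ in range(max_steps):
        cmd = propose_command(transcript)  # a real harness would call the model here
        if cmd is None:  # agent declares the fix complete
            break
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        transcript.append((cmd, result.stdout + result.stderr))
    return transcript

def scripted_agent(transcript):
    """Stand-in 'model': explore the repo, apply a fix, then stop.
    A real agent would decide each step from the transcript."""
    plan = ["ls", "echo patched"]
    return plan[len(transcript)] if len(transcript) < len(plan) else None
```

The transcript-in, command-out shape is what makes the failure modes in the post possible: a model that hallucinates file contents instead of reading them via a command feeds its own fabrications back into the next step.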