
SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations

Blog post from Surge AI

Post Details
Company: Surge AI
Date Published: -
Author: Logan Ritchie
Word Count: 3,790
Language: English
Hacker News Points: -
Summary

This case study examines how three advanced coding models (Gemini 2.5 Pro, Claude Sonnet 4, and GPT-5) tackle a single software engineering problem from SWE-bench, a benchmark that evaluates coding agents by having them fix real GitHub issues using only shell commands. The study traces how each model handles missing information and errors: Gemini 2.5 Pro spirals into a loop of compounding hallucinations, Claude Sonnet 4 recovers after initial missteps, and GPT-5 completes the task without hallucinating. The central takeaway is that recognizing missing information and verifying assumptions is what separates robust reasoning from brittle performance; a model that fails to do either compounds its errors into flawed solutions. The findings point to both the challenges and the opportunities in building more reliable autonomous coding systems that can manage uncertainty and adapt to real-world complexity.
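The agent setup the summary describes, a model that iteratively issues shell commands against a repository until it believes the issue is fixed, can be sketched in a few lines. This is a hypothetical minimal loop for illustration only, not Surge AI's or SWE-bench's actual harness; the `propose_command` callback stands in for the model call, and the scripted stub below replaces a real LLM.

```python
import subprocess

def run_agent(propose_command, max_steps=10):
    """Minimal SWE-bench-style loop: the agent sees the transcript of
    prior commands and their output, then proposes the next shell
    command, until it signals completion by returning None."""
    transcript = []
    for _ in range(max_steps):
        cmd = propose_command(transcript)  # a real harness would call the model here
        if cmd is None:  # agent declares the fix complete
            break
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        transcript.append((cmd, result.stdout + result.stderr))
    return transcript

def scripted_agent(transcript):
    """Stand-in 'model': explore the repo, apply a fix, then stop.
    A real agent would decide each step from the transcript."""
    plan = ["ls", "echo patched"]
    return plan[len(transcript)] if len(transcript) < len(plan) else None
```

The transcript-in, command-out shape is what makes the failure modes in the post possible: a model that hallucinates file contents instead of reading them via a command feeds its own fabrications back into the next step.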