Legacy-Bench: Can AI Agents Maintain the World's Most Critical Software?

Blog post from Factory

Post Details
Author: Leo Tchourakov, Abhay Singhal, Eno Reyes
Word Count: 1,882
Language: English
Summary

Legacy-Bench is a new benchmark that evaluates how well AI agents handle legacy software engineering tasks across six language families, including COBOL, Fortran, and Java 7. These languages are foundational, yet they are becoming harder to maintain as the engineers who know them retire and as complex business rules remain embedded in the code. The benchmark's tasks cover fixing bugs, implementing new functionality, and migrating code, reflecting real-world work on critical infrastructure. Results vary significantly across AI models. Agents do best at bug fixing in languages such as Java 7, where errors are visible, and struggle with COBOL, where errors are silent and data formats demand exact precision. Agents also find reading legacy code and writing new code in it considerably harder, and migration success depends heavily on the target language. No single model consistently outperforms the others across all tasks. Each shows different strengths and weaknesses, which underscores the need for systematic verification and iteration in legacy environments. As models gain better training on legacy languages and stronger self-verification, the performance gap between legacy and modern benchmarks is expected to narrow. The findings offer insights for teams modernizing legacy systems.
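
The contrast between silent and visible errors mentioned in the summary can be illustrated with a small sketch. It is not taken from the benchmark. It emulates, in Python, a hypothetical unsigned fixed-width decimal field with three integer digits and two fractional digits, roughly in the spirit of a COBOL `PIC 9(3)V99` field. The function names `store_fixed` and `store_checked` and the field layout are assumptions made for illustration, and values are assumed to be non-negative.

```python
from decimal import Decimal, ROUND_DOWN


def store_fixed(value: Decimal, int_digits: int, frac_digits: int) -> Decimal:
    """Emulate storing a value in a hypothetical unsigned fixed-width field.

    Fractional digits that do not fit are dropped, and high-order digits
    that do not fit are discarded. No error is raised, which is the
    "silent" failure mode.
    """
    scale = Decimal(10) ** -frac_digits      # e.g. 0.01 for two decimals
    truncated = value.quantize(scale, rounding=ROUND_DOWN)
    modulus = Decimal(10) ** int_digits      # e.g. 1000 for three integer digits
    return truncated % modulus


def store_checked(value: Decimal, int_digits: int, frac_digits: int) -> Decimal:
    """Same storage rule, but fail loudly if any digit was lost."""
    stored = store_fixed(value, int_digits, frac_digits)
    if stored != value:
        raise ValueError(f"{value} does not fit the field; it would be stored as {stored}")
    return stored


amount = Decimal("1234.567")
print(store_fixed(amount, 3, 2))   # wrong value, but no error is reported

try:
    store_checked(amount, 3, 2)
except ValueError as exc:
    print("visible error:", exc)
```

In the first call the program keeps running with a corrupted amount, so nothing signals that a fix is needed. The checked variant makes the same problem visible, which is one simple form of the systematic verification the summary calls for.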