Riza, a company focused on safely running untrusted code, examined how OpenAI's GPT-4o performs on the HumanEval benchmark, a suite of 164 Python programming problems. The model achieved a Pass@10 rate of 97.0%, solving 159 of the 164 problems within ten attempts, and a Pass@1 rate between 88% and 94% across individual samples. On five problems, however, the model failed all ten attempts, exposing recurring logic errors and misreadings of problem requirements. The analysis walks through these failure cases, including incorrect handling of sentence delimiters, mismanagement of nested structures, and incorrect ordering by digit sum. The goal is to identify and rectify the errors, with the prospect of improvement by feeding the failure context back into the model in future iterations.
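For context, Pass@k numbers like those quoted above are conventionally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021). Whether Riza used this exact estimator or the simpler "did any of the k samples pass" check is an assumption here; with n = k = 10 samples per problem the two coincide. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k), the probability that at least one
    of k samples drawn from n total is correct, given c correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must include at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: benchmark-level pass@10 from hypothetical per-problem
# counts matching the article's figures (159 problems with at least
# one passing sample, 5 problems with none).
per_problem = [pass_at_k(n=10, c=10, k=10)] * 159 + [pass_at_k(10, 0, 10)] * 5
print(f"pass@10 ≈ {sum(per_problem) / len(per_problem):.1%}")  # ≈ 97.0%
```

With ten samples per problem, a problem whose samples all fail contributes 0.0 to Pass@10, so five such problems out of 164 yield roughly 159/164 ≈ 97.0%, consistent with the figure reported above.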