Finding vulnerabilities in modern web apps using Claude Code and OpenAI Codex
Blog post from Semgrep
An evaluation of two AI coding agents, Anthropic's Claude Code and OpenAI Codex, found that they can identify vulnerabilities in real-world Python web applications, but with significant limitations. Across 11 large open-source projects, Claude Code identified 46 vulnerabilities with a true positive rate (TPR) of 14%, while Codex found 21 vulnerabilities with an 18% TPR, so both agents produced a high proportion of false positives. The agents were proficient at detecting certain vulnerability classes, such as Insecure Direct Object References (IDOR), but struggled with more complex issues such as SQL Injection and Cross-Site Scripting (XSS), which require tracing data flows across multiple files and functions. The agents are also non-deterministic: repeated analyses of the same code yield inconsistent results, which makes comprehensive vulnerability detection hard to guarantee. Despite these challenges, the research suggests that AI tools can complement traditional security practices by providing contextual insight, and that combining AI-driven analysis with traditional static analysis could make security tooling more effective.
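For readers unfamiliar with IDOR, the following is a minimal sketch of the pattern in plain Python. It is not taken from the research or from any of the evaluated projects: the `Invoice` type, the `INVOICES` store, and both handler functions are hypothetical names invented for illustration. The flaw and its fix both sit in a single function, which may help explain why this class is easier to spot than issues that depend on cross-file data flow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Invoice:
    invoice_id: int
    owner_id: int
    total_cents: int


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(invoice_id=1, owner_id=100, total_cents=2500),
    2: Invoice(invoice_id=2, owner_id=200, total_cents=9900),
}


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Optional[Invoice]:
    """IDOR: trusts the client-supplied ID and never checks ownership."""
    return INVOICES.get(invoice_id)


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Optional[Invoice]:
    """Returns the invoice only if it belongs to the requesting user."""
    invoice = INVOICES.get(invoice_id)
    if invoice is None or invoice.owner_id != current_user_id:
        # Same response for "missing" and "not yours", so IDs cannot be probed.
        return None
    return invoice
```

In this sketch, user 100 requesting invoice 2 receives another user's record from `get_invoice_vulnerable`, while `get_invoice_checked` returns `None` for the same request.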