Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?
Blog post from Snyk
A study of 300 vulnerability-finding scans using Snyk VulnBench JS 1.0 finds that security reviews of JavaScript code by agentic large language models (LLMs) vary in repeatability. LLMs can identify familiar exploit shapes, some of which align with Snyk Code findings, but their reports of vulnerabilities outside the reference set were inconsistent. Across repeated runs, reference-matched findings were more stable, whereas unique unmatched findings often varied, with nearly half appearing in only one of five runs. The benchmark suggests a complementary approach that combines LLMs with SAST tools like Snyk Code: the two uncover different security gaps and have different failure modes, and deterministic SAST offers more reliable coverage of systematic vulnerabilities. The cost and performance analysis indicated that more expensive LLM configurations did not necessarily detect more vulnerabilities, which underscores the need for efficient, combined security workflows. Future benchmarks aim to incorporate broader application structures and independent ground-truth sources to refine the understanding of LLM and SAST capabilities in security reviews.
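To make the repeatability measure concrete, here is a minimal sketch of how one might count how many of five runs report each unique finding, and what share of findings show up in only one run. It is not the benchmark's actual methodology: the function names, the `(file, CWE)` finding key, and the sample data are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def run_frequency(runs):
    """Count, for each unique finding, how many runs reported it."""
    counts = Counter()
    for findings in runs:
        counts.update(set(findings))  # de-duplicate within a single run
    return counts


def single_run_share(runs):
    """Fraction of unique findings that appear in exactly one run."""
    counts = run_frequency(runs)
    if not counts:
        return 0.0
    singles = sum(1 for n in counts.values() if n == 1)
    return singles / len(counts)


# Hypothetical data: five runs over the same code, findings keyed by (file, CWE).
runs = [
    {("app.js", "CWE-79"), ("db.js", "CWE-89")},
    {("app.js", "CWE-79"), ("db.js", "CWE-89"), ("util.js", "CWE-22")},
    {("app.js", "CWE-79"), ("db.js", "CWE-89")},
    {("app.js", "CWE-79"), ("auth.js", "CWE-798")},
    {("app.js", "CWE-79"), ("db.js", "CWE-89")},
]

print(single_run_share(runs))  # share of unique findings seen in only one run
```

In this made-up sample, two of the four unique findings appear in a single run. A real analysis would also need a rule for deciding when two reports from different runs describe the same vulnerability, which this sketch sidesteps by assuming exact key matches.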