Content Deep Dive

We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks

Blog post from Semgrep

Post Details
Company: Semgrep
Date Published:
Author: Katie Paxton-Fear, Seth Jaksik, Brenden Noblitt, Erik Buchanan
Word Count: 2,117
Company Posts That Month: 10
Language: English
Hacker News Points: -
Summary

Semgrep ran several open-source models against its IDOR (insecure direct object reference) benchmark to measure vulnerability-detection performance, and the results were unexpected. Zhipu AI's GLM 5.2, an open-weight model, achieved a 39% F1 score, outperforming Claude Code's 32% at a significantly lower cost of approximately $0.17 per vulnerability detected. It still lagged behind Semgrep's multimodal pipeline, which uses a more sophisticated harness and achieved 53–61% F1. The main goal of the test was to determine how much of the performance is attributable to the model itself and how much to the harness, a critical question for security tasks that rely on AI. GLM 5.2 showed promise even without the endpoint discovery support available to the multimodal pipeline, which suggests that open-weight models are now a viable option for security research. The experiment underscores the importance of harness configuration and marks a step forward in the competitiveness of open-weight models, but a single strong result does not imply superiority across other tasks or datasets.
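
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how an F1 score and a cost per detected vulnerability could be computed from raw benchmark counts. The `RunResult` type, the counts, and the dollar figure are hypothetical and are not taken from Semgrep's benchmark; treating "cost per vulnerability detected" as total spend divided by true positives is also an assumption, since the summary does not state the exact definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome counts for one benchmark run (hypothetical structure)."""
    true_positives: int    # real vulnerabilities the model reported
    false_positives: int   # reported findings that were not real
    false_negatives: int   # real vulnerabilities the model missed
    total_cost_usd: float  # total spend for the run


def precision(r: RunResult) -> float:
    """Fraction of reported findings that were real."""
    reported = r.true_positives + r.false_positives
    return r.true_positives / reported if reported else 0.0


def recall(r: RunResult) -> float:
    """Fraction of real vulnerabilities that were reported."""
    actual = r.true_positives + r.false_negatives
    return r.true_positives / actual if actual else 0.0


def f1_score(r: RunResult) -> float:
    """Harmonic mean of precision and recall."""
    p, rc = precision(r), recall(r)
    return 2 * p * rc / (p + rc) if (p + rc) else 0.0


def cost_per_detection(r: RunResult) -> float:
    """Assumed definition: total spend divided by true positives."""
    if r.true_positives == 0:
        return float("inf")
    return r.total_cost_usd / r.true_positives


# Made-up counts, chosen only to illustrate the arithmetic.
example = RunResult(true_positives=30, false_positives=50,
                    false_negatives=70, total_cost_usd=6.0)

if __name__ == "__main__":
    print(f"precision = {precision(example):.3f}")
    print(f"recall    = {recall(example):.3f}")
    print(f"F1        = {f1_score(example):.3f}")
    print(f"cost/det  = ${cost_per_detection(example):.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a model that reports many findings but is often wrong cannot score well on volume alone.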

Trends Found in this Post
Trend      Post Mentions  Total Month Mentions  Posts  Companies  MoM
LLM        5              5,172                 1,006  220        -43%
AI Agents  1              4,874                 1,103  240        -1%