We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks
Blog post from Semgrep
Semgrep evaluated several open-source models against its IDOR (insecure direct object reference) benchmark to measure vulnerability-detection performance, and the results were unexpected. Zhipu AI's GLM 5.2, an open-weight model, achieved a 39% F1 score, outperforming Claude Code's 32% at a significantly lower cost of approximately $0.17 per vulnerability detected. It still lagged behind Semgrep's multimodal pipeline, which achieved 53–61% F1 with a more sophisticated harness.

The test primarily aimed to determine how much of the performance is attributable to the model itself and how much to the harness, a critical question for AI-assisted security tasks. GLM 5.2 showed promise even though it lacked the endpoint discovery support available to the multimodal pipeline, which suggests that open-weight models have become a viable option for security research. The experiment underscores the importance of harness configuration and marks a step forward in the competitiveness of open-weight models, but a single successful result does not imply superiority across other tasks or datasets.
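For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how an F1 score and a cost per detected vulnerability are commonly computed. It assumes the standard definition of F1 (the harmonic mean of precision and recall) and assumes that cost per vulnerability means total spend divided by true positives; the summary does not state how Semgrep's benchmark scores findings, so this may differ from their method. The names `DetectionCounts`, `f1_score`, and `cost_per_detection` are hypothetical, and the counts and dollar amount in the usage lines are made up for illustration, not benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Outcome counts for one model on a labeled benchmark (hypothetical type)."""
    true_positives: int   # real vulnerabilities the model reported
    false_positives: int  # reported findings that were not real vulnerabilities
    false_negatives: int  # real vulnerabilities the model missed


def precision(c: DetectionCounts) -> float:
    """Fraction of reported findings that were real."""
    reported = c.true_positives + c.false_positives
    return c.true_positives / reported if reported else 0.0


def recall(c: DetectionCounts) -> float:
    """Fraction of real vulnerabilities that were reported."""
    actual = c.true_positives + c.false_negatives
    return c.true_positives / actual if actual else 0.0


def f1_score(c: DetectionCounts) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def cost_per_detection(total_cost_usd: float, c: DetectionCounts) -> float:
    """Total spend divided by true positives (an assumed definition)."""
    if c.true_positives == 0:
        return float("inf")
    return total_cost_usd / c.true_positives


# Illustrative numbers only; these are NOT figures from the Semgrep benchmark.
example = DetectionCounts(true_positives=30, false_positives=50, false_negatives=70)
print(f"precision={precision(example):.3f}")
print(f"recall={recall(example):.3f}")
print(f"F1={f1_score(example):.3f}")
print(f"cost per detection=${cost_per_detection(6.00, example):.2f}")
```

Because F1 penalizes both missed vulnerabilities and false alarms, a harness that changes either one (for example, by supplying endpoint discovery) can move the score without any change to the underlying model, which is why the model-versus-harness question matters when comparing these numbers.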