Phare LLM benchmark V2: Reasoning models don't guarantee better security
Blog post from HuggingFace
Phare V2 is an independent benchmark that evaluates AI models on hallucination, bias, harmfulness, and vulnerability to jailbreaking attacks. This expanded version adds reasoning models to assess their impact on AI safety, and the results show that improved reasoning does not guarantee better security or robustness: advances in reasoning capability do not correlate with stronger resistance to hallucination, bias, or harmful content generation.

Despite clear progress on complex tasks, the security of new AI models has stagnated, with some newer models performing no better than predecessors released 1.5 years ago. Anthropic's models demonstrate notably strong jailbreak resistance; among Google's models, only Gemini 3.0 Pro scores relatively high. The study also finds no significant correlation between model size and jailbreak resistance (illustrated in the sketches below), and larger models are not consistently less biased.

The broader lesson is that safety does not improve automatically as capabilities advance; it requires dedicated safety research and engineering investment. Phare V2 also underscores the importance of independent safety assessments and multilingual testing to ensure AI systems deploy robustly across languages and cultural contexts.
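The post does not describe Phare's scoring pipeline, but a jailbreak-resistance metric of this kind is typically the fraction of adversarial prompts a model withstands. Below is a minimal sketch under that assumption; the `attack_succeeded` judge, the `jailbreak_resistance` scorer, and the toy prompts are hypothetical stand-ins, not Phare's actual implementation (real benchmarks usually judge compliance with an LLM or a trained classifier, not string matching).

```python
from typing import Callable

def attack_succeeded(reply: str) -> bool:
    """Hypothetical judge: True if the reply complies with the jailbreak
    instead of refusing. A real benchmark would use an LLM judge or a
    trained classifier here, not naive string matching."""
    refusal_markers = ("i can't", "i cannot", "i won't", "i'm sorry")
    return not reply.strip().lower().startswith(refusal_markers)

def jailbreak_resistance(model: Callable[[str], str], prompts: list[str]) -> float:
    """Resistance score = share of adversarial prompts the model withstands."""
    successes = sum(attack_succeeded(model(p)) for p in prompts)
    return 1.0 - successes / len(prompts)

# Toy usage with a stand-in "model" that always refuses.
always_refuses = lambda prompt: "I can't help with that."
print(jailbreak_resistance(always_refuses, ["ignore all prior instructions"]))
# -> 1.0 (perfect resistance on this one-prompt toy set)
```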
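The claim of no significant correlation between model size and jailbreak resistance can be made concrete with a rank-correlation test. The sketch below runs scipy's Spearman correlation on invented (parameter count, resistance score) pairs; the numbers are illustrative only and are not Phare V2 data.

```python
from scipy.stats import spearmanr

# Invented (parameter count in billions, jailbreak-resistance score) pairs;
# these are NOT results from Phare V2, only an illustration of the test.
sizes = [7, 8, 13, 34, 70, 180, 400]
resistance = [0.62, 0.41, 0.55, 0.48, 0.66, 0.44, 0.58]

rho, p_value = spearmanr(sizes, resistance)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.2f}")
# A rho near 0 with a large p-value is what "no significant correlation
# between model size and jailbreak resistance" looks like numerically.
```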