Company
Date Published
Author
Conor Bronsdon
Word count
1963
Language
English
Hacker News points
None

Summary

Humanity's Last Exam (HLE) is a comprehensive AI benchmark designed to evaluate the genuine reasoning capabilities of AI systems across a broad range of academic disciplines. Developed by the Center for AI Safety with input from numerous subject-matter experts, HLE consists of roughly 2,500 graduate-level questions that test an AI system's ability to reason rather than rely on pattern recognition or factual recall. The benchmark exposes a significant performance gap: AI models score below 30% while human experts achieve nearly 90%, underscoring AI's current limitations in complex reasoning and multi-modal analysis. HLE serves as a critical tool for understanding AI's capabilities and deficiencies, offering insights that guide the responsible deployment of AI in real-world applications. By focusing on rigorous evaluation methods, HLE helps differentiate memorization from true understanding, highlighting the need for ongoing development and for advanced testing frameworks like Galileo to ensure AI reliability and trustworthiness.
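
To make the scoring concrete, the sketch below shows one way an HLE-style accuracy figure could be computed over a set of question/answer pairs. It is a hypothetical illustration, not the official HLE evaluation harness: the Question class, query_model stub, and exact_match scorer are placeholder names introduced here for demonstration.

```python
# Minimal, hypothetical sketch of computing accuracy on an HLE-style question set.
# query_model() is a placeholder stub; exact_match() is a deliberately naive scorer.

from dataclasses import dataclass

@dataclass
class Question:
    prompt: str      # graduate-level question text
    reference: str   # expert-verified answer

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError("Wire this to your model or API client.")

def exact_match(prediction: str, reference: str) -> bool:
    """Naive scoring: case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(questions: list[Question]) -> float:
    """Return accuracy as the fraction of questions answered correctly."""
    correct = sum(
        exact_match(query_model(q.prompt), q.reference) for q in questions
    )
    return correct / len(questions)

# Example: on a 2,500-question benchmark, 700 correct answers gives
# 700 / 2500 = 0.28, i.e. an accuracy below the 30% mark cited above.
```

In practice a benchmark of this kind would use more forgiving scoring (multiple-choice matching or judge-based grading of free-form answers), but the reported headline number is still an accuracy of this form: correct answers divided by total questions.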