ProfBench is a new benchmark that tests large language models (LLMs) on complex, open-ended tasks requiring professional-grade knowledge in domains such as Finance, Chemistry, and Physics, evaluating whether AI can handle the kind of nuanced reasoning expected of PhD- and MBA-level professionals. Supported by the NVIDIA NeMo Evaluator SDK, ProfBench contains over 7,000 expert-written response-criterion pairs that assess models along three dimensions: data extraction, reasoning, and style. The benchmark underscores how far current models still have to go: even top performers such as GPT-5-High score significantly below human experts, particularly in domains like Physics. By providing a robust, rubric-based evaluation framework, ProfBench aims to advance AI systems that can tackle real-world professional challenges, serving as a critical tool for both the open-source community and enterprise users.
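
To make the rubric-based setup concrete, here is a minimal, hypothetical sketch of scoring a response against expert-written criteria: each criterion is judged independently, and the weighted fraction of satisfied criteria becomes the response's score. This is not the NeMo Evaluator SDK API; the `Criterion` fields, the `score_response` helper, and the toy substring judge are all illustrative assumptions, and in practice the judge would be an LLM-as-judge call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One expert-written requirement a response may or may not satisfy (illustrative)."""
    text: str            # the requirement, e.g. "states the 10% discount rate"
    dimension: str       # one of the three axes: "extraction", "reasoning", "style"
    weight: float = 1.0  # relative importance of this criterion

def score_response(response: str,
                   rubric: list[Criterion],
                   judge: Callable[[str, str], bool]) -> float:
    """Return the weighted fraction of rubric criteria the judge deems satisfied."""
    total = sum(c.weight for c in rubric)
    met = sum(c.weight for c in rubric if judge(response, c.text))
    return met / total if total else 0.0

if __name__ == "__main__":
    # Stand-in judge for demonstration only: a substring check. A real
    # evaluation would ask an LLM judge whether the criterion is met.
    toy_judge = lambda response, criterion: criterion.lower() in response.lower()

    rubric = [
        Criterion("discounted cash flow", dimension="reasoning"),
        Criterion("10% discount rate", dimension="extraction"),
    ]
    answer = "The valuation uses discounted cash flow with a 10% discount rate."
    print(f"rubric score: {score_response(answer, rubric, toy_judge):.2f}")
```

Judging each criterion in isolation is what makes this style of evaluation tractable for open-ended professional tasks: a binary yes/no per criterion is far easier for an LLM judge to decide reliably than a single holistic grade for a long free-form response.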