Content Deep Dive

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Blog post from HuggingFace

Post Details

Company: HuggingFace
Date Published: -
Author: Yichen Feng, Yuetai Li, Chunjiang Liu, Yue Huang, Zhengqing Yuan, Fengqing Jiang, Zichen Chen, and Zhangchen Xu
Word Count: 1,792
Language: -
Hacker News Points: -
Summary

The Visual Aesthetic Benchmark (VAB) evaluates whether frontier AI models can make nuanced aesthetic judgments comparable to those of human experts. It uses pairwise and set-based comparisons across fine art, photography, and illustration, grounded in over 13,000 expert assessments. Rather than relying on scalar ratings, VAB asks models to identify the best and worst images within a group, emphasizing context and expert consensus. The leading model, Claude Sonnet 4.6, scored 26.5%, well below the human expert baseline of 68.9%; models struggled most with illustration and proved sensitive to the order in which answer options were presented. Although newer model generations show some improvement, a notable gap persists between proprietary and open-weight models, and robustness generally degrades as candidate sets grow larger, underscoring how difficult human aesthetic judgment is to replicate.
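To make the set-based protocol concrete, here is a minimal scoring sketch. The `SetTrial` schema, the `predict` callable, and all identifiers are hypothetical illustrations, not VAB's actual harness; the sketch only assumes the rule described in the post, that a model must pick both the best and the worst image in a group, scored against expert consensus.

```python
from dataclasses import dataclass

@dataclass
class SetTrial:
    """One set-based comparison: candidate image IDs plus the
    expert-consensus best and worst picks (hypothetical schema)."""
    candidates: list
    expert_best: str
    expert_worst: str

def score_set_trials(trials, predict):
    """Score a model on set-based trials.

    `predict(candidates)` returns a (best, worst) pair of IDs.
    A trial counts as correct only if BOTH picks match the expert
    consensus, mirroring the best-and-worst requirement described
    in the post.
    """
    correct = 0
    for t in trials:
        best, worst = predict(t.candidates)
        if best == t.expert_best and worst == t.expert_worst:
            correct += 1
    return correct / len(trials) if trials else 0.0

# Toy usage with a dummy predictor that always picks first/last.
trials = [
    SetTrial(["a", "b", "c"], expert_best="a", expert_worst="c"),
    SetTrial(["d", "e", "f"], expert_best="e", expert_worst="d"),
]
acc = score_set_trials(trials, lambda c: (c[0], c[-1]))
print(acc)  # 0.5
```

Under this all-or-nothing rule a model gets no credit for a half-right answer, which is one reason accuracy can fall sharply as candidate sets grow. The post's order-sensitivity finding could be probed with the same harness by shuffling `candidates` and re-scoring.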