Content Deep Dive

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Blog post from HuggingFace

Post Details

Company: HuggingFace
Date Published: -
Author: Yichen Feng, Yuetai Li, Chunjiang Liu, Yue Huang, Zhengqing Yuan, Fengqing Jiang, Zichen Chen, and Zhangchen Xu
Word Count: 1,792
Language: -
Hacker News Points: -
Summary

The Visual Aesthetic Benchmark (VAB) evaluates whether frontier AI models can make nuanced aesthetic judgments comparable to those of human experts. It uses pairwise and set-based comparisons across fine art, photography, and illustration, grounded in over 13,000 expert assessments. Rather than relying on scalar ratings, VAB asks models to identify the best and worst images within a group, emphasizing context and expert consensus. The leading model, Claude Sonnet 4.6, scored 26.5%, well below the human expert baseline of 68.9%; models struggled most with illustration and proved sensitive to the order in which answer options were presented. Although newer model generations show some improvement, a notable gap persists between proprietary and open-weight models, and robustness generally degrades as candidate sets grow larger, underscoring how difficult human aesthetic judgment is to replicate.
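To make the set-based protocol concrete, here is a minimal scoring sketch. The `SetTrial` schema, the `predict` callable, and all identifiers are hypothetical illustrations, not VAB's actual harness; the sketch only assumes the rule described in the post, that a model must pick both the best and the worst image in a group, scored against expert consensus.

```python
from dataclasses import dataclass

@dataclass
class SetTrial:
    """One set-based comparison: candidate image IDs plus the
    expert-consensus best and worst picks (hypothetical schema)."""
    candidates: list
    expert_best: str
    expert_worst: str

def score_set_trials(trials, predict):
    """Score a model on set-based trials.

    `predict(candidates)` returns a (best, worst) pair of IDs.
    A trial counts as correct only if BOTH picks match the expert
    consensus, mirroring the best-and-worst requirement described
    in the post.
    """
    correct = 0
    for t in trials:
        best, worst = predict(t.candidates)
        if best == t.expert_best and worst == t.expert_worst:
            correct += 1
    return correct / len(trials) if trials else 0.0

# Toy usage with a dummy predictor that always picks first/last.
trials = [
    SetTrial(["a", "b", "c"], expert_best="a", expert_worst="c"),
    SetTrial(["d", "e", "f"], expert_best="e", expert_worst="d"),
]
acc = score_set_trials(trials, lambda c: (c[0], c[-1]))
print(acc)  # 0.5
```

Under this all-or-nothing rule a model gets no credit for a half-right answer, which is one reason accuracy can fall sharply as candidate sets grow. The post's order-sensitivity finding could be probed with the same harness by shuffling `candidates` and re-scoring.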