This paper introduces Illusory VQA (Visual Question Answering), a novel benchmark task that tests the perceptual capabilities of Vision-Language Models (VLMs) on visual illusions. The authors construct four benchmark datasets, each targeting a different aspect of visual illusion processing, and evaluate several state-of-the-art models on them. They find that CLIP outperforms other models, including AIMv2 and SigLIP 2, at detecting visual illusions and answering questions about them. They also find that reproducing the reported results is harder than expected and that small implementation details can significantly affect model performance. The study underscores the importance of understanding and addressing perceptual limitations in AI systems, particularly in high-stakes settings such as autonomous driving and medical diagnosis.
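To give a concrete sense of the kind of evaluation this involves, the sketch below runs zero-shot CLIP classification on an illusion image using the Hugging Face `transformers` API. The checkpoint, candidate labels, and image path are illustrative assumptions, not the paper's actual experimental configuration.

```python
# Minimal sketch of zero-shot illusion classification with CLIP.
# Checkpoint, labels, and image path are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"   # assumed checkpoint
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("illusion_sample.png")      # hypothetical illusion image
labels = ["a cat", "a dog", "no animal"]       # hypothetical answer candidates
prompts = [f"a photo of {label}" for label in labels]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-prompt similarities; softmax turns them
# into a probability distribution over the candidate answers.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

A protocol along these lines, applied with and without the illusory content present, is one way to probe whether a model's predictions track the illusory percept or the underlying image.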