This paper introduces Illusory VQA (Visual Question Answering), a novel benchmark task that tests the perceptual capabilities of Vision-Language Models (VLMs) on visual illusions. The authors construct four benchmark datasets, each targeting a different aspect of visual illusion processing, and evaluate several state-of-the-art models on them. They find that CLIP outperforms other models, including AIMv2 and SigLIP 2, at detecting visual illusions and answering questions about them. They also find that reproducing the reported results is harder than expected and that small implementation details can significantly affect model performance. The study underscores the importance of understanding and addressing perceptual limitations in AI systems, particularly in high-stakes settings such as autonomous driving and medical diagnosis.
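To give a concrete sense of the kind of evaluation this involves, the sketch below runs zero-shot CLIP classification on an illusion image using the Hugging Face `transformers` API. The checkpoint, candidate labels, and image path are illustrative assumptions, not the paper's actual experimental configuration.

```python
# Minimal sketch of zero-shot illusion classification with CLIP.
# Checkpoint, labels, and image path are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"   # assumed checkpoint
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("illusion_sample.png")      # hypothetical illusion image
labels = ["a cat", "a dog", "no animal"]       # hypothetical answer candidates
prompts = [f"a photo of {label}" for label in labels]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-prompt similarities; softmax turns them
# into a probability distribution over the candidate answers.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

A protocol along these lines, applied with and without the illusory content present, is one way to probe whether a model's predictions track the illusory percept or the underlying image.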