The Best of ICCV 2025 Day 2: Advancing Vision Language Models
Blog post from Voxel51
The second day of the ICCV 2025 conference highlights innovative research that tackles real-world challenges by advancing vision language models (VLMs). "MINDCUBE" evaluates whether VLMs can form spatial mental models from limited viewpoints, revealing that they struggle with spatial reasoning despite strong object recognition. "SGBD" presents a training strategy that makes multimodal recommender systems robust to noisy data, significantly improving recommendation accuracy. "Sari Sandbox" introduces a virtual retail environment for training embodied AI agents, exposing retail-specific tasks that current models handle poorly. Lastly, a novel approach predicts air quality from sky images, demonstrating a scalable alternative to traditional sensor networks while making air quality data easier to interpret through visual simulations.

Together, these projects signal a shift from theoretical benchmarks to practical applications, underscoring the need for AI systems that operate effectively in dynamic, complex environments.