Author: Hafedh Hichri and Ed Daniels
Word count: 1851
Language: -
Hacker News points: None

Summary

Visual Language Models (VLMs) like Idefics3 and SmolVLM are autoregressive models that process both text and images to generate coherent text output. They integrate multimodal data by preparing text and image inputs in a unified format: images are split into patches and represented as tokens, and the processor inserts image placeholders into the text sequence, which are later expanded to match the number of image splits.

The model architecture includes an embedding layer for text, a vision model that converts image patches into high-dimensional patch embeddings, a connector that projects the visual embeddings into the text embedding space, and an input merger that integrates both into a single sequence. The decoder, similar to a traditional language model, uses Masked Multi-Head Attention and a Language Modeling head to generate context-aware text by attending to both visual and textual inputs. This allows VLMs to reason across modalities, making them versatile for multimodal applications, whether handling text-only, image-only, or combined inputs.
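As a hedged illustration of the input-preparation flow described above, the sketch below uses the Hugging Face transformers API: the chat template places an image placeholder in the prompt, the processor expands it based on the image splits, and the model generates text conditioned on both modalities. The model ID HuggingFaceTB/SmolVLM-Instruct, the message structure, and the generation arguments are assumptions drawn from the SmolVLM release rather than details stated in this summary.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed SmolVLM checkpoint; any Idefics3-style model ID should work the same way.
model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# The chat template inserts an <image> placeholder into the text; the processor
# later expands it into one token per image patch/split.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```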
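The input-merger step can be pictured with a minimal sketch: every expanded placeholder position in the text embedding sequence is overwritten with the corresponding patch embedding produced by the vision model and connector. The function name, tensor shapes, and placeholder token ID below are illustrative assumptions, not the library's actual implementation.

```python
import torch

def merge_embeddings(input_ids: torch.LongTensor,
                     text_embeds: torch.Tensor,
                     image_embeds: torch.Tensor,
                     image_token_id: int) -> torch.Tensor:
    """Replace <image> placeholder embeddings with visual patch embeddings.

    input_ids:    (batch, seq_len) token IDs containing the expanded placeholders
    text_embeds:  (batch, seq_len, hidden) output of the text embedding layer
    image_embeds: (num_placeholders, hidden) patch embeddings after the connector
    """
    merged = text_embeds.clone()
    mask = input_ids == image_token_id             # locate every placeholder position
    merged[mask] = image_embeds.to(merged.dtype)   # one patch embedding per placeholder
    return merged

# The decoder then attends over `merged` exactly as it would over an ordinary
# token-embedding sequence, which is what lets it reason across both modalities.
```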