How to Fine-Tune a SmolVLM2 Model on a Custom Dataset
Blog post from Roboflow
Released by Hugging Face in February 2025, SmolVLM2 is a multimodal model that accepts image and video input and is designed for tasks such as visual question answering and structured information retrieval. The guide by James Gallagher walks through fine-tuning and deploying SmolVLM2 with Roboflow, using the conversion of a shipping manifest into JSON as its working example. It covers preparing and annotating a dataset, creating a dataset version, and training the model on a pre-labeled shipping manifest dataset. The trained model is then deployed through Roboflow Workflows on a GPU Dedicated Deployment. The guide closes by showing the model reading shipping manifest data and returning it as a structured JSON object, and gives instructions for deploying the model both in the cloud and on your own hardware.
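To make the end result more concrete, here is a minimal sketch of how a client might turn the model's text response into a Python dictionary. It is not taken from the guide: the function name `extract_json`, the sample response, and the field names (`shipment_id`, `items`, `quantity`) are all hypothetical, and the real output schema depends on how the dataset was annotated.

```python
import json


def extract_json(model_output: str) -> dict:
    """Parse the JSON object embedded in a model's text response.

    Hypothetical helper: assumes the response contains exactly one
    top-level JSON object, possibly surrounded by extra text.
    """
    # Take everything from the first "{" to the last "}".
    start = model_output.find("{")
    end = model_output.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(model_output[start:end + 1])


# Hypothetical response; the fields are illustrative, not the guide's schema.
sample = (
    'Here is the manifest: {"shipment_id": "SHP-001", '
    '"items": [{"name": "widget", "quantity": 4}]}'
)

manifest = extract_json(sample)
print(manifest["shipment_id"])
print(manifest["items"][0]["quantity"])
```

Slicing between the outermost braces is a simple way to tolerate extra text around the object. It still raises an error if the response contains no object or malformed JSON, which is usually what you want before passing the data downstream.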