Comparing Base and Fine-Tuned SmolVLM2 for OCR

Blog post from Roboflow

Post Details
Company: Roboflow
Author: Aryan Vasudevan
Word Count: 1,876
Language: English
Summary

Vision-Language Models (VLMs) combine image and natural language understanding, and this post explores their use in Roboflow Workflows for Optical Character Recognition (OCR) on NBA jerseys. The demonstration builds a Workflow that pairs an object detection model with SmolVLM2, a VLM that can answer questions about images, to streamline the OCR step. The post then compares the base SmolVLM2 with a fine-tuned version on reading jersey numbers from video frames. The fine-tuned model, trained on data specific to the use case and augmented, achieved higher accuracy and faster processing times than the base model. The results support fine-tuning VLMs for specialized tasks such as OCR.
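
The comparison described above comes down to running two models over the same labeled jersey crops and recording accuracy and time per image. The following is a minimal sketch of such a harness, not the post's actual code. The names `evaluate`, `normalize`, `stub_base`, and `stub_fine_tuned` are hypothetical, the stubs stand in for real model calls, and their outputs are invented purely so the example runs.

```python
import time
from typing import Callable, Sequence, Tuple


def normalize(text: str) -> str:
    """Keep only digits so that 'Number: 23.' and '23' compare equal."""
    return "".join(ch for ch in text if ch.isdigit())


def evaluate(
    predict: Callable[[str], str],
    samples: Sequence[Tuple[str, str]],
) -> Tuple[float, float]:
    """Return (accuracy, mean seconds per sample) for one model.

    `predict` takes a crop identifier and returns the model's raw text answer.
    `samples` is a sequence of (crop identifier, expected jersey number).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    correct = 0
    start = time.perf_counter()
    for crop, label in samples:
        if normalize(predict(crop)) == normalize(label):
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(samples), elapsed / len(samples)


# Hypothetical stand-ins for the two models. A real run would replace these
# with calls to the base and fine-tuned SmolVLM2; the answers here are made up.
def stub_base(crop: str) -> str:
    answers = {"crop_a": "The number is 23.", "crop_b": "1", "crop_c": "30"}
    return answers[crop]


def stub_fine_tuned(crop: str) -> str:
    answers = {"crop_a": "23", "crop_b": "7", "crop_c": "30"}
    return answers[crop]


# Toy labels for illustration only, not data from the post.
SAMPLES = [("crop_a", "23"), ("crop_b", "7"), ("crop_c", "30")]

if __name__ == "__main__":
    for name, model in [("base", stub_base), ("fine-tuned", stub_fine_tuned)]:
        accuracy, seconds = evaluate(model, SAMPLES)
        print(f"{name}: accuracy={accuracy:.2f}, mean latency={seconds:.6f}s")
```

Normalizing answers to digits before comparing is a design choice made here because a general-purpose VLM may wrap the number in a sentence; whether the post scored answers this way is not stated in the summary.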