Comparing Base and Fine-Tuned SmolVLM2 for OCR

Post Details

Company

Roboflow

Date Published

July 14, 2025

Author

Aryan Vasudevan

Word Count

1,876

Language

English

Hacker News Points

-

Source URL

blog.roboflow.com/base-vs-fine-tuned-smolvlm2-ocr

Summary

Vision-Language Models (VLMs) combine image and natural language understanding, and this post explores their use in Roboflow Workflows for Optical Character Recognition (OCR) of NBA jersey numbers. The demonstration builds a Workflow that pairs an object detection model with SmolVLM2, a VLM that can answer questions about images, to streamline the OCR step. The post then compares the base and a fine-tuned SmolVLM2 on recognizing jersey numbers from video frames. The fine-tuned model, trained on use-case-specific, augmented data, achieved higher accuracy and faster processing times than the base model. The results underscore the value of fine-tuning a VLM for a narrow task such as OCR.
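The base-versus-fine-tuned comparison described above comes down to two measurements per model: how often the predicted jersey number matches the label, and how long each prediction takes. The following is a minimal sketch of such a harness, not the post's actual evaluation code. The function name `evaluate`, the `predict` callables, and the sample data are all hypothetical stand-ins; in practice each callable would wrap a call to the base or fine-tuned model.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def evaluate(
    predict: Callable[[object], str],
    samples: Iterable[Tuple[object, str]],
) -> Dict[str, float]:
    """Score one model on exact-match accuracy and mean time per prediction.

    `predict` is a hypothetical callable that takes an image (any object)
    and returns the recognized jersey number as a string.
    """
    correct = 0
    total = 0
    elapsed = 0.0
    for image, expected in samples:
        start = time.perf_counter()
        predicted = predict(image)
        elapsed += time.perf_counter() - start
        total += 1
        # Exact string match after trimming whitespace (an assumed metric).
        if predicted.strip() == expected.strip():
            correct += 1
    if total == 0:
        raise ValueError("samples must contain at least one item")
    return {"accuracy": correct / total, "mean_seconds": elapsed / total}


if __name__ == "__main__":
    # Placeholder data and models, for illustration only.
    samples = [("frame_1", "23"), ("frame_2", "30"), ("frame_3", "7")]
    base_model = lambda image: "23"  # stub: always answers "23"
    tuned_model = lambda image: {"frame_1": "23", "frame_2": "30", "frame_3": "7"}[image]
    print("base:", evaluate(base_model, samples))
    print("fine-tuned:", evaluate(tuned_model, samples))
```

Running both models over the same labeled frames keeps the comparison fair; exact-match scoring is one reasonable choice for jersey numbers, since a partially correct number is still the wrong player.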