Nirvana - Vision Encoder for real-time Optical Character Recognition (OCR) and Visual Understanding
Blog post from Video SDK
The paper presents a computer vision system that pairs a Vision Transformer (ViT) encoder with a small language model to address three problems: real-time Optical Character Recognition (OCR), visual reasoning, and integration with existing workflows. The ViT encoder has roughly 600 million parameters and accepts both image and video inputs, using an embedding strategy that captures spatial and temporal information. This design lets the model read text in diverse styles and conditions and generate textual descriptions of visual content. In the reported experiments, the model achieves high accuracy on OCR and visual reasoning tasks while running at real-time speeds, and it reduces integration time by 40%. The authors report improvements over state-of-the-art systems, particularly in character and word accuracy, and argue that the model suits a wide range of computer vision applications.
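
The summary cites character and word accuracy without saying how they are computed. A common convention, assumed here and not confirmed by the source, is one minus the edit distance between the reference and the OCR output, normalized by the reference length. The sketch below illustrates that convention in plain Python; the function names `edit_distance`, `accuracy`, `char_accuracy`, and `word_accuracy` are hypothetical and not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Assumed metric: 1 - (edit distance / reference length), clipped at 0."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(ref_text, hyp_text):
    """Accuracy over individual characters."""
    return accuracy(ref_text, hyp_text)


def word_accuracy(ref_text, hyp_text):
    """Accuracy over whitespace-separated words."""
    return accuracy(ref_text.split(), hyp_text.split())
```

Under this definition, a single wrong letter costs one character but a whole word, so `char_accuracy("the quick brown fox", "the quick brown box")` is close to 1 while `word_accuracy` on the same pair drops to 0.75. That gap is why OCR results usually report both numbers.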