Company
Date Published
Author
Yao Xu, Timo Roman, Lukas Voegtle, Philipp Fischer, Amala Sanjay Deshmukh, Kateryna Chumachenko, and Jarno Seppänen
Word count
1014
Language
-
Hacker News points
None

Summary

NVIDIA has released the Nemotron VLM Dataset V2, which adds 8 million new samples to the previous version for a total of 11 million. The dataset targets optical character recognition (OCR), image reasoning, and video question answering (QA), introduces new data modalities such as video and complex diagrams, and emphasizes reasoning through chain-of-thought data. It includes a novel LaTeX pipeline for producing multilingual OCR training data that preserves precise layout and semantic context. NVIDIA's commitment to transparency and ethical AI is reflected in the comprehensive safety reviews and open-source tools provided alongside the dataset. The dataset supports enterprise-level AI development and is ready for commercial use. Its composition is a mix of image QA, OCR, video QA, and image reasoning samples, and it is available for exploration and use on Hugging Face.