NVIDIA Releases 8 Million Sample Open Dataset and Tooling for OCR, Image Reasoning, Image and Video QA Tasks

Post Details

Company

HuggingFace

Date Published

Oct. 28, 2025

Author

Yao Xu, Timo Roman, Lukas Voegtle, Philipp Fischer, Amala Sanjay Deshmukh, Kateryna Chumachenko, and Jarno Seppänen

Word Count

1,014

Company Posts That Month

41

Source URL

huggingface.co/blog/nvidia/nemotron-vlm-dataset-v2

Summary

NVIDIA has released the Nemotron VLM Dataset V2, which adds 8 million new samples to the previous version and brings the total to 11 million. The dataset targets optical character recognition (OCR), image reasoning, and image and video question answering (QA). It introduces new data modalities such as video and complex diagrams, and it emphasizes reasoning through chain-of-thought data. It also includes a novel LaTeX pipeline for producing multilingual OCR training data that preserves precise layout and semantic context. Alongside the dataset, NVIDIA provides open-source tools and reports comprehensive safety reviews, which the post presents as part of its commitment to transparency and ethical AI. The dataset is intended to support enterprise-level AI development and is ready for commercial use. It comprises a mix of image QA, OCR, video QA, and image reasoning samples and is available for exploration and use on Hugging Face.
