Building a Fast Multilingual OCR Model with Synthetic Data

Post Details

Company

HuggingFace

Date Published

April 17, 2026

Author

Ryan Chesler

Word Count

2,218

Company Posts That Month

61

Source URL

huggingface.co/blog/nvidia/nemotron-ocr-v2

Summary

Ryan Chesler's article describes the development of Nemotron OCR v2, a fast, accurate multilingual Optical Character Recognition (OCR) model trained on synthetic data. Annotated image-text pairs for OCR training are hard to obtain at scale: manual annotation is expensive, existing datasets are skewed toward certain languages, and text extracted from web-scraped PDFs is often noisy. The article proposes synthetic data generation instead: rendering text onto images programmatically yields precise labels at scale, producing large, high-quality datasets across multiple languages. According to the article, Nemotron OCR v2 sharply reduces Normalized Edit Distance (NED, where lower is better) across various languages and processes 34.7 pages per second on a single A100 GPU. The synthetic data pipeline is designed to be extensible: it can support additional languages given appropriate fonts and source texts. The dataset is publicly available for further use or research.
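
For readers unfamiliar with the NED metric mentioned in the summary, the following is a minimal sketch of one common way to compute it: character-level Levenshtein distance divided by the length of the longer string. The summary does not say which normalization the article uses, so the denominator here is an assumption, and the function names `levenshtein` and `normalized_edit_distance` are illustrative rather than taken from the Nemotron OCR v2 code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match.

    Assumption: normalize by the longer of the two strings. Other
    conventions (for example, dividing by the reference length) exist.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are treated as identical
    return levenshtein(prediction, reference) / longest


if __name__ == "__main__":
    # Hypothetical OCR output compared against its ground-truth text.
    print(normalized_edit_distance("Nemotron 0CR", "Nemotron OCR"))
```

Under this convention a score of 0.0 is a perfect transcription and 1.0 means no characters could be aligned, so a lower NED indicates a more accurate OCR output.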

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Data Pipeline	2	770	196	80	+5%
Vector Search	2	1,739	413	146	-27%