
Building a Fast Multilingual OCR Model with Synthetic Data

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published:
Author: Ryan Chesler
Word Count: 2,218
Company Posts That Month: 61
Language: -
Hacker News Points: -
Summary

Ryan Chesler's article describes the development of Nemotron OCR v2, a fast and accurate multilingual Optical Character Recognition (OCR) model built using synthetic data. Collecting annotated image-text pairs for OCR training is hard to scale because manual annotation is expensive, existing datasets are skewed toward certain languages, and text from web-scraped PDFs is often noisy. The article addresses these limitations with synthetic data generation: text is rendered programmatically onto images, so the label for every image is known exactly and data can be produced at scale. This approach yields large, high-quality datasets across multiple languages. Nemotron OCR v2 sharply reduces Normalized Edit Distance (NED, where lower is better) across various languages and processes 34.7 pages per second on a single A100 GPU. The synthetic data pipeline is designed to be extensible and can support additional languages given appropriate fonts and source texts. The dataset is publicly available for further use and research.
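
The summary reports accuracy as Normalized Edit Distance without defining it. As a rough illustration of how such a score can be computed, here is a minimal Python sketch. It assumes NED is the Levenshtein distance divided by the length of the longer string; the article may normalize differently, and the function names below are hypothetical rather than taken from the Nemotron OCR code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Assumed definition: edit distance / length of the longer string.

    Returns a value in [0, 1]; 0.0 means the OCR output matches exactly.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, reference) / longest
```

Under this assumed definition, comparing "kitten" with "sitting" gives an edit distance of 3 and an NED of 3/7, roughly 0.43. A perfect transcription scores 0.0.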

Trends Found in this Post
Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM
Data Pipeline | 2 | 770 | 196 | 80 | +5%
Vector Search | 2 | 1,739 | 413 | 146 | -27%