Building a Fast Multilingual OCR Model with Synthetic Data
Blog post from Hugging Face
Ryan Chesler's article describes the development of Nemotron OCR v2, a fast and accurate multilingual Optical Character Recognition (OCR) model built using synthetic data. Collecting annotated image-text pairs for OCR training is difficult: manual annotation is expensive and limits scale, existing datasets are skewed towards certain languages, and web-scraped PDFs often contain noisy text. To overcome these limitations, the article proposes synthetic data generation, which programmatically renders text onto images and makes data creation both scalable and precise. This approach enables large-scale, high-quality datasets across multiple languages, and Nemotron OCR v2 achieves significant improvements in both accuracy and speed. The new model substantially lowers Normalized Edit Distance (NED), an error metric where lower is better, across various languages, and processes 34.7 pages per second on a single A100 GPU. The synthetic data pipeline is designed to be extensible: it can support additional languages wherever appropriate fonts and source texts are available. The dataset is publicly available for further use or research.
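For readers unfamiliar with the metric, here is a minimal sketch of how a Normalized Edit Distance between an OCR prediction and its reference text could be computed. It assumes one common definition, the Levenshtein distance divided by the length of the longer string; the article's exact normalization may differ, and the function names `levenshtein` and `normalized_edit_distance` are illustrative rather than taken from the Nemotron OCR v2 codebase.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the strings are identical.

    Assumption: normalize by the longer of the two strings. Some evaluations
    normalize by the reference length instead.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings count as a perfect match
    return levenshtein(prediction, reference) / longest


if __name__ == "__main__":
    print(normalized_edit_distance("kitten", "sitting"))
```

Under this definition a score of 0.0 means the prediction matches the reference exactly and 1.0 means no characters could be aligned, so a lower NED indicates a more accurate transcription.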
| Trend | Post Mentions | Total Month Mentions | Posts | Companies | MoM |
|---|---|---|---|---|---|
| Data Pipeline | 2 | 770 | 196 | 80 | +5% |
| Vector Search | 2 | 1,739 | 413 | 146 | -27% |