
How we OCR'ed 30,000 papers using Codex, open OCR models and Jobs

Blog post from Hugging Face

Post Details
Company: Hugging Face
Date Published: -
Author: Niels Rogge
Word Count: 1,246
Language: -
Hacker News Points: -
Summary

Hugging Face converted 27,000 arXiv papers that lack an HTML version into Markdown using an open OCR model, Chandra-OCR 2, so that the HuggingChat-powered chat feature on its platform can work with them. The work ran on Hugging Face Jobs, a serverless compute platform with GPU support, and used OpenAI's Codex to automate deploying the OCR pipeline at scale. After weighing GPU configurations, the team settled on 16 NVIDIA L40S GPUs running in parallel, which proved cost-effective and efficient. The output was stored in Hugging Face Buckets for scalable storage, which simplifies integration and access through the Hugging Face platform. As a result, users can chat with research papers even when arXiv offers no HTML version.
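
To make the parallel setup concrete, here is a minimal sketch of one way 27,000 papers could be split across 16 workers. It is not the code from the post: the `shard` helper and the `paper-00000` style identifiers are hypothetical, and no Hugging Face Jobs API is called. It shows only the partitioning arithmetic.

```python
def shard(items, num_shards):
    """Split items into num_shards contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional item each.
        size = base + (1 if i < extra else 0)
        shards.append(items[start:start + size])
        start += size
    return shards


# Hypothetical paper identifiers; the real pipeline would use arXiv IDs.
paper_ids = [f"paper-{i:05d}" for i in range(27_000)]

# One shard per GPU worker (16 in the setup described above).
shards = shard(paper_ids, 16)
for worker, chunk in enumerate(shards):
    print(f"worker {worker:2d}: {len(chunk)} papers")
```

With these numbers each worker handles 1,687 or 1,688 papers, so no single GPU becomes a long-running straggler purely because of an uneven split.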