How we OCR'ed 30,000 papers using Codex, open OCR models and Jobs
Blog post from Hugging Face
Hugging Face converted 27,000 arXiv papers that lack HTML versions into Markdown using Chandra-OCR 2, an open OCR model, so that the HuggingChat-powered chat feature on its platform can work with them. The work ran on Hugging Face Jobs, a serverless compute platform with GPU support, and used OpenAI's Codex to automate deploying the OCR pipeline at scale. After evaluating GPU configurations, the team settled on 16 Nvidia L40S GPUs running in parallel, which proved cost-effective and efficient. The output was stored in Hugging Face Buckets, a scalable storage layer that simplifies integration and access through the platform. As a result, users can chat with research papers even when arXiv offers no HTML version.
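
To make the parallel setup concrete, here is a minimal sketch of how a list of papers could be divided evenly across 16 workers. It is an illustration only: the round-robin split, the `shard` helper, and the placeholder paper IDs are assumptions, not the pipeline the post describes. Only the counts (27,000 papers, 16 GPUs) come from the summary above.

```python
from typing import List


def shard(items: List[str], num_shards: int) -> List[List[str]]:
    """Split items into num_shards round-robin shards of near-equal size."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Shard i takes items i, i + num_shards, i + 2 * num_shards, ...
    return [items[i::num_shards] for i in range(num_shards)]


# Hypothetical placeholder IDs; a real run would use actual arXiv identifiers.
paper_ids = [f"paper-{n:05d}" for n in range(27_000)]

# One shard per GPU worker (16 workers, as in the summary).
shards = shard(paper_ids, 16)
sizes = [len(s) for s in shards]
print(min(sizes), max(sizes))  # prints "1687 1688"
```

A round-robin split keeps shard sizes within one item of each other, so no single worker ends up with a noticeably longer queue. It does not account for differences in paper length, which a real pipeline might need to balance.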