How we OCR'ed 30,000 papers using Codex, open OCR models and Jobs

Post Details

Company: Hugging Face
Date Published: April 7, 2026
Author: Niels Rogge
Word Count: 1,246
Company Posts That Month: 61
Language: -
Hacker News Points: -
Post removed?: No
Source URL: huggingface.co/blog/nielsr/ocr-papers-jobs

Summary

Hugging Face converted 27,000 arXiv papers that lack an HTML version into Markdown using Chandra-OCR 2, an open OCR model, so that the HuggingChat-powered chat feature on its platform also works for those papers. The work ran on Hugging Face Jobs, a serverless compute platform with GPU support, and OpenAI's Codex was used to automate deploying the OCR processing at scale. Part of the project was selecting a GPU configuration; the final choice was 16 NVIDIA L40S GPUs running in parallel, which proved cost-effective and efficient. The results were stored in Hugging Face Buckets, a scalable storage option, which makes them easier to integrate with and access from the rest of the Hugging Face platform.
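To make the "16 GPUs running in parallel" step concrete, here is a minimal sketch of one way a paper list could be divided among 16 workers. It is an illustration only: the summary does not say how the work was actually partitioned, and the function name `shard_round_robin` and the placeholder paper IDs are hypothetical. It does not call Hugging Face Jobs or any OCR model.

```python
from typing import List, Sequence


def shard_round_robin(items: Sequence[str], num_shards: int) -> List[List[str]]:
    """Split items into num_shards lists by taking every num_shards-th item."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return [list(items[i::num_shards]) for i in range(num_shards)]


# Hypothetical placeholder IDs; a real run would use actual arXiv identifiers.
paper_ids = [f"paper-{n:05d}" for n in range(27_000)]

# One shard per GPU worker (16 is the worker count named in the summary).
shards = shard_round_robin(paper_ids, num_shards=16)
sizes = [len(shard) for shard in shards]
print(min(sizes), max(sizes))
```

With 27,000 items and 16 shards, every worker receives either 1,687 or 1,688 papers, so no single job becomes a long tail. Round-robin assignment is only one option; contiguous chunks or a shared work queue would serve equally well under different assumptions.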

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
Serverless	2	678	211	91	-7%
LLM	1	5,932	1,046	223	-2%
MCP	1	6,108	613	170	+36%
