
Extract text from documents and images with Datalab Marker and OCR

Blog post from Replicate

Post Details
Company: Replicate
Author: andreasjansson
Word Count: 594
Language: English
Summary

Datalab's document parsing and text extraction models, Marker and OCR, are available on Replicate. Marker quickly converts documents such as PDFs and images into markdown or JSON, handling tables and math, and can extract specific fields defined by a JSON Schema. OCR recognizes text in ninety languages and returns reading order and table grids. According to the post, both models outperform established tools such as Tesseract in speed and accuracy, and Marker leads on structured extraction as measured by the olmOCR-Bench benchmark. Both models can be run on Replicate with a few lines of code, and pricing varies by usage mode.
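
To make the JSON Schema extraction idea concrete, here is a minimal Python sketch of how a request to Marker might be assembled. The model identifier `datalab-to/marker`, the input keys `file` and `page_schema`, and the invoice fields are all assumptions for illustration and are not confirmed by this summary; check the model page on Replicate for the actual names. Only `replicate.run` is the standard entry point of the Replicate Python client.

```python
import json

# Assumed model identifier; verify it on the model's Replicate page.
MARKER_MODEL = "datalab-to/marker"


def build_invoice_schema() -> dict:
    """Return a JSON Schema naming the fields to extract (hypothetical invoice fields)."""
    return {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
            "due_date": {"type": "string", "description": "ISO 8601 date"},
        },
        "required": ["invoice_number", "total"],
    }


def build_marker_input(file_url: str, schema: dict) -> dict:
    """Assemble the model input. The key names here are assumptions."""
    return {"file": file_url, "page_schema": json.dumps(schema)}


def run_marker(file_url: str, schema: dict):
    """Call the model on Replicate. Requires `pip install replicate`
    and the REPLICATE_API_TOKEN environment variable."""
    import replicate  # imported lazily so the helpers above work without it

    return replicate.run(MARKER_MODEL, input=build_marker_input(file_url, schema))
```

Calling `run_marker("https://example.com/invoice.pdf", build_invoice_schema())` would send the document and schema to the model. Keeping the payload construction separate from the network call makes the schema easy to inspect and test before spending any credits.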