Extract text from documents and images with Datalab Marker and OCR
Blog post from Replicate
Datalab's document parsing and text extraction models, Marker and OCR, are available on Replicate. They convert documents such as PDFs and images into markdown or JSON. Marker turns documents into structured data quickly, handles tables and math, and can extract specific fields when given a JSON Schema. OCR recognizes text in 90 languages and returns reading order and table grids along with the text. Both models outperform established tools like Tesseract in speed and accuracy, and Marker is particularly strong at structured extraction, as shown by its results on the olmOCR-Bench benchmark. Both can be run from code through Replicate, and pricing varies by usage mode.
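To make the JSON Schema idea concrete, here is a minimal sketch of calling Marker through the Replicate Python client. The summary above does not give the exact identifiers, so several names are assumptions to check against the model page: the model slug `datalab-to/marker`, the input keys `file` and `page_schema`, and the invoice schema, which is purely illustrative. The helper names `build_marker_input` and `run_marker` are hypothetical.

```python
import json
from typing import Any, Dict, Optional

# Assumed model slug; confirm the exact name on the Replicate model page.
MARKER_MODEL = "datalab-to/marker"

# Hypothetical schema describing the fields to pull out of an invoice.
INVOICE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}


def build_marker_input(
    file_url: str, schema: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build the input payload for a Marker run.

    The keys "file" and "page_schema" are assumptions about the model's
    inputs, not confirmed by the text above.
    """
    payload: Dict[str, Any] = {"file": file_url}
    if schema is not None:
        # Serialize the schema to a JSON string before sending it.
        payload["page_schema"] = json.dumps(schema)
    return payload


def run_marker(file_url: str, schema: Optional[Dict[str, Any]] = None) -> Any:
    """Send the document to Marker and return whatever the model outputs."""
    # Imported lazily so the payload helper works without the client installed.
    # Requires `pip install replicate` and the REPLICATE_API_TOKEN variable.
    import replicate

    return replicate.run(MARKER_MODEL, input=build_marker_input(file_url, schema))


if __name__ == "__main__":
    # Print the payload only; no network call is made here.
    example = build_marker_input("https://example.com/invoice.pdf", INVOICE_SCHEMA)
    print(json.dumps(example, indent=2))
```

Keeping the payload builder separate from the network call lets the request be inspected before spending credits. Calling `run_marker("https://example.com/invoice.pdf", INVOICE_SCHEMA)` would perform the actual request, provided the assumed names match the published model.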