Introducing Fire-PDF: Firecrawl's New PDF Parsing Engine
Blog post from Firecrawl
Fire-PDF is Firecrawl's new PDF parsing engine, built in Rust to balance speed and accuracy on complex documents. It converts any PDF, whether text-based, scanned, or mixed, into structured markdown while maintaining the correct reading order, preserving tables and formulas, and handling multi-column layouts. Parsing averages under 400ms per page. That speed comes from pdf-inspector, a Rust library that quickly classifies each page: text-based pages bypass GPU processing entirely, and only scanned or image-heavy pages are sent through a neural layout model and OCR. This selective processing reduces GPU usage and cost, and yields a 3.5-5.7x improvement over previous parsers. The neural document layout model detects the different elements on a page so that complex documents are assembled into markdown correctly. Fire-PDF is integrated into the Firecrawl API, so PDFs are parsed automatically with no additional configuration.
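The selective routing described above can be illustrated with a minimal sketch. This is not Fire-PDF's or pdf-inspector's actual code or API: the `PageStats` fields, the thresholds, and the route names are all hypothetical, chosen only to show how a cheap per-page classification can keep text-based pages off the GPU.

```python
from dataclasses import dataclass
from enum import Enum


class PageKind(Enum):
    TEXT = "text"
    SCANNED = "scanned"
    MIXED = "mixed"


@dataclass
class PageStats:
    """Hypothetical per-page signals; not pdf-inspector's real data model."""
    text_chars: int        # characters recoverable from the embedded text layer
    image_coverage: float  # fraction of the page area covered by images (0.0 to 1.0)


def classify_page(stats: PageStats,
                  min_chars: int = 200,
                  max_image_coverage: float = 0.5) -> PageKind:
    """Classify a page from cheap signals. Both thresholds are illustrative assumptions."""
    has_text = stats.text_chars >= min_chars
    image_heavy = stats.image_coverage > max_image_coverage
    if has_text and not image_heavy:
        return PageKind.TEXT
    if not has_text:
        # Little or no text layer: treat the page as scanned.
        return PageKind.SCANNED
    # Usable text layer, but mostly images: treat as mixed.
    return PageKind.MIXED


def route_page(kind: PageKind) -> str:
    """Text pages skip the GPU; everything else goes through layout detection and OCR."""
    return "cpu_text_extraction" if kind is PageKind.TEXT else "gpu_layout_and_ocr"


def plan_document(pages: list[PageStats]) -> list[str]:
    """Return the processing route chosen for each page, in page order."""
    return [route_page(classify_page(p)) for p in pages]


def gpu_fraction(plan: list[str]) -> float:
    """Share of pages that need the GPU path (0.0 for an empty document)."""
    if not plan:
        return 0.0
    return sum(route == "gpu_layout_and_ocr" for route in plan) / len(plan)


if __name__ == "__main__":
    # A made-up four-page document: two text pages, one scan, one image-heavy page.
    pages = [
        PageStats(text_chars=1800, image_coverage=0.0),
        PageStats(text_chars=0, image_coverage=1.0),
        PageStats(text_chars=900, image_coverage=0.7),
        PageStats(text_chars=2500, image_coverage=0.1),
    ]
    plan = plan_document(pages)
    print(plan)
    print(f"GPU pages: {gpu_fraction(plan):.0%}")
```

In this toy document only half of the pages take the GPU path. The more text-based pages a PDF contains, the less GPU time it needs, which is the effect the selective processing relies on.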