Introducing PDF Parser v2: Faster Extraction with Auto Mode
Blog post from Firecrawl
Firecrawl has released PDF Parser v2, built on a Rust-based parsing engine that extracts data from PDFs more reliably and up to three times faster than the previous version. The parser offers three modes (Fast, Auto, and OCR), each tailored to different document types, from clean text-based PDFs to complex layouts and image-only files. Auto mode, the default, combines rapid text extraction with an automatic fallback to OCR, so it can handle documents with mixed encodings or intricate structures without losing content. This lets AI agents and knowledge bases process complex documents such as technical papers and regulatory filings more effectively, which yields more accurate embeddings and better retrieval for applications in AI search, deep research, and real-time market intelligence. Existing users need no code changes to use the new parser, and it should simplify the extraction of structured data from complex PDF sources.
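To make the Auto mode idea concrete, here is a minimal Python sketch of a "fast extraction first, OCR as fallback" strategy. It is not Firecrawl's implementation or API: the names `extract_auto`, `looks_usable`, and `PageResult`, the per-page granularity, and the usability heuristic (a minimum character count and a share of printable characters) are all assumptions made for illustration. The two extractors are passed in as plain functions so the sketch runs without any PDF library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class PageResult:
    text: str
    method: str  # "fast" or "ocr"


def looks_usable(text: str, min_chars: int = 20, min_printable: float = 0.9) -> bool:
    """Hypothetical heuristic: enough text, and most of it printable."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
    return printable / len(stripped) >= min_printable


def extract_auto(
    pages: Iterable[Any],
    fast_extract: Callable[[Any], str],
    ocr_extract: Callable[[Any], str],
) -> List[PageResult]:
    """Try the fast text-layer extractor on each page; fall back to OCR
    only when the fast result looks empty or garbled (assumed behavior)."""
    results = []
    for page in pages:
        text = fast_extract(page)
        if looks_usable(text):
            results.append(PageResult(text, "fast"))
        else:
            results.append(PageResult(ocr_extract(page), "ocr"))
    return results


if __name__ == "__main__":
    # Stand-in pages: dicts instead of real PDF pages.
    demo_pages = [
        {"text": "This page has a clean embedded text layer.", "image_text": ""},
        {"text": "", "image_text": "Scanned page recovered by OCR."},
    ]
    for result in extract_auto(
        demo_pages, lambda p: p["text"], lambda p: p["image_text"]
    ):
        print(result.method, "->", result.text)
```

The point of this design is that the expensive OCR step runs only on pages where the cheap path fails, which is one plausible way a default mode could stay fast on clean documents while still covering scanned or badly encoded ones.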