Up to 100x Faster Parsing with LiteParse v2.0 and Rust
Blog post from LlamaIndex
LiteParse launched as a PDF extractor that ran only as a Node/TypeScript package. It has since been expanded into a tool available in Rust, Node, Python, and WASM, so it can run on a range of platforms, including browsers and edge runtimes. The move to Rust brings a 5-100x speedup on small documents and a 3x speedup on larger ones, which makes LiteParse competitive with other PDF parsing utilities. The gains come from a custom build of PDFium, with tesseract-rs handling OCR. The Rust implementation also simplifies integration across the language bindings, making LiteParse easier to distribute and maintain. The WASM package runs LiteParse directly in the browser and handles OCR through callbacks, which suits real-time applications that need fast document parsing.
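To make the idea of OCR handled via callbacks concrete, here is a minimal sketch of the general pattern in Python. It is not LiteParse's API: `Page`, `parse_pages`, `OcrCallback`, and `fake_ocr` are hypothetical names invented for illustration. The assumption is that the parser extracts whatever text layer a page has and calls a host-supplied function only for pages without one, so the OCR engine can live outside the parser (for example, in the browser).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical type: the host supplies a function that turns image bytes into text.
OcrCallback = Callable[[bytes], str]


@dataclass
class Page:
    """Hypothetical page record, not a LiteParse type."""
    text: str                      # embedded text layer; empty for scanned pages
    image: Optional[bytes] = None  # rendered page bitmap, if one is available


def parse_pages(pages: List[Page], ocr: Optional[OcrCallback] = None) -> List[str]:
    """Return one string per page, calling `ocr` only where no text layer exists."""
    results = []
    for page in pages:
        if page.text.strip():
            # A text layer exists, so no OCR is needed.
            results.append(page.text)
        elif ocr is not None and page.image is not None:
            # No text layer: delegate recognition to the host's callback.
            results.append(ocr(page.image))
        else:
            # No text and no way to recognize it: emit an empty page.
            results.append("")
    return results


def fake_ocr(image: bytes) -> str:
    # Stand-in for a real OCR engine supplied by the host environment.
    return f"<ocr:{len(image)} bytes>"


pages = [Page(text="Hello"), Page(text="", image=b"\x00\x01\x02")]
result = parse_pages(pages, ocr=fake_ocr)
```

Keeping OCR behind a callback means the parsing core carries no dependency on a particular OCR engine, and callers that never see scanned pages can omit the callback entirely.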