Introducing SCORE-Bench: An Open Benchmark for Document Parsing
Blog post from Unstructured
In the document parsing field, a lack of transparency makes it hard to compare system accuracy fairly, and traditional evaluation metrics are ill-suited to modern vision-language models, which can produce many different yet equally valid outputs for the same document.

To address this, SCORE-Bench, a newly open-sourced benchmark dataset, offers a diverse collection of real-world documents with expert annotations, enabling fair comparisons and independent validation of document parsing systems. SCORE-Bench includes complex and varied formats, such as handwritten forms and technical manuals, and covers real-world challenges like poor scan quality and mixed languages, helping to distinguish robust, production-ready systems from research prototypes.

The accompanying Structural and Content Robust Evaluation (SCORE) framework mitigates the biases of traditional metrics by evaluating systems on content fidelity, hallucination control, and table extraction. Documents with skewed text, dense layouts, and semantic ambiguity prove particularly challenging for parsing systems under this evaluation.

The Unstructured pipelines demonstrate leading performance across several metrics, including Adjusted Clean Concatenated Text (CCT) for content fidelity, while maintaining low hallucination rates, establishing themselves as the most production-ready solutions. The dataset and evaluation code are available on Hugging Face and GitHub, inviting the community to test and benchmark their systems using this comprehensive methodology.
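The post does not spell out the Adjusted CCT formula, but the idea of a content-fidelity score over cleaned, concatenated text can be sketched with a normalized edit distance. This is a minimal illustration, not SCORE's actual implementation; the function names and the whitespace-normalization choice are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute),
    # using a rolling row to keep memory at O(len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def content_fidelity(parsed: str, reference: str) -> float:
    # Hypothetical CCT-style score: 1.0 = perfect content match, 0.0 = none.
    # Whitespace is collapsed before comparison so purely layout-level
    # differences are not penalized, echoing the "clean concatenated text"
    # idea of scoring content rather than formatting.
    p = " ".join(parsed.split())
    r = " ".join(reference.split())
    if not p and not r:
        return 1.0
    return 1.0 - levenshtein(p, r) / max(len(p), len(r))
```

Under this sketch, a parser that reflows line breaks but preserves every character scores 1.0, while hallucinated or dropped text lowers the score in proportion to how much of the reference it corrupts.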