Evaluating OCR-to-Markdown Systems Is Fundamentally Broken (and Why That’s Hard to Fix)
Blog post from Nanonets
Evaluating OCR systems that convert PDFs or document images into Markdown is hard because a single score must capture content fidelity, layout, reading order, and representation choices simultaneously. Unlike plain-text OCR, OCR-to-Markdown has no one correct output: a multi-column page admits more than one valid reading order, and the same equation can be written in several equivalent forms.

Common evaluation techniques do not account for this. String-based metrics and order-sensitive block matching bake in rigid assumptions about structure and formatting, so they routinely misclassify correct outputs as failures. Benchmarks such as olmOCRBench and OmniDocBench illustrate the problem: implicit rules about which content may be omitted, and strict LaTeX string equivalence, end up penalizing predictions that are correct but do not match the benchmark's predefined assumptions.

This gap between human judgment and benchmark scoring shows that current evaluation methods are inadequate, and it suggests that using large language models as judges, despite their imperfections, may be the more practical approach.
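As a toy illustration (not code from the post, and the block names are invented), here is how strict string equivalence and order-sensitive sequence matching can both fail a prediction a human would accept:

```python
import difflib

# Equivalent LaTeX, different strings: strict string equivalence
# (as some benchmarks use) marks the prediction wrong even though
# both forms render identically.
gold_eq = r"x^{2}+1"
pred_eq = r"x^2+1"
print(gold_eq == pred_eq)  # False: a correct equation is penalized

# Order-sensitive block matching: for a two-column page, both of
# these block sequences can be legitimate reading orders, yet a
# sequence-similarity score drops as soon as two blocks swap.
gold_blocks = ["title", "col1-para", "col2-para", "footer"]
pred_blocks = ["title", "col2-para", "col1-para", "footer"]
sim = difflib.SequenceMatcher(None, gold_blocks, pred_blocks).ratio()
print(sim)  # 0.75: a valid reading order loses a quarter of the score
```

The same effect shows up with edit-distance metrics over the full Markdown string: any divergence from the reference serialization counts as error, whether or not it changes what a reader sees.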