
Announcing RD-TableBench: An Open-Source Table Benchmark

Blog post from Reducto

Post Details
Company: Reducto
Author: -
Word Count: 687
Language: English
Summary

RD-TableBench is an open benchmark for evaluating how well models extract complex tables, covering difficult cases such as scanned tables, handwriting, and merged cells. A team of PhD-level human labelers manually annotated 1,000 complex table images drawn from publicly available documents, chosen to vary in structure, text density, and language. The initial evaluation covered Reducto, Azure Document Intelligence, and AWS Textract Tables, among others, each run with high-quality settings where applicable. To measure table similarity, RD-TableBench uses a hierarchical alignment approach similar to DNA sequence alignment: the Needleman-Wunsch algorithm is applied at both the cell level and the row level, Levenshtein distance scores the cell-level comparisons, and the final similarity score is normalized to the range 0 to 1. Unlike existing datasets such as PubTabNet and FinTabNet, RD-TableBench aims to provide a richer set of real-world examples with accurate manual annotations. The benchmark is intended primarily for evaluation and testing. Only a subset of the evaluation framework is being released, in order to maintain scoring integrity, since released data could be used in future model training.
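
The hierarchical scoring described in the summary can be illustrated with a minimal Python sketch. This is not the RD-TableBench implementation: the function names, the zero gap penalty, and the normalization by the longer sequence are all assumptions made for illustration, and the real benchmark may use different penalties and scoring details. The sketch applies Needleman-Wunsch twice, once to align cells within a pair of rows and once to align rows within a pair of tables, with a Levenshtein-based similarity at the cell level.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def cell_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def needleman_wunsch(xs: Sequence[T], ys: Sequence[T],
                     sim: Callable[[T, T], float],
                     gap: float = 0.0) -> float:
    """Best global alignment score. gap=0.0 is an assumed penalty."""
    n, m = len(xs), len(ys)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(xs[i - 1], ys[j - 1]),  # match
                score[i - 1][j] + gap,                             # skip in xs
                score[i][j - 1] + gap,                             # skip in ys
            )
    return score[n][m]


def row_similarity(r1: List[str], r2: List[str]) -> float:
    """Align the cells of two rows; normalize by the longer row."""
    longest = max(len(r1), len(r2))
    if longest == 0:
        return 1.0
    return needleman_wunsch(r1, r2, cell_similarity) / longest


def table_similarity(t1: List[List[str]], t2: List[List[str]]) -> float:
    """Align the rows of two tables; normalize by the taller table."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0
    return needleman_wunsch(t1, t2, row_similarity) / longest
```

Under these assumptions, two identical tables score 1.0, and a prediction that drops one of two rows scores 0.5, because the missing row contributes nothing to the alignment while the normalization still counts it. A nonzero gap penalty would punish missing or spurious rows more heavily, at the cost of needing extra care to keep the final score within 0 to 1.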