Announcing RD-TableBench: An Open-Source Table Benchmark

Post Details

Company

Reducto

Date Published

Nov. 4, 2024

Author

-

Word Count

687

Company Posts That Month

2

Language

English

Hacker News Points

-

Post removed?

No

Source URL

reducto.ai/blog/rd-tablebench

Summary

RD-TableBench is an open benchmark for assessing how well different models extract complex tables, covering scenarios such as scanned tables, handwriting, and merged cells. A team of PhD-level human labelers manually annotated 1,000 complex table images drawn from publicly available documents, chosen to vary in structure, text density, and language. The initial evaluation covered tools including Reducto, Azure Document Intelligence, and AWS Textract Tables, each tested with high-quality settings where applicable. To measure table similarity, RD-TableBench uses a hierarchical alignment approach akin to DNA sequence alignment: the Needleman-Wunsch algorithm is applied at both the cell level and the row level, with Levenshtein distance used for cell-level comparisons, and the final similarity score is normalized between 0 and 1. Unlike other datasets such as PubTabNet and FinTabNet, RD-TableBench aims to provide a richer set of real-world examples with accurate manual annotations. Its primary purpose is evaluation and testing; acknowledging that this data could be used in future model training, a subset of the evaluation framework is being released to maintain scoring integrity.
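
The hierarchical scoring described in the summary can be illustrated with a minimal Python sketch. This is not Reducto's released implementation: the function names, the zero gap penalty, and the choice to normalize by the longer sequence are all assumptions made for illustration. It aligns cells within rows, then rows within tables, using Needleman-Wunsch at both levels and a Levenshtein-based cell similarity.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cell_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer string is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def nw_score(
    xs: Sequence[T],
    ys: Sequence[T],
    sim: Callable[[T, T], float],
    gap: float = 0.0,  # hypothetical gap penalty, not taken from the benchmark
) -> float:
    """Needleman-Wunsch global alignment score that maximizes summed similarity."""
    prev = [j * gap for j in range(len(ys) + 1)]
    for i, x in enumerate(xs, start=1):
        curr = [i * gap]
        for j, y in enumerate(ys, start=1):
            curr.append(
                max(
                    prev[j - 1] + sim(x, y),  # align x with y
                    prev[j] + gap,            # x aligned to a gap
                    curr[j - 1] + gap,        # y aligned to a gap
                )
            )
        prev = curr
    return prev[-1]


def row_similarity(r1: Sequence[str], r2: Sequence[str]) -> float:
    """Cell-level alignment of two rows, normalized to [0, 1]."""
    longest = max(len(r1), len(r2))
    if longest == 0:
        return 1.0
    return nw_score(r1, r2, cell_similarity) / longest


def table_similarity(t1: List[List[str]], t2: List[List[str]]) -> float:
    """Row-level alignment of two tables, normalized to [0, 1]."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0
    return nw_score(t1, t2, row_similarity) / longest
```

With a zero gap penalty, a missing row or cell lowers the score only through the normalization by the longer sequence; a negative `gap` would penalize structural mismatches more heavily, at the cost of needing to clamp the result to keep it within 0 and 1.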

