How to Extract Data from Tables in PDF

Post Details

Company

LlamaIndex

Date Published

Nov. 26, 2025

Author

Neeraj Pradhan

Word Count

1,688

Language

English

Hacker News Points

-

Source URL

www.llamaindex.ai/blog/extracting-repeating-entities-from-documents

Summary

LlamaExtract is a structured extraction API that uses large language models (LLMs) for flexible data extraction, avoiding the brittleness of traditional template-based systems, which tend to break when a document's format changes. It introduces a Table Row extraction target for repeating entities, such as the rows of a table or the items in a catalog, which applies the schema at entity level rather than document level. This addresses a known weakness of LLMs, which struggle to enumerate long lists exhaustively: the document is segmented, the repeating pattern is recognized, and the schema is applied to each entity in turn. Because the extraction granularity matches the document's structure, the method aims for comprehensive coverage that combines LLM flexibility with reliability comparable to template-based systems. The post demonstrates the approach on two real-world examples, hospital data from a structured table and products from a semi-structured toy catalog, showing that it adapts to different document formats.
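
To make the idea of entity-level granularity concrete, here is a minimal Python sketch. It is not LlamaExtract's API or implementation: the `HospitalRow` schema, the function names, and the sample table are all hypothetical, and a simple string split stands in for the per-entity LLM call. It only illustrates the shape of the approach, which is to segment the document into entities and then apply the same schema to each one.

```python
from dataclasses import dataclass


@dataclass
class HospitalRow:
    # Hypothetical per-entity schema; field names are illustrative only.
    name: str
    city: str
    beds: int


def segment_rows(table_text: str) -> list[str]:
    """Split a pipe-delimited table into one text chunk per data row."""
    lines = [ln.strip() for ln in table_text.strip().splitlines() if ln.strip()]
    return lines[1:]  # drop the header line


def extract_row(row_text: str) -> HospitalRow:
    """Apply the schema to one row (a stand-in for a per-entity LLM call)."""
    name, city, beds = (cell.strip() for cell in row_text.split("|"))
    return HospitalRow(name=name, city=city, beds=int(beds))


def extract_table(table_text: str) -> list[HospitalRow]:
    """Entity-level extraction: one schema application per segmented row."""
    return [extract_row(row) for row in segment_rows(table_text)]


# Made-up sample data, used only to exercise the sketch.
SAMPLE = """
name | city | beds
General Hospital | Springfield | 120
Riverside Clinic | Shelbyville | 45
"""

rows = extract_table(SAMPLE)
```

Because each row is handled independently, the number of extracted records follows the number of segmented rows instead of depending on a single pass enumerating every item in a long list.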