
Preserving Table Structure for Better Retrieval

Blog post from Unstructured

Post Details
Company: Unstructured
Author: Ajay Krishnan
Word Count: 1,664
Language: English
Hacker News Points: -
Summary

The post outlines a workflow for highly structured documents such as 10-Qs, earnings releases, and technical briefs. Its goal is to preserve their intricate table structures, which standard processing pipelines often lose. The workflow chains a series of nodes (partitioners, summarizers, chunkers, and embedders) so that the visual and semantic integrity of each document survives processing. It starts by extracting document structure with Unstructured's hi_res partitioning strategy, which retains the layout, structure, and block types of multi-row tables and images. Summarization nodes then enrich visual elements with natural language descriptions, while chunkers and embedders split and vectorize text blocks for storage in Astra DB. Because the structure is preserved, a semantic search can retrieve the precise element and render it in its original format, which helps applications that need exact data references, clean visual displays, or further analytical processing. Traditional pipelines, by contrast, often flatten this data and lose valuable information, which is why maintaining document structure matters for effective retrieval and use.
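
The core idea, keeping each table as one chunk that carries its original markup alongside its searchable text, can be sketched in plain Python. This is a minimal illustration and not Unstructured's API: the `Element` class, `chunk_elements`, `search`, and `table_rows` are hypothetical names, the `text_as_html` metadata key is an assumption about how a partitioner might store table markup, keyword overlap stands in for the embedding search against Astra DB, and the figures in the sample table are made up.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Element:
    """Hypothetical stand-in for one partitioned document element."""
    type: str   # e.g. "NarrativeText" or "Table"
    text: str   # plain-text rendering, used for search
    metadata: dict = field(default_factory=dict)


def chunk_elements(elements, max_chars=200):
    """Merge consecutive text elements; keep every table as its own chunk."""
    chunks, buffer = [], []

    def flush():
        if buffer:
            chunks.append(Element("CompositeText", " ".join(buffer)))
            buffer.clear()

    for el in elements:
        if el.type == "Table":
            flush()
            chunks.append(el)  # table stays atomic, HTML metadata intact
        else:
            if buffer and len(" ".join(buffer)) + 1 + len(el.text) > max_chars:
                flush()
            buffer.append(el.text)
    flush()
    return chunks


def search(chunks, query):
    """Toy retrieval: return the chunk sharing the most words with the query."""
    terms = set(query.lower().split())
    return max(chunks, key=lambda c: len(terms & set(c.text.lower().split())))


class _RowParser(HTMLParser):
    """Collect the cell text of each <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_rows(chunk):
    """Rebuild the table's rows from the markup preserved in its metadata."""
    parser = _RowParser()
    parser.feed(chunk.metadata["text_as_html"])
    return parser.rows


# Made-up sample document: two paragraphs around one table.
elements = [
    Element("NarrativeText", "Revenue grew in the third quarter."),
    Element(
        "Table",
        "Segment Q3 Q2 Cloud 120 100 Devices 80 90",
        {"text_as_html": "<table><tr><th>Segment</th><th>Q3</th><th>Q2</th></tr>"
                         "<tr><td>Cloud</td><td>120</td><td>100</td></tr>"
                         "<tr><td>Devices</td><td>80</td><td>90</td></tr></table>"},
    ),
    Element("NarrativeText", "Management expects continued growth."),
]

chunks = chunk_elements(elements)
hit = search(chunks, "cloud segment q3")
rows = table_rows(hit)
```

The flattened `text` field is enough to find the table, but only the preserved markup lets the caller recover rows and columns for exact references or rendering. A pipeline that stored the flattened string alone would return the same hit with no way to tell which number belongs to which segment or quarter.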