Improving Table Parsing for Word (.docx) Documents

Blog post from LlamaIndex

Post Details
Author: Jerry Liu
Word Count: 756
Language: English
Summary

LlamaParse now parses tables in Word documents by reading the structural information that the .docx format already contains. In a PDF, a table is only line segments and text placed at absolute coordinates, so its structure has to be inferred. A .docx file, by contrast, is a ZIP archive of XML files that follow Microsoft's Office Open XML (OOXML) specification, which explicitly defines tables, rows, and cells along with formatting details such as bold, italic, and merged cells. One difficulty is pagination: a .docx document does not define page boundaries, so LlamaParse uses a proprietary technique that maps each Word XML table element to its correct page position in the rendered output. With that mapping in place, table contents can be converted directly to markdown with complex formatting and structure preserved, which was difficult with "naive" conversion methods or when relying solely on vision language models (VLMs). The update currently covers tables only, but it may be extended to other structural elements in the future, which would further benefit table-heavy documents.
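
To make the point about explicit structure concrete, here is a minimal sketch in Python that uses only the standard library. It is not LlamaParse's implementation: it skips the proprietary page-mapping step and does not handle merged cells. The sample document and the helper names (`make_sample_docx`, `docx_tables_to_markdown`, and so on) are hypothetical and exist only for illustration. The element names (`w:tbl`, `w:tr`, `w:tc`, `w:r`, `w:t`, `w:b`, `w:i`) come from the WordprocessingML part of Office Open XML.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, stripped-down word/document.xml holding one 2x2 table.
# A real .docx also carries styles, relationships, and content-type parts.
SAMPLE_XML = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Item</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Price</w:t></w:r></w:p></w:tc>
      </w:tr>
      <w:tr>
        <w:tc><w:p><w:r><w:t>Widget</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>4.50</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
  </w:body>
</w:document>"""


def make_sample_docx():
    """Build an in-memory ZIP that mimics the layout of a .docx file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", SAMPLE_XML)
    return buf.getvalue()


def flag_on(props, tag):
    """True if a toggle such as w:b is present and not explicitly disabled."""
    el = props.find(tag, NS)
    return el is not None and el.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


def run_to_markdown(run):
    """Convert one text run, wrapping it in markdown bold/italic markers."""
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    props = run.find("w:rPr", NS)
    if text and props is not None:
        if flag_on(props, "w:i"):
            text = f"*{text}*"
        if flag_on(props, "w:b"):
            text = f"**{text}**"
    return text


def cell_to_markdown(cell):
    """Join the paragraphs of a cell and escape pipes so rows stay intact."""
    paragraphs = [
        "".join(run_to_markdown(r) for r in p.findall("w:r", NS))
        for p in cell.findall("w:p", NS)
    ]
    return " ".join(p for p in paragraphs if p).replace("|", "\\|")


def docx_tables_to_markdown(data):
    """Return one markdown table per w:tbl element found in the document."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    tables = []
    for tbl in root.iter(f"{{{W}}}tbl"):
        rows = [
            [cell_to_markdown(tc) for tc in tr.findall("w:tc", NS)]
            for tr in tbl.findall("w:tr", NS)
        ]
        if not rows:
            continue
        width = max(len(r) for r in rows)
        rows = [r + [""] * (width - len(r)) for r in rows]  # pad short rows
        lines = ["| " + " | ".join(rows[0]) + " |", "|" + " --- |" * width]
        lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
        tables.append("\n".join(lines))
    return tables


if __name__ == "__main__":
    print(docx_tables_to_markdown(make_sample_docx())[0])
```

Because rows and cells are named elements, no geometry has to be inferred: the sketch simply walks the tree. What it cannot tell you is which page a table lands on, since that depends on rendering, and this is the gap the page-mapping technique described above is meant to close.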