Extract Tabular Data from PDFs, Images or Documents

Company

Nanonets

Date Published

Feb. 21, 2022

Author

Tim Cheng

Word count

1945

Language

English

Hacker News points

None

URL

nanonets.com/blog/extract-tabular-data

Summary

Thanks to their simplicity, Excel tables have become the predominant way of storing structured data digitally. These seemingly simple spreadsheets are tightly linked to the daily data processing of large corporations and organizations. In a few clicks, companies can distribute tasks to different workers, keep track of budgets across multiple cash flows, and even make predictions from past data. However, extracting data from pre-existing tables, scans, or images in the first place isn't easy. Error-free tabular data extraction seems within reach yet remains difficult to achieve. The task can be divided into two sub-problems: 1) extracting tables from scans, images, or PDF documents whose format is not machine-readable, and 2) interpreting the words inside table cells so that they can be properly imported into CSV files for spreadsheets. Use cases include business cash flow tracking, cross-business record transfer, and work at accounting firms. At a high level, the approach uses deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to classify documents, detect tables, and perform optical character recognition (OCR). The process also requires converting PDF files into image formats, finding tables within the images, and extracting their content via the Google Vision API or other OCR services. For non-technical users, Nanonets offers a user-friendly interface for extracting tabular data from invoices, receipts, and other documents without requiring coding knowledge.
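To make the second sub-problem concrete, here is a minimal Python sketch of turning OCR output into CSV rows. It is an illustration under stated assumptions, not the method used in the article: the `Word` record, the `row_tol` and `col_gap` thresholds, and all function names are hypothetical, and the input shape is a simplified stand-in for what an OCR service returns rather than the actual Google Vision API response format. Words are grouped into rows by vertical position, then adjacent words within a row are merged into one cell when the horizontal gap between them is small.

```python
import csv
import io
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical OCR result: recognized text plus a simplified bounding box."""
    text: str
    x0: float  # left edge, in pixels
    x1: float  # right edge, in pixels
    y: float   # vertical centre, in pixels


def group_rows(words: List[Word], row_tol: float = 10.0) -> List[List[Word]]:
    """Cluster words into rows; a word joins the current row when its y is
    within row_tol of the first word of that row."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.y, w.x0)):
        if rows and abs(w.y - rows[-1][0].y) <= row_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Within each row, read left to right.
    return [sorted(r, key=lambda w: w.x0) for r in rows]


def row_to_cells(row: List[Word], col_gap: float = 25.0) -> List[str]:
    """Merge neighbouring words into one cell when the gap between them is
    at most col_gap; a larger gap starts a new cell."""
    cells: List[List[Word]] = [[row[0]]]
    for w in row[1:]:
        if w.x0 - cells[-1][-1].x1 <= col_gap:
            cells[-1].append(w)
        else:
            cells.append([w])
    return [" ".join(w.text for w in cell) for cell in cells]


def words_to_csv(words: List[Word], row_tol: float = 10.0,
                 col_gap: float = 25.0) -> str:
    """Return CSV text built from the recognized words."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in group_rows(words, row_tol):
        writer.writerow(row_to_cells(row, col_gap))
    return buf.getvalue()


if __name__ == "__main__":
    sample = [
        Word("Item", 10, 40, 20), Word("Unit", 200, 235, 21),
        Word("price", 240, 285, 20),
        Word("Widget", 10, 70, 50), Word("4.50", 200, 240, 52),
    ]
    print(words_to_csv(sample), end="")
```

Fixed pixel thresholds are the simplest possible choice and will break on skewed scans, merged cells, or tightly spaced columns, which is part of why the article points to learned table detection rather than hand-tuned rules.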