
An Introduction to Vision Transformers for Document Understanding

Blog post from Unstructured

Post Details
Company: Unstructured
Date Published:
Author:
Word Count: 592
Language: English
Hacker News Points: -
Summary

Document understanding pipelines combine computer vision (CV) and natural language processing (NLP) in an encoder-decoder architecture: the CV component treats the document as an image and produces a representation that a transformer then processes.

Vision transformers (ViTs), an emerging alternative to convolutional neural networks (CNNs), split an image into patches, project each patch into a linear embedding, and feed the resulting sequence into a transformer encoder (a minimal sketch of this step follows below). ViTs capture global relations in an image better than CNNs and are more resilient to adversarial attacks, but because they have fewer inductive biases they require more training data, and they are computationally intensive; pre-training on large datasets mitigates some of these challenges.

Hugging Face's Vision Encoder Decoder models and Donut illustrate how these ideas apply to document understanding: they map an input image directly to a structured output such as JSON, which suits tasks like processing receipts. Unlike models such as LayoutLMv3 that rely on a preprocessing step, Donut's direct conversion produces no bounding box information, so the extracted data lacks location context. The Unstructured team is developing pipelines that use Donut to extract structured data from documents and plans to release these models soon on platforms like GitHub and Hugging Face.
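The patch-embedding step described above can be sketched in a few lines of PyTorch. This is an illustrative toy, not code from the post: the patch size, embedding dimension, and encoder depth are arbitrary choices, and the strided convolution is simply a standard way to split an image into patches and apply the same linear projection to each one.

```python
# Minimal sketch of the ViT input pipeline: split an image into fixed-size
# patches, project each patch to a linear embedding, prepend a [CLS] token,
# add position embeddings, and hand the sequence to a transformer encoder.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # Strided conv = "cut into patches and linearly project each patch".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, images):                      # (batch, 3, 224, 224)
        x = self.proj(images)                       # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)            # (batch, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)              # (batch, 197, 768)
        return x + self.pos_embed                   # add learned positions

patches = PatchEmbedding()(torch.randn(2, 3, 224, 224))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
print(encoder(patches).shape)  # torch.Size([2, 197, 768])
```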
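For the image-to-JSON path, the sketch below shows how a Donut-style Vision Encoder Decoder checkpoint can be run with the Hugging Face transformers library, roughly following the library's documented usage. The checkpoint name, task prompt, and file path are illustrative assumptions, not details taken from the post.

```python
# Hedged example: run a Donut VisionEncoderDecoder model on a receipt image
# and decode the generated token sequence into JSON. No OCR and no bounding
# boxes are involved; the model reads pixels and emits structure directly.
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"  # example receipt model
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("receipt.png").convert("RGB")            # hypothetical input file
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is prompted with a task-specific start token, then generates the
# structured output token by token.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task prompt
print(processor.token2json(sequence))  # nested dict, e.g. line items and totals
```

Note that the output is a nested dictionary of fields and values only; because the model never localizes text, there is no way to recover where on the page each value came from, which is the trade-off the post highlights relative to preprocessing-based models like LayoutLMv3.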