
An Introduction to Vision Transformers for Document Understanding

Blog post from Unstructured

Post Details
Company: Unstructured
Date Published:
Author:
Word Count: 592
Language: English
Hacker News Points: -
Summary

Document understanding pipelines combine computer vision (CV) and natural language processing (NLP) in an encoder-decoder architecture: the CV component treats the document as an image and produces a representation that a transformer then processes.

Vision transformers (ViTs), an emerging alternative to convolutional neural networks (CNNs), split an image into patches, project each patch into a linear embedding, and feed the resulting sequence into a transformer encoder (a minimal sketch of this step follows below). ViTs capture global relations in an image better than CNNs and are more resilient to adversarial attacks, but because they have fewer inductive biases they require more training data, and they are computationally intensive; pre-training on large datasets mitigates some of these challenges.

Hugging Face's Vision Encoder Decoder models and Donut illustrate how these ideas apply to document understanding: they map an input image directly to a structured output such as JSON, which suits tasks like processing receipts. Unlike models such as LayoutLMv3 that rely on a preprocessing step, Donut's direct conversion produces no bounding box information, so the extracted data lacks location context. The Unstructured team is developing pipelines that use Donut to extract structured data from documents and plans to release these models soon on platforms like GitHub and Hugging Face.
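The patch-embedding step described above can be sketched in a few lines of PyTorch. This is an illustrative toy, not code from the post: the patch size, embedding dimension, and encoder depth are arbitrary choices, and the strided convolution is simply a standard way to split an image into patches and apply the same linear projection to each one.

```python
# Minimal sketch of the ViT input pipeline: split an image into fixed-size
# patches, project each patch to a linear embedding, prepend a [CLS] token,
# add position embeddings, and hand the sequence to a transformer encoder.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # Strided conv = "cut into patches and linearly project each patch".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, images):                      # (batch, 3, 224, 224)
        x = self.proj(images)                       # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)            # (batch, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)              # (batch, 197, 768)
        return x + self.pos_embed                   # add learned positions

patches = PatchEmbedding()(torch.randn(2, 3, 224, 224))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
print(encoder(patches).shape)  # torch.Size([2, 197, 768])
```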
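For the image-to-JSON path, the sketch below shows how a Donut-style Vision Encoder Decoder checkpoint can be run with the Hugging Face transformers library, roughly following the library's documented usage. The checkpoint name, task prompt, and file path are illustrative assumptions, not details taken from the post.

```python
# Hedged example: run a Donut VisionEncoderDecoder model on a receipt image
# and decode the generated token sequence into JSON. No OCR and no bounding
# boxes are involved; the model reads pixels and emits structure directly.
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"  # example receipt model
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

image = Image.open("receipt.png").convert("RGB")            # hypothetical input file
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is prompted with a task-specific start token, then generates the
# structured output token by token.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)

sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task prompt
print(processor.token2json(sequence))  # nested dict, e.g. line items and totals
```

Note that the output is a nested dictionary of fields and values only; because the model never localizes text, there is no way to recover where on the page each value came from, which is the trade-off the post highlights relative to preprocessing-based models like LayoutLMv3.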