An Introduction to Vision Transformers for Document Understanding
Blog post from Unstructured
Document understanding models combine computer vision (CV) and natural language processing (NLP) in an encoder-decoder pipeline: a document is treated as an input image, encoded into a visual representation, and passed to a multimodal transformer that produces the desired output.

Vision transformers (ViTs), whose architecture mirrors NLP transformers such as BERT, have emerged as an alternative to traditional convolutional neural networks (CNNs). ViTs are better at capturing global relationships across an image and are more robust to adversarial attacks, but they typically require more training data and compute than CNNs.

HuggingFace's Vision Encoder Decoder models pair a ViT-style encoder with a text decoder, and they underpin document understanding models such as Donut. Donut maps an input image directly to a structured output, such as JSON, without preprocessing steps like OCR. It does not produce bounding box information, but this end-to-end design keeps the pipeline simple and efficient. Donut is part of ongoing efforts to extract structured data from receipts and invoices, and the resulting models will soon be available on platforms like GitHub and HuggingFace.
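To make the OCR-free workflow concrete, here is a minimal sketch that runs a publicly released Donut checkpoint through HuggingFace's VisionEncoderDecoderModel API. The checkpoint name (naver-clova-ix/donut-base-finetuned-cord-v2, fine-tuned on the CORD receipt dataset), its task prompt, and the image path are illustrative assumptions, not the specific models discussed in this post.

```python
import re

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Load a pretrained Donut checkpoint fine-tuned for receipt parsing (assumed checkpoint).
checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Open the document image (placeholder path) and convert it to the pixel values
# expected by the ViT-style encoder -- no OCR step is involved.
image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# Donut is prompted with a task-specific start token instead of extracted text.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

# Autoregressively generate the structured token sequence.
outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Strip special tokens and the task prompt, then convert the sequence to JSON.
sequence = processor.batch_decode(outputs.sequences)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))
```

The important detail is that the only input to the model is pixel_values: the decoder is steered by a task token rather than OCR output, and the generated token sequence is decoded directly into a JSON-like structure.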