
Speeding Up Vision Transformers

Blog post from Unstructured

Post Details
Company: Unstructured
Date Published:
Author:
Word Count: 730
Language: English
Hacker News Points: -
Summary

In document understanding, the central challenge is processing larger images without losing information. While hardware-dependent optimizations exist, Unstructured focuses on algorithmic improvements to vision transformers (ViTs). ViTs split an image into patches and feed the resulting sequence to a transformer, so they inherit self-attention's quadratic cost in input length: larger images mean more patches, and the attention cost grows with the square of the patch count.

Existing approaches to speeding up ViTs include quantization, pruning, and architectures adapted to limited-compute environments such as mobile devices, exemplified by EfficientFormer. Techniques like sparse attention matrices and matrix decomposition offer further reductions in computational complexity. Although some of these methods have not yet been fully evaluated for document understanding, combining vision transformer approaches with recent advances in reducing the cost of attention calculation may prove beneficial. The Unstructured team is exploring several strategies, including knowledge distillation and using simpler models to locate the main text regions, to develop vision transformers efficient enough for real-world document preprocessing.
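The scaling problem described above can be sketched with a little arithmetic. The snippet below is illustrative only (not Unstructured's code); the function names, the 16-pixel patch size, and the 768-dimensional embedding are assumptions matching a standard ViT-Base configuration.

```python
# Sketch: how ViT patch count, and hence self-attention cost,
# scales with image size. Illustrative assumptions: 16px patches,
# 768-dim embeddings (typical ViT-Base defaults).

def patch_count(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patches for a square image."""
    assert image_size % patch_size == 0
    per_side = image_size // patch_size
    return per_side * per_side

def attention_flops(n_tokens: int, dim: int = 768) -> int:
    """Rough FLOP count for one self-attention layer: O(n^2 * d)."""
    return n_tokens * n_tokens * dim

small = patch_count(224)   # 224x224 image -> 196 patches
large = patch_count(448)   # 448x448 image -> 784 patches (4x tokens)

# Doubling each image side quadruples the token count, but the
# attention cost grows ~16x, since it is quadratic in sequence length.
ratio = attention_flops(large) / attention_flops(small)
print(small, large, ratio)  # 196 784 16.0
```

This is why halving the effective sequence length (e.g. via sparse attention or cropping to the main text region with a cheaper model) pays off quadratically rather than linearly.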