
Speeding Up Vision Transformers

Blog post from Unstructured

Post Details
Company: Unstructured
Date Published:
Author:
Word Count: 730
Language: English
Hacker News Points: -
Summary

In document understanding, the central challenge is processing larger images without losing information. While hardware-dependent optimizations exist, Unstructured focuses on algorithmic improvements to vision transformers (ViTs). ViTs split an image into patches and feed the resulting sequence to a transformer, so they inherit self-attention's quadratic cost in input length: larger images mean more patches, and the attention cost grows with the square of the patch count.

Existing approaches to speeding up ViTs include quantization, pruning, and architectures adapted to limited-compute environments such as mobile devices, exemplified by EfficientFormer. Techniques like sparse attention matrices and matrix decomposition offer further reductions in computational complexity. Although some of these methods have not yet been fully evaluated for document understanding, combining vision transformer approaches with recent advances in reducing the cost of attention calculation may prove beneficial. The Unstructured team is exploring several strategies, including knowledge distillation and using simpler models to locate the main text regions, to develop vision transformers efficient enough for real-world document preprocessing.
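The scaling problem described above can be sketched with a little arithmetic. The snippet below is illustrative only (not Unstructured's code); the function names, the 16-pixel patch size, and the 768-dimensional embedding are assumptions matching a standard ViT-Base configuration.

```python
# Sketch: how ViT patch count, and hence self-attention cost,
# scales with image size. Illustrative assumptions: 16px patches,
# 768-dim embeddings (typical ViT-Base defaults).

def patch_count(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patches for a square image."""
    assert image_size % patch_size == 0
    per_side = image_size // patch_size
    return per_side * per_side

def attention_flops(n_tokens: int, dim: int = 768) -> int:
    """Rough FLOP count for one self-attention layer: O(n^2 * d)."""
    return n_tokens * n_tokens * dim

small = patch_count(224)   # 224x224 image -> 196 patches
large = patch_count(448)   # 448x448 image -> 784 patches (4x tokens)

# Doubling each image side quadruples the token count, but the
# attention cost grows ~16x, since it is quadratic in sequence length.
ratio = attention_flops(large) / attention_flops(small)
print(small, large, ratio)  # 196 784 16.0
```

This is why halving the effective sequence length (e.g. via sparse attention or cropping to the main text region with a cheaper model) pays off quadratically rather than linearly.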