Content Deep Dive

Speeding Up Vision Transformers

Blog post from Unstructured

Post Details
Company: Unstructured
Date Published:
Author: Unstructured
Word Count: 730
Language: English
Hacker News Points: -
Summary

In document understanding, working with larger images calls for algorithmic improvements to keep vision transformers (ViTs) efficient, because self-attention scales quadratically with the number of input tokens. Standard optimizations such as quantization and pruning can roughly double processing speed, and hybrid designs like EfficientFormer combine convolutional layers with transformers for further gains, though at some cost in accuracy relative to heavier networks. Swin-style ViTs improve efficiency by restricting attention to local windows of patches, and sparse attention approaches cut the computational complexity substantially, though their practical effectiveness remains uncertain. Other approaches, such as Performer, achieve linear complexity by approximating the softmax kernel and reordering the matrix products in the attention calculation, but still trail more advanced models like Swin ViT in accuracy. Knowledge distillation from large networks into more efficient ones could close the remaining gap, and smaller models could first locate the main text regions before passing them to a decoder transformer. The Unstructured team is actively exploring these methods to build faster ViT-based document preprocessing, and invites readers to follow its ongoing work on LinkedIn, Huggingface, and GitHub.
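
The quantization and pruning mentioned above are not detailed in this summary. As an illustration only (not code from Unstructured or from the original post), the sketch below applies PyTorch's post-training dynamic quantization to the linear layers of a pretrained ViT; the checkpoint name is an assumption chosen for the example.

import torch
from transformers import ViTModel

# Illustrative checkpoint; any ViT-style encoder would do.
model = ViTModel.from_pretrained("google/vit-base-patch16-224")
model.eval()

# Post-training dynamic quantization: Linear weights are stored as int8 and
# dequantized on the fly, which shrinks the model and usually speeds up CPU
# inference. The actual speedup depends on hardware and workload.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

Pruning follows a similar post-training pattern (for example via torch.nn.utils.prune), but both techniques leave the quadratic attention cost untouched, which is why the post also considers architectural changes.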
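
To make the Performer point concrete: the quadratic cost comes from materializing the n-by-n attention matrix, and kernelized attention avoids it by reordering the matrix products. The following is a minimal sketch, assuming a simple positive feature map in place of Performer's FAVOR+ random features; it is not the Performer implementation or anything taken from the post.

import numpy as np

def standard_attention(Q, K, V):
    # Full softmax attention: the (n, n) score matrix makes the cost quadratic
    # in the number of tokens n. Shown only for contrast.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                 # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                      # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized attention: replace the softmax kernel with a feature map phi
    # (here a ReLU-based map standing in for Performer's random features),
    # then compute phi(K).T @ V first so the (n, n) matrix is never built.
    Qp, Kp = phi(Q), phi(K)                                 # (n, r) each
    KV = Kp.T @ V                                           # (r, d), linear in n
    Z = Qp @ Kp.sum(axis=0)                                 # (n,) normalizer
    return (Qp @ KV) / Z[:, None]                           # (n, d)

n, d = 4096, 64                                             # patch tokens, head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)                             # O(n) in tokens, no (n, n) intermediate

Larger document images mean more patch tokens n, so moving from the quadratic path to the reordered linear path is where the "rearranging matrices" claim pays off; the approximation quality, not the complexity, is why such models can still trail Swin-style windowed attention.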