Content Deep Dive

Speeding Up Vision Transformers

Blog post from Unstructured

Post Details
Company: Unstructured
Date Published:
Author: Unstructured
Word Count: 730
Language: English
Hacker News Points: -
Summary

In document understanding, working with larger images calls for algorithmic improvements to keep vision transformers (ViTs) efficient, because self-attention scales quadratically with the number of input tokens. Standard optimizations such as quantization and pruning can roughly double processing speed, and hybrid designs like EfficientFormer combine convolutional layers with transformers for further gains, though at some cost in accuracy relative to heavier networks. Swin-style ViTs improve efficiency by restricting attention to local windows of patches, and sparse attention approaches cut the computational complexity substantially, though their practical effectiveness remains uncertain. Other approaches, such as Performer, achieve linear complexity by approximating the softmax kernel and reordering the matrix products in the attention calculation, but still trail more advanced models like Swin ViT in accuracy. Knowledge distillation from large networks into more efficient ones could close the remaining gap, and smaller models could first locate the main text regions before passing them to a decoder transformer. The Unstructured team is actively exploring these methods to build faster ViT-based document preprocessing, and invites readers to follow its ongoing work on LinkedIn, Huggingface, and GitHub.
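
The quantization and pruning mentioned above are not detailed in this summary. As an illustration only (not code from Unstructured or from the original post), the sketch below applies PyTorch's post-training dynamic quantization to the linear layers of a pretrained ViT; the checkpoint name is an assumption chosen for the example.

import torch
from transformers import ViTModel

# Illustrative checkpoint; any ViT-style encoder would do.
model = ViTModel.from_pretrained("google/vit-base-patch16-224")
model.eval()

# Post-training dynamic quantization: Linear weights are stored as int8 and
# dequantized on the fly, which shrinks the model and usually speeds up CPU
# inference. The actual speedup depends on hardware and workload.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

Pruning follows a similar post-training pattern (for example via torch.nn.utils.prune), but both techniques leave the quadratic attention cost untouched, which is why the post also considers architectural changes.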
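
To make the Performer point concrete: the quadratic cost comes from materializing the n-by-n attention matrix, and kernelized attention avoids it by reordering the matrix products. The following is a minimal sketch, assuming a simple positive feature map in place of Performer's FAVOR+ random features; it is not the Performer implementation or anything taken from the post.

import numpy as np

def standard_attention(Q, K, V):
    # Full softmax attention: the (n, n) score matrix makes the cost quadratic
    # in the number of tokens n. Shown only for contrast.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                 # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                      # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernelized attention: replace the softmax kernel with a feature map phi
    # (here a ReLU-based map standing in for Performer's random features),
    # then compute phi(K).T @ V first so the (n, n) matrix is never built.
    Qp, Kp = phi(Q), phi(K)                                 # (n, r) each
    KV = Kp.T @ V                                           # (r, d), linear in n
    Z = Qp @ Kp.sum(axis=0)                                 # (n,) normalizer
    return (Qp @ KV) / Z[:, None]                           # (n, d)

n, d = 4096, 64                                             # patch tokens, head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)                             # O(n) in tokens, no (n, n) intermediate

Larger document images mean more patch tokens n, so moving from the quadratic path to the reordered linear path is where the "rearranging matrices" claim pays off; the approximation quality, not the complexity, is why such models can still trail Swin-style windowed attention.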