
Florence-2: Microsoft's New Foundation Model Explained

Blog post from Encord

Post Details
Company: Encord
Date Published:
Author: Akruti Acharya
Word Count: 1,364
Language: English
Summary

Florence-2 is a vision foundation model from Microsoft, designed to handle the diversity of computer vision and vision-language tasks with a single model. It is trained with multitask learning on extensive visual annotations and uses a unified, prompt-based representation, so that different vision tasks are specified through text prompts. Architecturally, it is a sequence-to-sequence model that combines an image encoder with a multi-modality encoder-decoder, which lets it cover a range of vision tasks without task-specific architectural modifications. The model is reported to work in both zero-shot and fine-tuned settings, setting new state-of-the-art results in tasks such as captioning, object detection, visual grounding, and referring expression comprehension. According to the post, its performance and efficiency surpass those of models such as PolyFormer and UNINEXT, which illustrates the potential of multitask learning and of combining textual and visual information.
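To make the idea of a unified, prompt-based representation more concrete, here is a minimal sketch of one way a sequence-to-sequence model could express object detection as plain text: each bounding box is quantized and written out as location tokens next to its label. Everything in it is an assumption made for illustration and is not taken from the post: the helper names (`quantize`, `box_to_tokens`, `detection_target`, `build_example`), the `<OD>` prompt string, the `<loc_N>` token format, and the choice of 1000 bins. It should not be read as Microsoft's implementation.

```python
from typing import Dict, List, Tuple

# Assumption for illustration: coordinates are quantized into 1000 bins.
NUM_BINS = 1000

# A box is (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def quantize(value: float, size: int, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value / size * num_bins)
    # Clamp so that a coordinate on the far edge still falls in the last bin.
    return max(0, min(num_bins - 1, bin_index))


def box_to_tokens(box: Box, width: int, height: int) -> str:
    """Write one box as four hypothetical location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    return "".join(f"<loc_{b}>" for b in bins)


def detection_target(objects: List[Tuple[str, Box]], width: int, height: int) -> str:
    """Serialize (label, box) pairs into a single target text sequence."""
    return "".join(label + box_to_tokens(box, width, height) for label, box in objects)


def build_example(
    task_prompt: str, objects: List[Tuple[str, Box]], width: int, height: int
) -> Dict[str, str]:
    """Pair a task prompt with its text target, as a seq2seq training example."""
    return {
        "prompt": task_prompt,
        "target": detection_target(objects, width, height),
    }
```

Calling `build_example("<OD>", [("cat", (160, 120, 480, 360))], 640, 480)` yields a prompt and a target that are both ordinary strings. In this sketch, switching to another task such as captioning would only mean a different prompt and a different target string, while the model architecture stays the same.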