Florence-2: Vision-language Model

Blog post from Roboflow

Post Details
Author: Piotr Skalski
Word Count: 993
Language: English
Summary

Florence-2 is an open-source vision-language model developed by Microsoft. Despite its compact size, it handles captioning, object detection, grounding, and segmentation, and it rivals larger models such as Kosmos-2. Rather than relying on a separate model per task, it uses a unified representation: a single model covers more than ten tasks, each framed as generating a text response from an image and a task prompt. Training is supported by the FLD-5B dataset, which contains 126 million images and 5.4 billion annotations. Architecturally, the model pairs a DaViT vision encoder with a transformer-based multi-modal encoder-decoder. Because of its small parameter count, Florence-2 runs efficiently on both CPU and GPU, which makes it a candidate for mobile devices and other real-world deployments. Its performance is attributed to the integration of spatial hierarchy (from image-level to pixel-level understanding) and semantic granularity (from coarse captions to fine-grained descriptions), and on various benchmarks it outperforms larger models in zero-shot settings.
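
To make the unified representation concrete, here is a minimal sketch of how a detection result could be recovered from plain generated text. It assumes the output interleaves class labels with location tokens of the form `<loc_N>`, where each coordinate is quantized into 1,000 bins, as the Florence-2 paper describes. The helper name `parse_detections`, the `Detection` type, and the bin-centre dequantization are illustrative assumptions, not the official post-processing code.

```python
import re
from typing import List, NamedTuple, Tuple

# Assumption: coordinates are quantized into 1,000 bins per axis.
NUM_BINS = 1000

# Hypothetical container for one parsed object.
class Detection(NamedTuple):
    label: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

# A label (any text without angle brackets) followed by exactly four location tokens.
_OBJECT = re.compile(r"([^<>]+)((?:<loc_\d+>){4})")
_LOC = re.compile(r"<loc_(\d+)>")

def parse_detections(text: str, width: int, height: int) -> List[Detection]:
    """Turn 'label<loc_a><loc_b><loc_c><loc_d>...' into pixel-space boxes."""
    detections = []
    for match in _OBJECT.finditer(text):
        label = match.group(1).strip()
        x0, y0, x1, y1 = (int(b) for b in _LOC.findall(match.group(2)))
        # Assumption: map each bin index to the centre of its bin.
        box = (
            (x0 + 0.5) * width / NUM_BINS,
            (y0 + 0.5) * height / NUM_BINS,
            (x1 + 0.5) * width / NUM_BINS,
            (y1 + 0.5) * height / NUM_BINS,
        )
        detections.append(Detection(label, box))
    return detections

if __name__ == "__main__":
    sample = "car<loc_100><loc_200><loc_500><loc_800>person<loc_0><loc_0><loc_999><loc_999>"
    for det in parse_detections(sample, width=1000, height=500):
        print(det.label, det.box)
```

The point of the sketch is that detection, grounding, and captioning can all share one decoder because boxes are just additional vocabulary tokens. In practice the released checkpoints ship with their own processor that performs this step, so a parser like this is only useful for understanding the format.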