Florence-2 is a vision foundation model from Microsoft designed to handle the diversity of tasks in computer vision and vision-language work. It is trained with multitask learning on extensive visual annotations and uses a unified, prompt-based representation across vision tasks. The model adopts a sequence-to-sequence architecture that combines an image encoder with a multi-modality encoder-decoder, so it accommodates a spectrum of vision tasks without task-specific architectural modifications. Florence-2 works in both zero-shot and fine-tuned settings, setting new state-of-the-art results on tasks such as captioning, object detection, visual grounding, and referring expression comprehension. It surpasses models such as PolyFormer and UNINEXT in both performance and efficiency, which illustrates the potential of multitask learning and of combining textual and visual information.
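To make the idea of a unified, prompt-based representation more concrete, the following is a minimal Python sketch of how a region-level task such as object detection can be expressed as plain text, so that the same sequence-to-sequence decoder can serve captioning and detection alike. It is not Florence-2's actual implementation. The task prompt strings, the `<loc_N>` token format, the 1000-bin quantization, and the helper names `box_to_tokens` and `tokens_to_boxes` are all assumptions made for illustration.

```python
import re

# Assumption: coordinates are quantized into a fixed number of bins, each bin
# having its own "location token". The value 1000 is illustrative.
NUM_BINS = 1000

# Assumption: every task is selected by a text prompt rather than a separate
# model head. These prompt strings are illustrative.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
}

_REGION = re.compile(r"([^<>]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def _quantize(value, size):
    """Map a pixel coordinate in [0, size] to a bin index in [0, NUM_BINS - 1]."""
    return min(NUM_BINS - 1, max(0, int(value * NUM_BINS / size)))


def box_to_tokens(label, box, width, height):
    """Serialize one labelled box (x0, y0, x1, y1 in pixels) as a text target."""
    x0, y0, x1, y1 = box
    bins = [
        _quantize(x0, width),
        _quantize(y0, height),
        _quantize(x1, width),
        _quantize(y1, height),
    ]
    return label + "".join(f"<loc_{b}>" for b in bins)


def tokens_to_boxes(text, width, height):
    """Parse generated text back into (label, box) pairs using bin centers."""
    results = []
    for match in _REGION.finditer(text):
        label = match.group(1).strip()
        bx0, by0, bx1, by1 = (int(g) for g in match.groups()[1:])
        box = (
            (bx0 + 0.5) / NUM_BINS * width,
            (by0 + 0.5) / NUM_BINS * height,
            (bx1 + 0.5) / NUM_BINS * width,
            (by1 + 0.5) / NUM_BINS * height,
        )
        results.append((label, box))
    return results


if __name__ == "__main__":
    prompt = TASK_PROMPTS["object_detection"]
    target = box_to_tokens("car", (64, 120, 320, 360), width=640, height=480)
    print(prompt, "->", target)
    print(tokens_to_boxes(target, width=640, height=480))
```

Under these assumptions, a detection target and a caption are both ordinary token sequences, which is why no task-specific head is needed. Quantization is lossy, so decoded boxes match the originals only to within one bin width.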