DINOv3: An Advanced Self-Supervised Vision Foundation Model by Meta
Blog post from Roboflow
DINOv3, developed by Meta, is the latest model in the DINO series of self-supervised vision foundation models. It learns general-purpose visual representations through self-supervised learning (SSL), with no human annotations, and achieves state-of-the-art performance across a variety of computer vision tasks, often with minimal fine-tuning.

Compared with its predecessors, DINOv3 scales up both the dataset and the model size and introduces innovations such as Gram Anchoring to improve feature quality and semantic coherence. Its architecture builds on a teacher-student setup with a vision transformer backbone, which helps it learn detailed, transferable features. Pre-trained on large curated datasets, it serves as a robust backbone for image classification, object detection, semantic segmentation, and more, and it performs well in few-shot and zero-shot settings.

The code and pre-trained weights are available under a custom license, and the model can be accessed through platforms such as Roboflow and Hugging Face for training and deployment. Overall, DINOv3 marks a significant step forward for self-supervised learning and sets new benchmarks in computer vision.
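
To make the Gram Anchoring idea mentioned above more concrete, here is a minimal sketch in plain Python. It assumes the loss compares the Gram matrix (pairwise patch similarities) of the student's L2-normalized patch features with that of an earlier "anchor" teacher checkpoint, using a squared Frobenius norm. The function names (`l2_normalize`, `gram_matrix`, `gram_anchoring_loss`) and the choice to sum rather than average are illustrative assumptions, not Meta's implementation.

```python
import math


def l2_normalize(rows):
    """Scale each patch feature vector to unit length (zero vectors are left as-is)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def gram_matrix(rows):
    """Return the P x P matrix of dot products between all pairs of patch features."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def gram_anchoring_loss(student_patches, anchor_patches):
    """Squared Frobenius distance between student and anchor Gram matrices.

    Both inputs are lists of P patch feature vectors for the same image.
    Assumption: the loss is summed over all P x P entries, not averaged.
    """
    gram_student = gram_matrix(l2_normalize(student_patches))
    gram_anchor = gram_matrix(l2_normalize(anchor_patches))
    return sum(
        (s - a) ** 2
        for row_s, row_a in zip(gram_student, gram_anchor)
        for s, a in zip(row_s, row_a)
    )


# Two orthogonal student patches vs. two identical anchor patches:
# the off-diagonal similarities differ, so the loss is positive.
student = [[1.0, 0.0], [0.0, 1.0]]
anchor = [[1.0, 0.0], [1.0, 0.0]]
loss = gram_anchoring_loss(student, anchor)
```

Because the loss depends only on pairwise similarities between patches, the student's features remain free to change as long as the similarity structure between patches stays close to the anchor's. This is one way to read the claim that Gram Anchoring preserves feature quality and semantic coherence.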