Gemma 3: Multimodal and Vision Analysis
Blog post from Roboflow
Gemma 3, the latest in Google's series of multimodal language models, offers enhanced capabilities for tasks involving both text and images, such as visual question answering, document optical character recognition (OCR), and object counting. Released in four sizes, from 1B to 27B parameters, Gemma 3 supports a 128K token context window—significantly larger than its predecessors—which facilitates the processing of extensive text and multiple images simultaneously. The model's proficiency was demonstrated in tests where it successfully completed six out of seven tasks, only faltering on zero-shot object detection. Notably, larger versions of Gemma 3 are trained with multilingual data, making them suitable for non-English applications. This model is accessible via platforms like Kaggle and Hugging Face, with instruction-tuned checkpoints available for guided interactions.