First Impressions with Google’s Gemini

Post Details

Company

Roboflow

Date Published

Dec. 13, 2023

Author

James Gallagher

Word Count

1,387

Language

English

Hacker News Points

-

Source URL

blog.roboflow.com/first-impressions-with-google-gemini

Summary

In December 2023, Google introduced Gemini, a new Large Multimodal Model (LMM) capable of processing text, images, and audio. At launch, Gemini's text capabilities were integrated into Bard, with full multimodality planned for later, and Google released an API so developers could integrate Gemini into their applications. The Roboflow team compared Gemini with other LMMs, including GPT-4 with Vision, LLaVA, and CogVLM, across several computer vision tasks: Visual Question Answering (VQA), Optical Character Recognition (OCR), document OCR, and object detection. Gemini handled some tasks successfully, such as counting coins and recognizing movie scenes, but struggled with others, including precise OCR and returning accurate object detection coordinates. The team also ran into technical issues during testing, so the overall results were mixed. Gemini joins a growing field of multimodal models and reflects Google's effort to compete in LMM capabilities.