First Impressions with Google’s Gemini
Blog post from Roboflow
In December 2023, Google introduced Gemini, a new Large Multimodal Model (LMM) capable of processing text, images, and audio. Gemini was first integrated into Bard for text, with full multimodality planned, and an API was released so developers can build Gemini into their applications. The Roboflow team compared Gemini with other LMMs, including GPT-4 with Vision, LLaVA, and CogVLM, across several computer vision tasks: Visual Question Answering (VQA), Optical Character Recognition (OCR), document OCR, and object detection. Gemini handled some tasks well, such as counting the coins in an image and recognizing a movie scene, but it struggled with others, including precise OCR and returning accurate object detection coordinates. Overall, performance was mixed, and the team also ran into technical issues during testing. Gemini joins a growing field of multimodal models and reflects Google's effort to advance LMM capabilities in a competitive landscape.
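The summary above does not say how the object detection coordinates were judged. One common way to check a predicted bounding box against a hand-labeled one is Intersection over Union (IoU). The sketch below is a minimal illustration under that assumption, not the method used in the post: the box format `(x_min, y_min, x_max, y_max)`, the helper names `iou` and `is_correct_detection`, and the 0.5 threshold are all hypothetical choices made for this example.

```python
from typing import Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes count as zero area."""
    width = max(0.0, box[2] - box[0])
    height = max(0.0, box[3] - box[1])
    return width * height


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes, in [0, 1]."""
    # Corners of the overlapping region, if the boxes overlap at all.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    intersection = box_area((ix_min, iy_min, ix_max, iy_max))
    union = box_area(box_a) + box_area(box_b) - intersection

    # Two zero-area boxes have no meaningful overlap; report 0.0.
    return intersection / union if union > 0 else 0.0


def is_correct_detection(predicted: Box, ground_truth: Box,
                         threshold: float = 0.5) -> bool:
    """Treat a prediction as correct when IoU meets the (assumed) threshold."""
    return iou(predicted, ground_truth) >= threshold
```

With a helper like this, the coordinates a model returns for an object can be scored against a manually drawn box, which turns "struggled with coordinates" into a number that can be compared across models.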