Cosmos 3: Evaluation for Vision Use Cases
Blog post from Roboflow
Cosmos 3 is a foundation model for physical AI that handles vision reasoning and multimodal generation across text, image, video, sound, and action. It is released under the OpenMDW 1.1 license in two variants, Super (32B) and Nano (8B), and is available on GitHub. The model works well on fixed-camera footage: in tests involving an airport gate, a warehouse, and a kitchen assembly line, it segmented and tracked slow-changing states more effectively than fast-moving actions. Cosmos 3 performs well on VANTAGE-Bench, but challenges remain with fast actions and with scenes that contain many similar small objects, which makes scene framing and spatial grounding important. Reliable operational use still requires iterations of data labeling, training, and deployment, especially in complex environments such as kitchens, where ingredient placement and timing are critical.