RynnEC: Bringing MLLMs into Embodied World
Blog post from Hugging Face
RynnEC is a multimodal large language model (MLLM) developed by Alibaba DAMO Academy, designed to strengthen embodied cognition through video-centric object and spatial understanding. Unlike models trained primarily on internet-scale image data, RynnEC focuses on egocentric video to build the fine-grained visual understanding and spatial awareness that real-world robotic tasks demand.

The model requires no explicit 3D inputs: from RGB video alone, it grounds user queries as semantic masks, which makes it straightforward to integrate into embodied agents. Its training is backed by a scalable data pipeline that converts raw video into a range of embodied cognition tasks, including object captioning and spatial reasoning, drawing on 20,000 videos recorded in diverse home environments.

Trained through a structured four-stage process, RynnEC shows substantial gains in object and spatial cognition, achieving top results on the RynnEC-Bench benchmark and outperforming other advanced MLLMs such as Gemini 2.5 Pro. These capabilities position RynnEC as a strong foundation for more interactive, cognitively capable robots in complex real-world settings.
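To make the query-to-mask interface concrete, here is a minimal Python sketch of the input/output contract: RGB video frames plus a text query in, per-frame binary masks out. The `RynnECModel` class and `predict_mask` method are illustrative stand-ins, not the released RynnEC API; the body is a dummy implementation that only demonstrates the shapes involved.

```python
# Minimal sketch of the query-to-mask interface described above.
# RynnECModel is a hypothetical stand-in, NOT the released RynnEC API:
# it illustrates the contract (RGB frames + text query -> per-frame
# binary masks) with a dummy implementation.

from dataclasses import dataclass
import numpy as np


@dataclass
class MaskPrediction:
    query: str
    masks: np.ndarray  # (num_frames, H, W) boolean masks, one per frame


class RynnECModel:
    """Stand-in for a video MLLM that grounds text queries as masks."""

    def predict_mask(self, frames: np.ndarray, query: str) -> MaskPrediction:
        # A real model would encode the egocentric clip and the query,
        # then decode a semantic mask; here we return empty masks of the
        # right shape just to show the interface.
        num_frames, height, width, _ = frames.shape
        return MaskPrediction(
            query=query,
            masks=np.zeros((num_frames, height, width), dtype=bool),
        )


if __name__ == "__main__":
    # Eight dummy 480x640 RGB frames standing in for an egocentric clip.
    clip = np.zeros((8, 480, 640, 3), dtype=np.uint8)
    model = RynnECModel()
    pred = model.predict_mask(clip, "the red mug on the kitchen counter")
    print(pred.query, pred.masks.shape)  # -> ... (8, 480, 640)
```

Because the interface is just video frames and natural-language text, an embodied agent can treat the model as a drop-in perception module: no depth sensors or point clouds are required upstream, which is the practical payoff of the RGB-only design noted above.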