
RynnEC: Bringing MLLMs into Embodied World

Blog post from HuggingFace

Post Details
Company
HuggingFace
Date Published
-
Author
Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin, Zhikai Wang, and Xin Li
Word Count
1,382
Language
-
Hacker News Points
-
Summary

RynnEC is a multimodal large language model (MLLM) developed by Alibaba DAMO Academy to strengthen embodied cognition through video-centric object and spatial understanding. Unlike models trained mainly on internet-scale images, RynnEC is built around egocentric video, targeting the fine-grained visual understanding and spatial awareness that real-world robotic tasks require. It needs no explicit 3D inputs: from RGB video alone, it grounds user queries as semantic masks, which makes it straightforward to integrate into embodied agents. Its training relies on a scalable data pipeline that turns roughly 20,000 raw videos from diverse home environments into training data for a range of embodied cognition tasks, including object captioning and spatial reasoning. After a structured, four-stage training process, RynnEC shows marked gains in object and spatial cognition, outperforming other advanced MLLMs such as Gemini-2.5 Pro on the RynnEC-Bench benchmark. These results position RynnEC as a strong foundation for more interactive and cognitively capable robots in complex real-world scenarios.
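To make the query-to-mask interface described above concrete, here is a minimal sketch of what inference could look like: an egocentric RGB clip and a natural-language query go in, per-frame semantic masks come out. The function names, the `DummyModel` stand-in, and the tensor shapes are assumptions for illustration only; the blog post does not specify RynnEC's actual API, so consult the official repository for the real interface.

```python
# Hypothetical sketch of RynnEC-style inference: egocentric RGB video in,
# natural-language query in, per-frame semantic masks out. All names here
# are illustrative placeholders, not the real RynnEC API.

import numpy as np


def segment_from_query(model, frames: np.ndarray, query: str) -> np.ndarray:
    """Ground a language query as binary masks over video frames.

    frames: (T, H, W, 3) uint8 RGB clip -- no depth or explicit 3D input.
    query:  e.g. "the mug on the kitchen counter"
    returns: (T, H, W) boolean masks marking the referred object.
    """
    # A real model would encode frames + text and decode mask logits;
    # this sketch only shows the expected input/output shapes.
    logits = model.predict(frames, query)  # placeholder call, shape (T, H, W)
    return logits > 0.5


class DummyModel:
    """Stand-in so the sketch runs end to end without real weights."""

    def predict(self, frames: np.ndarray, query: str) -> np.ndarray:
        t, h, w, _ = frames.shape
        return np.random.rand(t, h, w)


if __name__ == "__main__":
    clip = np.zeros((8, 480, 640, 3), dtype=np.uint8)  # 8-frame RGB clip
    masks = segment_from_query(DummyModel(), clip, "the red chair near the window")
    print(masks.shape, masks.dtype)  # (8, 480, 640) bool
```

Because the output is an ordinary mask tensor rather than a 3D representation, a downstream embodied agent can consume it directly for grasping, navigation, or region-level question answering, which is the integration benefit the summary highlights.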