RynnEC: Bringing MLLMs into Embodied World
Blog post from Hugging Face
RynnEC is a multimodal large language model (MLLM) developed by Alibaba DAMO Academy, designed to strengthen embodied cognition through video-centric object and spatial understanding. Unlike models trained primarily on internet-scale image data, RynnEC focuses on egocentric video to build the fine-grained visual understanding and spatial awareness that real-world robotic tasks demand.

The model requires no explicit 3D inputs: from RGB video alone, it grounds user queries as semantic masks, which makes it straightforward to integrate into embodied agents. Its training is backed by a scalable data pipeline that converts raw video into a range of embodied cognition tasks, including object captioning and spatial reasoning, drawing on 20,000 videos recorded in diverse home environments.

Trained through a structured four-stage process, RynnEC shows substantial gains in object and spatial cognition, achieving top results on the RynnEC-Bench benchmark and outperforming other advanced MLLMs such as Gemini 2.5 Pro. These capabilities position RynnEC as a strong foundation for more interactive, cognitively capable robots in complex real-world settings.
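To make the query-to-mask interface concrete, here is a minimal Python sketch of the input/output contract: RGB video frames plus a text query in, per-frame binary masks out. The `RynnECModel` class and `predict_mask` method are illustrative stand-ins, not the released RynnEC API; the body is a dummy implementation that only demonstrates the shapes involved.

```python
# Minimal sketch of the query-to-mask interface described above.
# RynnECModel is a hypothetical stand-in, NOT the released RynnEC API:
# it illustrates the contract (RGB frames + text query -> per-frame
# binary masks) with a dummy implementation.

from dataclasses import dataclass
import numpy as np


@dataclass
class MaskPrediction:
    query: str
    masks: np.ndarray  # (num_frames, H, W) boolean masks, one per frame


class RynnECModel:
    """Stand-in for a video MLLM that grounds text queries as masks."""

    def predict_mask(self, frames: np.ndarray, query: str) -> MaskPrediction:
        # A real model would encode the egocentric clip and the query,
        # then decode a semantic mask; here we return empty masks of the
        # right shape just to show the interface.
        num_frames, height, width, _ = frames.shape
        return MaskPrediction(
            query=query,
            masks=np.zeros((num_frames, height, width), dtype=bool),
        )


if __name__ == "__main__":
    # Eight dummy 480x640 RGB frames standing in for an egocentric clip.
    clip = np.zeros((8, 480, 640, 3), dtype=np.uint8)
    model = RynnECModel()
    pred = model.predict_mask(clip, "the red mug on the kitchen counter")
    print(pred.query, pred.masks.shape)  # -> ... (8, 480, 640)
```

Because the interface is just video frames and natural-language text, an embodied agent can treat the model as a drop-in perception module: no depth sensors or point clouds are required upstream, which is the practical payoff of the RGB-only design noted above.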