NEO-unify: Building Native Multimodal Unified Models End to End

Blog post from HuggingFace

Post Details
Company:
Date Published:
Author: Haiwen Diao, Lewei Lu, and Ziwei Liu
Word Count: 623
Language: -
Hacker News Points: -
Summary

SenseTime, in collaboration with NTU, introduces NEO-unify, a multimodal model that drops the conventional vision encoder and variational autoencoder (VAE) and instead operates directly on native inputs: pixels and words. This end-to-end design pairs a near-lossless visual interface with a Mixture-of-Transformer (MoT) architecture so that understanding and generation are handled by one model. Text is trained with an autoregressive cross-entropy loss, and vision is trained with pixel flow matching. The authors report that NEO-unify preserves both semantic and pixel fidelity without pre-trained encoders, and that it shows strong image editing capabilities and high data-scaling efficiency. By integrating perception and generation in a single model, NEO-unify aims to enable native multimodal reasoning and world modeling, a step toward AI systems that understand and operate across modalities without translating between them.
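
The two training objectives named in the summary can be illustrated with a small, dependency-free Python sketch. It is not NEO-unify's implementation: the summary does not give the exact formulation, so the linear noise-to-pixel path, the mean-squared-error velocity loss, the `vision_weight` parameter, and all function names here are assumptions chosen for illustration.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def flow_matching_pair(pixels, noise, t):
    """Point on an assumed linear path from noise (t=0) to pixels (t=1),
    together with the constant velocity along that path."""
    x_t = [(1.0 - t) * n + t * p for n, p in zip(noise, pixels)]
    velocity = [p - n for n, p in zip(noise, pixels)]
    return x_t, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    diffs = [(a - b) ** 2 for a, b in zip(predicted_velocity, target_velocity)]
    return sum(diffs) / len(diffs)


def unified_loss(text_logits, text_targets, predicted_velocity,
                 target_velocity, vision_weight=1.0):
    """Average next-token cross-entropy plus a weighted flow-matching term.
    `vision_weight` is a hypothetical balancing knob, not from the post."""
    text = sum(cross_entropy(l, y) for l, y in zip(text_logits, text_targets))
    text /= len(text_targets)
    vision = flow_matching_loss(predicted_velocity, target_velocity)
    return text + vision_weight * vision


if __name__ == "__main__":
    # Toy example: two "pixels", one text token over a two-word vocabulary.
    x_t, v = flow_matching_pair(pixels=[1.0, 0.0], noise=[0.0, 1.0], t=0.25)
    print(x_t, v)
    print(unified_loss([[0.0, 0.0]], [0], v, v))
```

In a real model the predicted velocity would come from the network evaluated at `x_t` and `t`; here the target is passed back in as the prediction only to show that the vision term vanishes when the two agree.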