NEO-unify: Building Native Multimodal Unified Models End to End
Blog post from Hugging Face
SenseTime, in collaboration with NTU, introduces NEO-unify, a multimodal model that drops the traditional vision encoder and variational autoencoder (VAE) and instead operates directly on native inputs: pixels and words. This end-to-end design pairs a near-lossless visual interface with a Mixture-of-Transformer (MoT) architecture so that understanding and generation reinforce each other, training text with autoregressive cross-entropy and vision with pixel flow matching. According to the post, NEO-unify preserves both semantic and pixel fidelity without any pre-trained encoder, and shows strong image editing capabilities and high data-scaling efficiency. By combining perception and generation in a single model, NEO-unify aims at native multimodal reasoning and world modeling: systems that comprehend and operate across modalities directly, without translating between them.
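To make the two training objectives concrete, here is a minimal sketch of how a text cross-entropy term and a pixel flow matching term could be combined into one loss. It is illustrative only and not NEO-unify's implementation: the post does not specify the flow matching formulation, so the straight-line (rectified-flow style) path from noise to pixels is an assumption, and every name here (`text_cross_entropy`, `flow_matching_pair`, `pixel_flow_matching_loss`, `unified_loss`, `vision_weight`) is hypothetical.

```python
import math
import random


def text_cross_entropy(logits, targets):
    """Mean next-token cross-entropy; logits is a list of per-position score lists."""
    total = 0.0
    for scores, target in zip(logits, targets):
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
    return total / len(targets)


def flow_matching_pair(pixels, noise, t):
    """Point on the straight noise-to-pixels path at time t, plus its velocity.

    Assumption: a linear path x_t = (1 - t) * noise + t * pixels, whose
    velocity is the constant pixels - noise.
    """
    x_t = [(1.0 - t) * n + t * p for p, n in zip(pixels, noise)]
    velocity = [p - n for p, n in zip(pixels, noise)]
    return x_t, velocity


def pixel_flow_matching_loss(predict_velocity, pixels, rng):
    """Mean squared error between predicted and target velocity on raw pixels."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    t = rng.random()  # one random time in [0, 1) per sample
    x_t, target = flow_matching_pair(pixels, noise, t)
    predicted = predict_velocity(x_t, t)
    return sum((a - b) ** 2 for a, b in zip(predicted, target)) / len(pixels)


def unified_loss(logits, targets, predict_velocity, pixels, rng, vision_weight=1.0):
    """Sum of the text term and the weighted vision term (weighting is assumed)."""
    text_term = text_cross_entropy(logits, targets)
    vision_term = pixel_flow_matching_loss(predict_velocity, pixels, rng)
    return text_term + vision_weight * vision_term


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [[0.0, 0.0, 0.0, 0.0]] * 3   # uniform scores over a 4-token vocabulary
    targets = [0, 1, 2]
    pixels = [0.2, 0.5, 0.9]              # a tiny flattened "image"
    dummy_model = lambda x_t, t: [0.0 for _ in x_t]  # stand-in velocity predictor
    print(unified_loss(logits, targets, dummy_model, pixels, rng))
```

In a real model both terms would come from the same network, with `predict_velocity` replaced by the model's vision output; the point of the sketch is only that the vision loss is computed on pixel values directly rather than on latents from a pre-trained encoder.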