What is 4M? Apple's Massively Multimodal Masked Modeling
Blog post from Roboflow
4M (Massively Multimodal Masked Modeling), introduced by Apple and EPFL researchers at NeurIPS 2023, is a significant step forward in multimodal machine learning. Rather than bolting on large language models, it adapts the training recipe that made them so scalable (a single model, a simple objective, and broad data) to vision, where models have traditionally been limited to a single modality and a single task. 4M trains one unified Transformer encoder-decoder with a multimodal masked modeling objective over a diverse set of input and output modalities, including text, images, and geometric and semantic modalities such as depth, surface normals, semantic segmentation, and bounding boxes.

The key to making this scalable is that every modality is first mapped into sequences of discrete tokens, and masked modeling is then performed on only a small, randomly sampled subset of those tokens: the model sees a small set of input tokens and learns to predict a small set of target tokens. Because the token budget stays fixed regardless of how many modalities are involved, training remains efficient, and a single model can handle a wide range of vision tasks out of the box, transfer well when fine-tuned on new tasks or modalities, and act as a generative model conditioned on any combination of modalities.

4M is trained on a diverse pseudo-labeled dataset, in which the additional modalities are produced by strong off-the-shelf specialist models, so the framework covers many of the modalities that matter for vision without requiring aligned ground-truth labels. This also enables rich cross-modal interaction and semantically conditioned generation, such as generating an image from a segmentation map or a caption. Despite some limitations observed at larger model sizes, 4M performs robustly across numerous vision tasks and leaves clear room for further gains from adding more modalities and improving the training data.
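To make the masked modeling scheme concrete, here is a minimal PyTorch sketch of the idea, not Apple's implementation: the names (Toy4M, sample_subsets), the vocabulary size, the token budgets, and the toy data are all assumptions chosen for illustration. It assumes every modality has already been converted to discrete token IDs, samples a small random subset of them as encoder inputs and another subset as decoder targets, and trains a unified encoder-decoder to predict the target tokens.

```python
# A toy sketch of 4M-style multimodal masked modeling (illustrative, not Apple's code).
import torch
import torch.nn as nn

VOCAB_SIZE = 16384                     # shared vocabulary over all tokenized modalities
DIM, N_INPUT, N_TARGET = 256, 64, 32   # model width and sampled input/target token budgets


class Toy4M(nn.Module):
    """A unified Transformer encoder-decoder over discrete multimodal tokens."""

    def __init__(self, num_modalities: int = 4):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, DIM)
        self.modality_emb = nn.Embedding(num_modalities, DIM)  # which modality a token belongs to
        self.pos_emb = nn.Embedding(1024, DIM)                  # position of a token within its modality
        self.transformer = nn.Transformer(
            d_model=DIM, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.head = nn.Linear(DIM, VOCAB_SIZE)

    def embed(self, tokens, modality_ids, positions):
        return self.token_emb(tokens) + self.modality_emb(modality_ids) + self.pos_emb(positions)

    def forward(self, inp, tgt):
        # inp / tgt are (tokens, modality_ids, positions) triples of sampled subsets.
        visible = self.embed(*inp)                                       # visible tokens go to the encoder
        queries = self.embed(torch.zeros_like(tgt[0]), tgt[1], tgt[2])   # token id 0 acts as a learned [MASK]
        decoded = self.transformer(visible, queries)
        return self.head(decoded)                                        # logits over the shared vocabulary


def sample_subsets(tokens, modality_ids, positions):
    """4M-style masking: keep only small random subsets of tokens as encoder
    inputs and as decoder targets (same subset across the batch, for simplicity)."""
    perm = torch.randperm(tokens.shape[1])
    inp_idx, tgt_idx = perm[:N_INPUT], perm[N_INPUT:N_INPUT + N_TARGET]

    def pick(idx):
        return tokens[:, idx], modality_ids[:, idx], positions[:, idx]

    return pick(inp_idx), pick(tgt_idx)


# Toy batch: 2 samples, 4 modalities of 128 tokens each (already discretized upstream).
B, per_mod, n_mod = 2, 128, 4
tokens = torch.randint(0, VOCAB_SIZE, (B, per_mod * n_mod))
modality_ids = torch.arange(n_mod).repeat_interleave(per_mod).expand(B, -1)
positions = torch.arange(per_mod).repeat(n_mod).expand(B, -1)

model = Toy4M()
inp, tgt = sample_subsets(tokens, modality_ids, positions)
logits = model(inp, tgt)                                                 # (B, N_TARGET, VOCAB_SIZE)
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE), tgt[0].reshape(-1))
loss.backward()
print(f"masked-token prediction loss: {loss.item():.3f}")
```

In the actual 4M setup, each modality gets its own tokenizer (for example, VQ-VAE-style tokenizers for image-like modalities and WordPiece for text), and the fixed input and target token budgets are what keep the per-step training cost roughly constant no matter how many modalities are added.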