BidirLM: Turning Generative LLMs into the Best Open-Source Omnimodal Encoders
Blog post from Hugging Face
BidirLM is an open-source project that adapts causal decoder language models into bidirectional encoders and then extends them into omnimodal encoders. The adaptation is a two-phase pipeline: Masked Next-Token Prediction (MNTP) first teaches the model to use bidirectional context, and contrastive training then improves embedding quality. Scaling this adaptation without access to the original data risks catastrophic forgetting; the project counters it with linear weight merging and multi-domain data mixtures, which significantly improve cross-domain knowledge retention. The creators then merged weights from specialized vision and audio models into their text encoder. The result, BidirLM-Omni, is a compact model that handles text, images, and audio and outperforms both omnimodal and unimodal specialists on standard benchmarks. The approach is modular: new specialized models can be integrated incrementally, which makes it a cost-effective and flexible alternative to traditional multimodal encoder training.
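To make the idea of linear weight merging concrete, here is a minimal sketch in plain Python. It assumes two checkpoints with identical parameter names and shapes and interpolates them parameter by parameter. The function name `linear_merge`, the coefficient `alpha`, and the toy checkpoints are hypothetical and chosen for illustration only; the post summarized here does not state which checkpoints BidirLM merges or which coefficients it uses.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def linear_merge(base: Weights, adapted: Weights, alpha: float = 0.5) -> Weights:
    """Interpolate two checkpoints parameter by parameter.

    alpha = 0.0 returns the base weights, alpha = 1.0 the adapted ones.
    (Hypothetical helper, not BidirLM's actual code.)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        adapted_values = adapted[name]
        if len(base_values) != len(adapted_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two checkpoints.
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base_values, adapted_values)
        ]
    return merged


# Made-up values, purely illustrative.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
adapted = {"layer.weight": [4.0, 6.0], "layer.bias": [3.0]}

merged = linear_merge(base, adapted, alpha=0.25)
print(merged)
```

With `alpha=0.25` the merged weights stay closer to the base checkpoint, which is the intuition behind using merging to limit forgetting: the adapted model's new behavior is blended in while much of the original parameterization is retained. Merging only makes sense when both checkpoints share the same architecture, which is why the sketch rejects mismatched names or shapes.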