
BidirLM: Turning Generative LLMs into the Best Open-Source Omnimodal Encoders

Blog post from HuggingFace

Post Details
Author: Nicolas-BZRD and Théo Deschamps-Berger
Word Count: 1,772
Summary

BidirLM is an open-source project that converts generative language models into omnimodal encoders by adapting causal decoders to attend bidirectionally. The adaptation is a two-phase pipeline: Masked Next-Token Prediction (MNTP) first teaches the model to use bidirectional context, and contrastive training then improves embedding quality. Scaling this adaptation without access to the original data risks catastrophic forgetting, which the project mitigates with linear weight merging and multi-domain data mixtures; according to the post, these significantly improve cross-domain knowledge retention. The authors then merge the weights of specialized vision and audio models into their text encoder to produce BidirLM-Omni, a compact model that handles text, images, and audio and outperforms both omnimodal and unimodal specialists on standard benchmarks. The approach is modular: new specialized models can be integrated incrementally, which makes it a cost-effective and flexible alternative to traditional multimodal encoder training.
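As a rough illustration of the linear weight merging mentioned in the summary, the sketch below interpolates two checkpoints parameter by parameter. It is a minimal sketch under stated assumptions, not the authors' recipe: the summary does not say which checkpoints are merged or with what coefficient, so `merge_linear`, the `alpha` parameter, and the toy parameter names are all hypothetical, and plain Python lists stand in for real tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a real checkpoint: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def merge_linear(base: StateDict, adapted: StateDict, alpha: float) -> StateDict:
    """Return (1 - alpha) * base + alpha * adapted for every parameter.

    alpha = 0 keeps the base weights, alpha = 1 keeps the adapted weights.
    The coefficient is an assumption; the source does not state a value.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: StateDict = {}
    for name, base_values in base.items():
        adapted_values = adapted[name]
        if len(base_values) != len(adapted_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two checkpoints.
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base_values, adapted_values)
        ]
    return merged


# Toy checkpoints with made-up values, only to exercise the function.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
adapted = {"layer.weight": [4.0, 2.0], "layer.bias": [3.0]}
merged = merge_linear(base, adapted, alpha=0.5)
print(merged)
```

With real models the same interpolation would typically be applied to each tensor in the two state dictionaries, which requires both checkpoints to share an architecture and parameter naming.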