Company:
Date Published:
Author: Frederik Hvilshøj
Word count: 2211
Language: English
Hacker News points: None

Summary

Over the past few months, the Encord machine learning team has developed what they claim is the world's largest open-source multimodal dataset, designed to support models that integrate text, images, video, audio, and 3D point clouds. The dataset aims to advance multimodal AI by providing a clean, extensive resource for open-source development. The process involved sourcing data across modalities, aligning it with retrieval models, and improving quality through human annotation. The team also built a retrieval model that embeds all modalities into a common space, evaluated on public benchmarks and on a newly built dataset for audio-point cloud embeddings. A baseline retrieval model trained on this data demonstrates that high-quality data can let a smaller model outperform ones with more parameters in cross-modal retrieval tasks. The Encord team hopes that sharing their methodology will help others construct similar datasets and further multimodal AI innovation.
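The core idea of embedding all modalities into a common space is that items from any modality become vectors that can be compared directly, so a text query can retrieve the nearest audio clip or point cloud by cosine similarity. The sketch below illustrates this with tiny hypothetical embeddings (the vectors, item names, and dimensionality are illustrative assumptions, not Encord's actual model outputs):

```python
import numpy as np

# Hypothetical toy embeddings in a shared 4-d space. In practice these would
# be produced by a learned retrieval model, one encoder per modality.
EMBEDDINGS = {
    ("text", "a dog barking"):      np.array([0.9, 0.1, 0.0, 0.1]),
    ("audio", "bark.wav"):          np.array([0.8, 0.2, 0.1, 0.0]),
    ("image", "cat.png"):           np.array([0.0, 0.9, 0.3, 0.1]),
    ("pointcloud", "car_scan.pcd"): np.array([0.1, 0.0, 0.9, 0.4]),
}

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v)

def retrieve(query_key, k=1):
    """Return the k nearest items from *other* modalities by cosine similarity."""
    q = normalize(EMBEDDINGS[query_key])
    scores = []
    for key, vec in EMBEDDINGS.items():
        if key[0] == query_key[0]:
            # Skip same-modality items: cross-modal retrieval only.
            continue
        scores.append((float(q @ normalize(vec)), key))
    scores.sort(reverse=True)  # highest cosine similarity first
    return [key for _, key in scores[:k]]

# The text query's nearest cross-modal neighbor is the audio clip.
print(retrieve(("text", "a dog barking")))
```

Evaluating such a model then reduces to checking how often the ground-truth counterpart of each query (e.g., the audio clip paired with a caption) appears among the top-k retrieved items.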