Company:
Date Published:
Author: Frederik Hvilshøj
Word count: 2211
Language: English
Hacker News points: None

Summary

Over the past few months, the Encord machine learning team has developed what they claim is the world's largest open-source multimodal dataset, designed to support models that integrate text, images, video, audio, and 3D point clouds. The dataset aims to advance multimodal AI by providing a clean, extensive resource for open-source development. The process involved sourcing data across modalities, aligning it with retrieval models, and improving quality through human annotation. The team also built a retrieval model that embeds all modalities into a common space, evaluated on public benchmarks and on a newly built dataset for audio-point cloud embeddings. A baseline retrieval model trained on this data demonstrates that high-quality data can let a smaller model outperform ones with more parameters in cross-modal retrieval tasks. The Encord team hopes that sharing their methodology will help others construct similar datasets and further multimodal AI innovation.
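The core idea of embedding all modalities into a common space is that items from any modality become vectors that can be compared directly, so a text query can retrieve the nearest audio clip or point cloud by cosine similarity. The sketch below illustrates this with tiny hypothetical embeddings (the vectors, item names, and dimensionality are illustrative assumptions, not Encord's actual model outputs):

```python
import numpy as np

# Hypothetical toy embeddings in a shared 4-d space. In practice these would
# be produced by a learned retrieval model, one encoder per modality.
EMBEDDINGS = {
    ("text", "a dog barking"):      np.array([0.9, 0.1, 0.0, 0.1]),
    ("audio", "bark.wav"):          np.array([0.8, 0.2, 0.1, 0.0]),
    ("image", "cat.png"):           np.array([0.0, 0.9, 0.3, 0.1]),
    ("pointcloud", "car_scan.pcd"): np.array([0.1, 0.0, 0.9, 0.4]),
}

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v)

def retrieve(query_key, k=1):
    """Return the k nearest items from *other* modalities by cosine similarity."""
    q = normalize(EMBEDDINGS[query_key])
    scores = []
    for key, vec in EMBEDDINGS.items():
        if key[0] == query_key[0]:
            # Skip same-modality items: cross-modal retrieval only.
            continue
        scores.append((float(q @ normalize(vec)), key))
    scores.sort(reverse=True)  # highest cosine similarity first
    return [key for _, key in scores[:k]]

# The text query's nearest cross-modal neighbor is the audio clip.
print(retrieve(("text", "a dog barking")))
```

Evaluating such a model then reduces to checking how often the ground-truth counterpart of each query (e.g., the audio clip paired with a caption) appears among the top-k retrieved items.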