Benchmarking Models for Multi-modal Search
Blog post from Marqo
Selecting the right models for multi-modal search means balancing relevance, computational cost, and inference speed, along with model properties such as context length, image input resolution, and embedding dimensionality, all of which affect both search quality and latency.

CLIP (Contrastive Language-Image Pre-training) is a key framework in this domain: it learns visual representations from natural language supervision and embeds images and text in a shared latent space, so a text query can be compared directly against image embeddings for cross-modal search (see the first sketch below).

Performance benchmarks are conducted on specific GPUs and datasets, and three models are recommended depending on how latency trades off against retrieval performance: ViT-B-32 (the smallest and typically fastest), ViT-L-14 (a middle ground), and xlm-roberta-large-ViT-H-14 (a larger multilingual model with the strongest retrieval quality). The second sketch below shows one way such latency figures can be measured.

Finally, a strategic selection process is emphasized: first filter candidate models by hard latency, storage, and memory constraints, then evaluate the surviving shortlist against domain-specific benchmarks to confirm strong performance on the target task, such as product search. The last sketch below illustrates the filtering step.
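As a concrete illustration of cross-modal search in a shared latent space, here is a minimal sketch using the open_clip library. The pretrained tag, image path, and query strings are placeholder assumptions, not values from the post:

```python
# Minimal CLIP cross-modal search sketch with open_clip.
# "example.jpg" and the text queries are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # 1 x 3 x 224 x 224
texts = tokenizer(["a red dress", "a leather sofa"])

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(texts)
    # Normalize so dot products become cosine similarities.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# Higher score = better text-image match in the shared space.
scores = (txt_emb @ img_emb.T).squeeze(-1)
print(scores)
```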
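Latency figures of the kind reported in such benchmarks can be gathered with a simple timing loop. This sketch assumes a CUDA GPU and a model/tokenizer like the ones created above; batch size 1 mimics a single live query:

```python
import time
import torch

def measure_text_latency(model, tokenizer, device="cuda", n_runs=50):
    """Return mean single-query text-encoding latency in milliseconds.

    Warm-up iterations and torch.cuda.synchronize() keep the timing
    honest on an asynchronous GPU.
    """
    model = model.to(device).eval()
    text = tokenizer(["a pair of running shoes"]).to(device)
    with torch.no_grad():
        for _ in range(5):  # warm-up runs, excluded from timing
            model.encode_text(text)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model.encode_text(text)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs * 1000
```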
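The two-stage selection process might look like the following sketch. The latency and memory numbers are purely illustrative assumptions, not measurements from the post; only the embedding dimensions are actual properties of these models:

```python
# Stage 1: filter candidates by hard constraints.
# latency_ms and memory_gb values below are hypothetical placeholders.
candidates = [
    {"name": "ViT-B-32",                   "latency_ms": 5,  "memory_gb": 1.0, "dim": 512},
    {"name": "ViT-L-14",                   "latency_ms": 15, "memory_gb": 2.5, "dim": 768},
    {"name": "xlm-roberta-large-ViT-H-14", "latency_ms": 40, "memory_gb": 5.0, "dim": 1024},
]

def shortlist(models, max_latency_ms, max_memory_gb):
    """Drop any model that violates the latency or memory budget."""
    return [
        m for m in models
        if m["latency_ms"] <= max_latency_ms and m["memory_gb"] <= max_memory_gb
    ]

# Stage 2 (not shown): evaluate the shortlist on a domain-specific
# benchmark, e.g. recall@k on a held-out product-search dataset,
# and pick the best performer.
surviving = shortlist(candidates, max_latency_ms=20, max_memory_gb=4.0)
print([m["name"] for m in surviving])
```

Note that embedding dimensionality (the `dim` field) also feeds back into the storage constraint: larger embeddings mean a bigger vector index for the same corpus.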