Company:
Date Published:
Author: -
Word count: 991
Language: English
Hacker News points: None

Summary

Retrieval augmented generation (RAG) over visual content, such as slide decks, is becoming increasingly important with the advent of multi-modal LLMs like GPT-4V. Researchers have developed two approaches to the problem. The first, multi-modal embeddings, extracts slides as images, embeds each image with a multi-modal embedding model, retrieves the most relevant slide(s) for a user question, and passes those images to GPT-4V for answer synthesis. The second, a multi-vector retriever, extracts slides as images, uses GPT-4V to summarize each image, embeds the summaries with links back to the original images, retrieves images by similarity between summary and user question, and again passes the retrieved images to GPT-4V for answer synthesis. Evaluating these methods on a public benchmark, researchers found that both multi-modal approaches far exceed the performance of text-only RAG. The central challenge remains retrieving the correct image; image summarization improves retrieval but adds complexity and cost. To aid in testing and deployment, a template for creating multi-modal RAG apps using Chroma and OpenCLIP embeddings has been released.
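
As a rough sketch of the first approach, the snippet below pairs OpenCLIP embeddings with a Chroma vector store (the same stack as the released template). The slide paths, question, and collection name are hypothetical, and exact LangChain class locations may vary by version.

```python
from langchain_community.vectorstores import Chroma
from langchain_core.messages import HumanMessage
from langchain_experimental.open_clip import OpenCLIPEmbeddings
from langchain_openai import ChatOpenAI

# Embed each slide image directly with a multi-modal (CLIP-style) model.
vectorstore = Chroma(
    collection_name="slide_deck",  # hypothetical name
    embedding_function=OpenCLIPEmbeddings(),
)
vectorstore.add_images(uris=["slides/slide_01.png", "slides/slide_02.png"])  # hypothetical paths

# Retrieve the slide closest to the user question in the shared
# image/text embedding space.
question = "What was the reported Q3 revenue growth?"  # hypothetical question
docs = vectorstore.similarity_search(question, k=1)
image_b64 = docs[0].page_content  # Chroma stores added images as base64 strings

# Pass the retrieved image to GPT-4V for answer synthesis.
llm = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=512)
answer = llm.invoke([
    HumanMessage(content=[
        {"type": "text", "text": question},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ])
])
print(answer.content)
```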
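And a minimal sketch of the second approach, assuming the GPT-4V image summaries and base64-encoded slides have already been produced (the `image_summaries` and `images_b64` placeholders below are hypothetical stand-ins). The summaries are indexed for similarity search, while retrieval returns the original images, which are then handed to GPT-4V exactly as in the first sketch.

```python
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Hypothetical precomputed inputs: one GPT-4V summary per slide,
# plus the matching base64-encoded slide images.
image_summaries = ["Bar chart of quarterly revenue; Q3 growth of 8% year over year."]
images_b64 = ["<base64-encoded slide image>"]

# One shared id links each summary to its source image.
doc_ids = [str(uuid.uuid4()) for _ in image_summaries]

retriever = MultiVectorRetriever(
    vectorstore=Chroma(
        collection_name="slide_summaries",  # hypothetical name
        embedding_function=OpenAIEmbeddings(),
    ),
    docstore=InMemoryStore(),
    id_key="doc_id",
)

# Index the text summaries for similarity search against the user question...
retriever.vectorstore.add_documents([
    Document(page_content=summary, metadata={"doc_id": doc_ids[i]})
    for i, summary in enumerate(image_summaries)
])

# ...but store the raw images, so retrieval returns the image itself.
retriever.docstore.mset(list(zip(doc_ids, images_b64)))

retrieved_images = retriever.invoke("What was the reported Q3 revenue growth?")
```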