Company:
Date Published:
Author: -
Word count: 991
Language: English
Hacker News points: None

Summary

Retrieval augmented generation (RAG) over visual content, such as slide decks, is becoming increasingly important with the advent of multi-modal LLMs like GPT-4V. Researchers have developed two approaches to the problem. The first, multi-modal embeddings, extracts slides as images, embeds each image with a multi-modal embedding model, retrieves the most relevant slide(s) for a user question, and passes those images to GPT-4V for answer synthesis. The second, a multi-vector retriever, extracts slides as images, uses GPT-4V to summarize each image, embeds the summaries with links back to the original images, retrieves images by similarity between summary and user question, and again passes the retrieved images to GPT-4V for answer synthesis. Evaluating these methods on a public benchmark, researchers found that both multi-modal approaches far exceed the performance of text-only RAG. The central challenge remains retrieving the correct image; image summarization improves retrieval but adds complexity and cost. To aid in testing and deployment, a template for creating multi-modal RAG apps using Chroma and OpenCLIP embeddings has been released.
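
As a rough sketch of the first approach, the snippet below pairs OpenCLIP embeddings with a Chroma vector store (the same stack as the released template). The slide paths, question, and collection name are hypothetical, and exact LangChain class locations may vary by version.

```python
from langchain_community.vectorstores import Chroma
from langchain_core.messages import HumanMessage
from langchain_experimental.open_clip import OpenCLIPEmbeddings
from langchain_openai import ChatOpenAI

# Embed each slide image directly with a multi-modal (CLIP-style) model.
vectorstore = Chroma(
    collection_name="slide_deck",  # hypothetical name
    embedding_function=OpenCLIPEmbeddings(),
)
vectorstore.add_images(uris=["slides/slide_01.png", "slides/slide_02.png"])  # hypothetical paths

# Retrieve the slide closest to the user question in the shared
# image/text embedding space.
question = "What was the reported Q3 revenue growth?"  # hypothetical question
docs = vectorstore.similarity_search(question, k=1)
image_b64 = docs[0].page_content  # Chroma stores added images as base64 strings

# Pass the retrieved image to GPT-4V for answer synthesis.
llm = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=512)
answer = llm.invoke([
    HumanMessage(content=[
        {"type": "text", "text": question},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ])
])
print(answer.content)
```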
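And a minimal sketch of the second approach, assuming the GPT-4V image summaries and base64-encoded slides have already been produced (the `image_summaries` and `images_b64` placeholders below are hypothetical stand-ins). The summaries are indexed for similarity search, while retrieval returns the original images, which are then handed to GPT-4V exactly as in the first sketch.

```python
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Hypothetical precomputed inputs: one GPT-4V summary per slide,
# plus the matching base64-encoded slide images.
image_summaries = ["Bar chart of quarterly revenue; Q3 growth of 8% year over year."]
images_b64 = ["<base64-encoded slide image>"]

# One shared id links each summary to its source image.
doc_ids = [str(uuid.uuid4()) for _ in image_summaries]

retriever = MultiVectorRetriever(
    vectorstore=Chroma(
        collection_name="slide_summaries",  # hypothetical name
        embedding_function=OpenAIEmbeddings(),
    ),
    docstore=InMemoryStore(),
    id_key="doc_id",
)

# Index the text summaries for similarity search against the user question...
retriever.vectorstore.add_documents([
    Document(page_content=summary, metadata={"doc_id": doc_ids[i]})
    for i, summary in enumerate(image_summaries)
])

# ...but store the raw images, so retrieval returns the image itself.
retriever.docstore.mset(list(zip(doc_ids, images_b64)))

retrieved_images = retriever.invoke("What was the reported Q3 revenue growth?")
```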