Company:
Date Published:
Author: -
Word count: 1016
Language: English
Hacker News points: None

Summary

Retrieval-augmented generation (RAG) for large language model (LLM) apps has expanded to include visual content from slide decks, enabled by multi-modal LLMs like GPT-4V. This makes interactive chat and Q&A possible by retrieving and synthesizing information from visual data.

Two main approaches to multi-modal RAG over slide decks are compared. The first uses multi-modal embeddings: each slide image is embedded directly, and a question is embedded into the same space to retrieve the most similar slides (sketched below). The second uses a multi-vector retriever: a multi-modal LLM first summarizes each image, the text summaries are embedded for retrieval, and the raw image linked to each summary is passed to the model at answer time (also sketched below). Multi-modal embeddings are simpler, but struggle to distinguish visually similar slides; image summarization is more complex and costly, but improves retrieval accuracy.

A public benchmark evaluation built on a Datadog presentation showed that the multi-modal methods significantly outperform text-only RAG, scoring 60% and 90% accuracy versus 20% for the text-only baseline. The evaluation also highlighted GPT-4V's effectiveness at extracting structured data from images and emphasized that successful question answering hinges on retrieving the correct image. To support further exploration and deployment, a template leveraging Chroma and OpenCLIP multi-modal embeddings has been released.
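As a rough sketch of the first approach, the snippet below embeds slide images directly with LangChain's OpenCLIP integration and indexes them in Chroma, in the spirit of the released template. The image paths and the example question are illustrative.

```python
# Multi-modal embedding approach: index slide images directly in Chroma
# using OpenCLIP, which places text and images in a shared embedding space.
from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

vectorstore = Chroma(
    collection_name="slide_deck",
    embedding_function=OpenCLIPEmbeddings(),
)

# Illustrative paths to slides exported as JPEG images.
vectorstore.add_images(uris=["slides/slide_1.jpg", "slides/slide_2.jpg"])

# The question is embedded into the same space as the images,
# so similarity search returns the most relevant slide(s).
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke("How did Datadog's revenue grow last quarter?")
```

Because retrieval here operates on raw pixels, two visually similar slides (for example, two dense tables) can land close together in embedding space, which is the weakness noted above.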
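The second approach can be sketched with LangChain's MultiVectorRetriever, which searches over the summaries while returning the linked raw image. The summaries below are hard-coded placeholders to keep the example self-contained; in practice they would be generated per slide by a multi-modal LLM such as GPT-4V.

```python
# Multi-vector approach: embed text summaries of each slide for retrieval,
# but store and return the raw image for answer synthesis.
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=Chroma(
        collection_name="slide_summaries",
        embedding_function=OpenAIEmbeddings(),
    ),
    docstore=InMemoryStore(),  # holds the raw images, keyed by doc_id
    id_key=id_key,
)

# Placeholder summaries; in practice, produced by GPT-4V per slide image.
images = ["slides/slide_1.jpg", "slides/slide_2.jpg"]
summaries = [
    "Bar chart of quarterly revenue growth by product line.",
    "Table of operating margin over the last four quarters.",
]
doc_ids = [str(uuid.uuid4()) for _ in images]

# Index the summaries for similarity search...
retriever.vectorstore.add_documents(
    [
        Document(page_content=summary, metadata={id_key: doc_ids[i]})
        for i, summary in enumerate(summaries)
    ]
)
# ...and link each summary to its raw image in the docstore.
retriever.docstore.mset(list(zip(doc_ids, images)))

# Retrieval searches the summaries but returns the linked image path.
docs = retriever.invoke("How did revenue grow last quarter?")
```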
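Once the right slide is retrieved, it can be handed to GPT-4V for structured extraction or answer synthesis. A minimal sketch with the OpenAI Python SDK, assuming an illustrative local image path and question:

```python
# Pass a retrieved slide image to GPT-4V as a base64 data URL and ask
# a question grounded in that image.
import base64

from openai import OpenAI

client = OpenAI()

with open("slides/slide_1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Using only this slide, how did revenue grow last quarter?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```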