Company:
Date Published:
Author: -
Word count: 1016
Language: English
Hacker News points: None

Summary

Retrieval-augmented generation (RAG) for large language model (LLM) apps has expanded to include visual content from slide decks, enabled by multi-modal LLMs like GPT-4V. This makes interactive chat and Q&A possible by retrieving and synthesizing information from visual data.

Two main approaches to multi-modal RAG over slide decks are compared. The first uses multi-modal embeddings: each slide image is embedded directly, and a question is embedded into the same space to retrieve the most similar slides (sketched below). The second uses a multi-vector retriever: a multi-modal LLM first summarizes each image, the text summaries are embedded for retrieval, and the raw image linked to each summary is passed to the model at answer time (also sketched below). Multi-modal embeddings are simpler, but struggle to distinguish visually similar slides; image summarization is more complex and costly, but improves retrieval accuracy.

A public benchmark evaluation built on a Datadog presentation showed that the multi-modal methods significantly outperform text-only RAG, scoring 60% and 90% accuracy versus 20% for the text-only baseline. The evaluation also highlighted GPT-4V's effectiveness at extracting structured data from images and emphasized that successful question answering hinges on retrieving the correct image. To support further exploration and deployment, a template leveraging Chroma and OpenCLIP multi-modal embeddings has been released.
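As a rough sketch of the first approach, the snippet below embeds slide images directly with LangChain's OpenCLIP integration and indexes them in Chroma, in the spirit of the released template. The image paths and the example question are illustrative.

```python
# Multi-modal embedding approach: index slide images directly in Chroma
# using OpenCLIP, which places text and images in a shared embedding space.
from langchain_community.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

vectorstore = Chroma(
    collection_name="slide_deck",
    embedding_function=OpenCLIPEmbeddings(),
)

# Illustrative paths to slides exported as JPEG images.
vectorstore.add_images(uris=["slides/slide_1.jpg", "slides/slide_2.jpg"])

# The question is embedded into the same space as the images,
# so similarity search returns the most relevant slide(s).
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke("How did Datadog's revenue grow last quarter?")
```

Because retrieval here operates on raw pixels, two visually similar slides (for example, two dense tables) can land close together in embedding space, which is the weakness noted above.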
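The second approach can be sketched with LangChain's MultiVectorRetriever, which searches over the summaries while returning the linked raw image. The summaries below are hard-coded placeholders to keep the example self-contained; in practice they would be generated per slide by a multi-modal LLM such as GPT-4V.

```python
# Multi-vector approach: embed text summaries of each slide for retrieval,
# but store and return the raw image for answer synthesis.
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=Chroma(
        collection_name="slide_summaries",
        embedding_function=OpenAIEmbeddings(),
    ),
    docstore=InMemoryStore(),  # holds the raw images, keyed by doc_id
    id_key=id_key,
)

# Placeholder summaries; in practice, produced by GPT-4V per slide image.
images = ["slides/slide_1.jpg", "slides/slide_2.jpg"]
summaries = [
    "Bar chart of quarterly revenue growth by product line.",
    "Table of operating margin over the last four quarters.",
]
doc_ids = [str(uuid.uuid4()) for _ in images]

# Index the summaries for similarity search...
retriever.vectorstore.add_documents(
    [
        Document(page_content=summary, metadata={id_key: doc_ids[i]})
        for i, summary in enumerate(summaries)
    ]
)
# ...and link each summary to its raw image in the docstore.
retriever.docstore.mset(list(zip(doc_ids, images)))

# Retrieval searches the summaries but returns the linked image path.
docs = retriever.invoke("How did revenue grow last quarter?")
```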
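Once the right slide is retrieved, it can be handed to GPT-4V for structured extraction or answer synthesis. A minimal sketch with the OpenAI Python SDK, assuming an illustrative local image path and question:

```python
# Pass a retrieved slide image to GPT-4V as a base64 data URL and ask
# a question grounded in that image.
import base64

from openai import OpenAI

client = OpenAI()

with open("slides/slide_1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Using only this slide, how did revenue grow last quarter?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```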