Debugging Our Docs RAG, Part 1: Evaluating a Production RAG System
Blog post from dltHub
In a detailed evaluation of its internal documentation chatbot "dhelp," the dlt team identified several performance issues that had emerged as the bot's usage grew and as questions arose about GPT-4's output quality. The team built an evaluation dataset from real user queries to assess the system's effectiveness, and found that only 3 of 14 queries were resolved satisfactorily. The primary failure modes were hallucinated content, an unclear separation between retrieval and generation, and answers that were technically correct but unhelpful.

These findings showed that the system's shortcomings had no single cause, which suggested that improvements could come from reconfiguring the Retrieval-Augmented Generation (RAG) pipeline without changing the product's surface. The team plans to improve the bot by adjusting generative and embedding model choices, chunking strategies, and system prompt design. The next step is testing newer models to isolate and improve the generation layer, using the established evaluation set to measure progress systematically.
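The evaluation workflow described above (label each real user query as resolved or assign it a failure mode, then tally the results) can be sketched as follows. This is a minimal illustration, not the dlt team's actual harness: the query texts, verdict labels, and the `summarize` helper are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical failure-mode labels matching the three categories
# named in the post; the exact label strings are an assumption.
FAILURE_MODES = (
    "hallucination",
    "retrieval_generation_mismatch",
    "correct_but_unhelpful",
)

@dataclass
class EvalCase:
    query: str    # a real user question pulled from chat logs
    verdict: str  # "resolved" or one of FAILURE_MODES

def summarize(cases):
    """Tally resolved cases and count occurrences of each failure mode."""
    resolved = sum(1 for c in cases if c.verdict == "resolved")
    by_mode = {m: sum(1 for c in cases if c.verdict == m) for m in FAILURE_MODES}
    return {"resolved": resolved, "total": len(cases), "failures": by_mode}

# Illustrative cases only -- not the team's real 14-query dataset.
cases = [
    EvalCase("How do I configure a destination?", "resolved"),
    EvalCase("Does dlt support incremental loading?", "hallucination"),
    EvalCase("How do I set credentials?", "correct_but_unhelpful"),
]
print(summarize(cases))
```

Keeping the verdicts as a fixed, small label set like this is what makes re-running the same query set after each configuration change (model swap, new chunking, revised system prompt) a meaningful before/after comparison.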