Improving retrieval with LLM-as-a-judge
Blog post from Vespa
The blog post explores using large language models (LLMs) as judges to evaluate retrieval systems, offering a cost-effective and scalable alternative to human relevance assessment. The methodology centers on building a reusable relevance dataset for search.vespa.ai: start with a small human-labeled set of query-passage pairs, prompt GPT-4 to grade the same pairs, and check that the model's judgments agree with the human ones. In the post's experiments, GPT-4's grades correlated strongly with human labels, so the judge was then used to label a dataset of over 10,000 query-passage pairs. With that dataset in place, different retrieval methods and ranking parameters can be compared systematically and iterated on quickly, without the cost and turnaround time of human labeling.
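The core loop the post describes is: prompt the LLM to grade a query-passage pair, then measure how well those grades line up with a small set of human labels before trusting the judge at scale. Below is a minimal sketch of that loop using the OpenAI Python client; the prompt wording, the 0-2 grading scale, and the exact-match agreement metric are illustrative assumptions, not the exact setup from the post.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical judging prompt; the post's actual prompt and scale may differ.
JUDGE_PROMPT = """You are a search quality rater. Given a query and a passage,
rate how relevant the passage is to the query on a scale of 0-2:
0 = not relevant, 1 = partially relevant, 2 = highly relevant.
Answer with a single digit only.

Query: {query}
Passage: {passage}
Relevance:"""


def judge(query: str, passage: str, model: str = "gpt-4") -> int:
    """Ask the LLM to grade a single query-passage pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic grading
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(query=query, passage=passage)}
        ],
    )
    return int(response.choices[0].message.content.strip())


def agreement(llm_labels: list[int], human_labels: list[int]) -> float:
    """Fraction of pairs where the LLM grade matches the human grade."""
    matches = sum(l == h for l, h in zip(llm_labels, human_labels))
    return matches / len(human_labels)


if __name__ == "__main__":
    # Tiny hypothetical human-labeled sample: (query, passage, human grade).
    pairs = [
        ("how does vespa rank documents",
         "Vespa ranks documents with rank profiles defined in the schema ...", 2),
        ("how does vespa rank documents",
         "This page describes how to deploy an application package ...", 0),
    ]
    llm = [judge(q, p) for q, p, _ in pairs]
    human = [h for _, _, h in pairs]
    print(f"agreement with human labels: {agreement(llm, human):.2f}")
```

Once agreement on the small human-labeled sample is acceptable, the same `judge` call can be run over many more query-passage pairs to build the larger relevance dataset used for comparing retrieval configurations.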