
Improving retrieval with LLM-as-a-judge

Blog post from Vespa

Post Details
Company: Vespa
Date Published: -
Author: Jo Kristian Bergum
Word Count: 3,729
Language: English
Hacker News Points: -
Summary

The blog post explores using large language models (LLMs) as judges to evaluate retrieval systems, a cost-effective and scalable alternative to human relevance judgment. It outlines a methodology for building a reusable relevance dataset for search.vespa.ai: assemble a small human-labeled dataset, prompt GPT-4 to judge query-passage pairs, and measure how well the model's judgments align with the human labels. In the reported experiments, GPT-4's judgments correlated strongly with the human assessments, enabling faster experimentation with retrieval methods and parameters. With the resulting dataset of over 10,000 query-passage pairs, the post demonstrates how LLM-generated relevance judgments can be used to optimize retrieval systems without the high cost of human labeling, supporting quicker iteration and improved search relevance.
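The workflow the summary describes — prompt an LLM to grade query-passage pairs, then check agreement with human labels — can be sketched roughly as follows. This is a minimal illustration, not the post's actual code: the prompt wording, the grading scale, and the model call (stubbed out here) are all assumptions, and the correlation helper is a plain Spearman's rho implementation standing in for whatever agreement metric the post used.

```python
# Hypothetical sketch of an LLM-as-a-judge evaluation loop.
# The prompt text and 0/1/2 grading scale are illustrative assumptions,
# not taken from the Vespa blog post.

def build_judge_prompt(query: str, passage: str) -> str:
    """Prompt asking the model to grade one query-passage pair."""
    return (
        "You are a search relevance judge.\n"
        f"Query: {query}\n"
        f"Passage: {passage}\n"
        "Answer with a single digit: 0 (irrelevant), "
        "1 (partially relevant), or 2 (highly relevant)."
    )

def parse_judgment(response_text: str) -> int:
    """Extract the first grade digit from the model's reply; 0 on failure."""
    for ch in response_text:
        if ch in "012":
            return int(ch)
    return 0

def spearman_rho(xs: list[int], ys: list[int]) -> float:
    """Spearman rank correlation between LLM grades and human labels."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            # Assign the average rank to tied values.
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice `build_judge_prompt` would be sent to the model (e.g. GPT-4) for each pair in the labeled set, the replies parsed with `parse_judgment`, and `spearman_rho` computed against the human labels; a high correlation on the small aligned set is what justifies trusting the LLM's judgments on the larger unlabeled pool.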