Exploring state-of-the-art LLMs as Judges
Blog post from Galtea
The study explores the use of large language models (LLMs) as automated judges that evaluate the performance of other models, offering a scalable alternative to human evaluation. It assesses several judge models, including Glider, Selene-1-Mini-Llama-3.1-8B, GPT-4o, and Claude 3.5 Sonnet, across multiple datasets, using the Pearson correlation coefficient and macro F1 score as metrics. Among the smaller models, Glider and Selene stand out for their accuracy, but they require more computational resources at inference time than models such as Phimini and FlowJudge. In red-teaming scenarios, where models are tested against risky prompts, GPT-4o and Claude 3.5 Sonnet excel, which highlights a performance gap between them and the smaller models. Even so, Glider and Selene show promise across a range of tasks, and Selene demonstrates strong multilingual capabilities. The study emphasizes the potential of LLM-as-a-judge systems for cost-effective model evaluation and suggests directions for future research, including synthetic dataset generation and improved fine-tuning techniques to raise performance and reliability across diverse linguistic contexts.
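To make the two metrics named above concrete, here is a minimal sketch of how agreement between a judge model and human annotators might be measured. It is not the study's own evaluation code: the function names `pearson` and `macro_f1`, the 1-to-5 score scale, and the `safe`/`unsafe` labels are all illustrative assumptions, and the data is made up. Pearson correlation suits numeric ratings, while macro F1 suits categorical verdicts because it weights every class equally regardless of how often it occurs.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty, equal-length label lists")
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); denom is never 0 for a label that occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical data: human ratings vs. judge ratings on an assumed 1-5 scale.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 3, 3, 4, 4]

# Hypothetical data: human vs. judge verdicts on risky prompts.
human_labels = ["safe", "unsafe", "unsafe", "safe"]
judge_labels = ["safe", "unsafe", "safe", "safe"]

if __name__ == "__main__":
    print(f"Pearson r: {pearson(human_scores, judge_scores):.3f}")
    print(f"Macro F1:  {macro_f1(human_labels, judge_labels):.3f}")
```

In practice, library implementations such as `scipy.stats.pearsonr` and `sklearn.metrics.f1_score(..., average="macro")` would likely be preferable; the hand-written versions here only keep the sketch dependency-free.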