LLM as a Judge: The Complete Guide
LLM-as-a-judge is a technique that uses one language model to evaluate another model's outputs against a specific rubric, making AI evaluation scalable for chatbots, retrieval-augmented generation (RAG) systems, and agents. The approach is grounded in the MT-Bench paper (Zheng et al., 2023), which showed that GPT-4 agrees with human experts roughly 80% of the time, about the same rate at which human annotators agree with each other.

The article outlines three core judging modes, each suited to different evaluation needs: pairwise comparison, single-answer grading with a rubric, and reference-based grading. Minimal sketches of the first two appear below.

LLM judges make evaluation scalable, but they come with known failure modes: position bias, verbosity bias, and self-preference bias. Calibrating the judge against a human-labeled gold set is therefore crucial before trusting its verdicts, and the judge should complement human evaluators rather than replace them, especially where the cost of a missed failure is high. The article stresses well-defined rubrics and iterative prompt optimization to improve the judge's alignment with human evaluations, and it advises against LLM judges where deterministic correctness checks suffice or where the cost of errors is prohibitive.
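As an illustration of single-answer grading, here is a minimal sketch using the OpenAI Python client. The model name, rubric wording, and judge_single_answer helper are illustrative assumptions, not code from the article:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric; in practice it should encode your domain's quality criteria.
RUBRIC = """Score the answer from 1 to 5:
5 = fully correct, complete, and grounded in the question
3 = partially correct or missing key details
1 = incorrect or off-topic
Respond with a single integer and nothing else."""

def judge_single_answer(question: str, answer: str, model: str = "gpt-4o") -> int:
    """Single-answer grading: one judge call scores one output against the rubric."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as the API allows
        messages=[
            {"role": "system", "content": "You are a strict evaluation judge."},
            {"role": "user", "content": f"{RUBRIC}\n\nQuestion: {question}\n\nAnswer: {answer}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```

Pinning temperature to 0 and demanding an integer-only reply keeps scores easy to parse and aggregate; a production version would add a retry or regex fallback for chatty replies.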
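Pairwise comparison is where position bias bites hardest. A common mitigation, sketched here as an assumption consistent with the article's warning rather than its prescribed method, is to ask the judge twice with the answer order swapped and only accept a verdict that survives the swap:

```python
def judge_pairwise(question: str, answer_a: str, answer_b: str, model: str = "gpt-4o") -> str:
    """Pairwise comparison with order-swapping to control for position bias.

    Returns 'A', 'B', or 'tie'. Reuses the `client` from the sketch above.
    """
    def ask(first: str, second: str) -> str:
        prompt = (
            f"Question: {question}\n\n"
            f"Answer 1:\n{first}\n\n"
            f"Answer 2:\n{second}\n\n"
            "Which answer is better? Reply with exactly '1', '2', or 'tie'."
        )
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()

    forward = ask(answer_a, answer_b)   # A shown first
    backward = ask(answer_b, answer_a)  # B shown first
    if forward == "1" and backward == "2":
        return "A"  # A preferred in both orderings
    if forward == "2" and backward == "1":
        return "B"  # B preferred in both orderings
    return "tie"    # inconsistent verdicts hint at position bias; count as a tie
```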
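For calibration against a labeled gold set, the natural check is agreement between judge and human verdicts. The helper below is hypothetical; Cohen's kappa (via scikit-learn) corrects raw agreement for chance, which matters when one label dominates:

```python
from sklearn.metrics import cohen_kappa_score

def calibrate(judge_scores: list[int], human_scores: list[int]) -> None:
    """Compare judge verdicts with a human-labeled gold set before trusting the judge."""
    agreement = sum(j == h for j, h in zip(judge_scores, human_scores)) / len(human_scores)
    kappa = cohen_kappa_score(human_scores, judge_scores)
    print(f"Raw agreement: {agreement:.1%}  Cohen's kappa: {kappa:.3f}")

# Example: eight gold items scored 1-5 by a human annotator and by the judge.
calibrate(judge_scores=[5, 3, 1, 5, 3, 3, 5, 1],
          human_scores=[5, 3, 1, 5, 5, 3, 5, 1])  # 87.5% raw agreement
```

MT-Bench's roughly 80% human-agreement figure is a useful reference point: a judge scoring well below it on your own gold set needs rubric or prompt iteration before its verdicts can be trusted.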