
How to Calibrate Your LLM Judge With Human Annotations

Blog post from Galileo

Post Details

Company: Galileo
Date Published: -
Author: Pratik Bhavsar
Word Count: 2,593
Language: English
Hacker News Points: -
Summary

An LLM judge (a large language model used to evaluate other models' outputs) requires continuous calibration rather than a one-time setup to stay accurate and relevant in dynamic production environments. Over time, model updates, prompt drift, and domain shifts cause judge drift: the judge's evaluations no longer align with human expert judgments, and the errors compound in downstream applications.

Calibration counteracts this. It uses stratified sampling to capture a representative range of outputs, engages subject matter experts (SMEs) to provide feedback and corrections, and updates anchor examples and rubrics based on that input. Inter-rater reliability (IRR) metrics such as Cohen's kappa measure how well the judge aligns with human reviewers; because they account for chance agreement, they give a more accurate picture of judge performance than raw agreement rates.

Continuous calibration is therefore a systematic loop of sampling, feedback collection, anchor updating, and validation that keeps the judge aligned with human expectations as production demands evolve.
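The stratified-sampling step described above could be sketched as follows. The record schema, the `verdict` stratification key, and the `per_stratum` parameter are illustrative assumptions, not details from the post; in practice you would stratify by whatever dimensions matter for your judge (task type, score bucket, domain):

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each stratum so that rare
    output types are still represented in the human review set."""
    rng = random.Random(seed)  # fixed seed keeps review sets reproducible
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = min(per_stratum, len(group))  # small strata contribute all they have
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical production records, bucketed by the judge's verdict.
records = [{"id": i, "verdict": "pass" if i % 4 else "fail"} for i in range(40)]
review_set = stratified_sample(records, key=lambda r: r["verdict"], per_stratum=5)
```

A uniform random sample of the same size would be dominated by the majority "pass" bucket; stratifying guarantees SMEs see failures too.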
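Cohen's kappa, mentioned above, is simple to compute from paired judge/human labels. A minimal sketch with toy labels (the labels are illustrative; in practice `sklearn.metrics.cohen_kappa_score` computes the same quantity):

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    # Observed agreement: fraction of items where judge and human match.
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    p_e = sum(judge_freq[lab] * human_freq[lab] for lab in judge_freq) / (n * n)
    # Undefined when p_e == 1 (both raters always emit one identical label).
    return (p_o - p_e) / (1 - p_e)

judge = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
```

On these toy labels the raw agreement rate is 0.75, but kappa is only about 0.47, which is exactly the gap the summary warns about: raw agreement overstates alignment because some matches happen by chance.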