How to Calibrate Your LLM Judge With Human Annotations
Blog post from Galileo
An LLM judge (large language model judge) needs continuous calibration, not a one-time setup, to stay accurate and relevant in a dynamic production environment. Over time, model updates, prompt drift, and domain shifts cause judge drift: the judge's evaluations stop aligning with human expert judgments, and the resulting errors compound in downstream applications.

Calibration counteracts this with three moving parts: stratified sampling to capture a representative range of outputs, subject matter expert (SME) review to provide feedback and corrections, and updates to anchor examples and rubrics based on that input (each step is sketched in code below).

Inter-rater reliability (IRR) metrics such as Cohen's kappa are crucial for assessing how closely the judge tracks human reviewers. Because kappa discounts the agreement expected by chance, it gives a more accurate picture of judge performance than raw agreement rates.

Continuous calibration is therefore a systematic loop of sampling, feedback collection, anchor updating, and validation that keeps the judge aligned with human expectations as production demands evolve.
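To make the sampling step concrete, here is a minimal stratified-sampling sketch in Python. The `segment` field, the per-stratum batch size, and the example data are assumptions for illustration; in practice you would stratify on whatever dimensions matter for your application (intent, topic, score band, model version).

```python
import random
from collections import defaultdict

def stratified_sample(outputs, key, per_stratum, seed=42):
    """Draw up to `per_stratum` items from each stratum so that rare
    segments are represented instead of drowned out by the majority."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in outputs:
        strata[item[key]].append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# Hypothetical production outputs tagged by use case.
outputs = [{"id": i, "segment": s, "text": "..."}
           for i, s in enumerate(["billing", "support", "onboarding"] * 40)]
review_batch = stratified_sample(outputs, key="segment", per_stratum=10)
```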
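For the validation step, Cohen's kappa corrects the observed agreement rate p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A quick illustration with scikit-learn, using hypothetical pass/fail verdicts on ten sampled outputs:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical verdicts from the judge and an SME on the same ten outputs.
judge_labels = ["pass", "pass", "fail", "pass", "fail",
                "pass", "pass", "fail", "fail", "pass"]
sme_labels   = ["pass", "fail", "fail", "pass", "fail",
                "pass", "pass", "pass", "fail", "pass"]

kappa = cohen_kappa_score(judge_labels, sme_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect, 0.0 = chance-level
```

Here the judge and SME agree on 8 of 10 items (80% raw agreement), but kappa comes out to roughly 0.58 once chance agreement is discounted; that gap is exactly why kappa is the more trustworthy alignment signal.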
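Putting the pieces together, one calibration cycle could look like the following schematic. The `judge`, `collect_sme_labels`, and `anchor_store` interfaces are hypothetical placeholders for your evaluation stack, and the 0.7 kappa threshold is an assumed bar to tune per application, not a universal standard:

```python
from sklearn.metrics import cohen_kappa_score

KAPPA_THRESHOLD = 0.7  # assumed alignment bar; tune for your application

def calibration_cycle(outputs, judge, collect_sme_labels, anchor_store):
    # 1. Sample a representative batch (reusing stratified_sample above).
    batch = stratified_sample(outputs, key="segment", per_stratum=10)

    # 2. Score with the judge and collect SME verdicts on the same items.
    judge_labels = [judge(item) for item in batch]
    sme_labels = collect_sme_labels(batch)

    # 3. Turn each disagreement into a corrected anchor example.
    for item, j_label, h_label in zip(batch, judge_labels, sme_labels):
        if j_label != h_label:
            anchor_store.add(example=item, correct_label=h_label)

    # 4. Validate: re-score and check alignment before trusting the judge
    #    again (assumes the judge reads its anchors from anchor_store).
    kappa = cohen_kappa_score([judge(item) for item in batch], sme_labels)
    return kappa >= KAPPA_THRESHOLD, kappa
```

A cycle that falls below the threshold is a signal to run another SME pass over the anchors and rubric before the judge goes back into production.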