How to Calibrate Your LLM Judge With Human Annotations
Blog post from Galileo
An LLM judge (large language model judge) needs continuous calibration, not a one-time setup, to stay accurate and relevant in a dynamic production environment. Over time, model updates, prompt drift, and domain shifts cause judge drift: the judge's evaluations stop aligning with human expert judgments, and the resulting errors compound in downstream applications.

Calibration counteracts this with three moving parts: stratified sampling to capture a representative range of outputs, subject matter expert (SME) review to provide feedback and corrections, and updates to anchor examples and rubrics based on that input (each step is sketched in code below).

Inter-rater reliability (IRR) metrics such as Cohen's kappa are crucial for assessing how closely the judge tracks human reviewers. Because kappa discounts the agreement expected by chance, it gives a more accurate picture of judge performance than raw agreement rates.

Continuous calibration is therefore a systematic loop of sampling, feedback collection, anchor updating, and validation that keeps the judge aligned with human expectations as production demands evolve.
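To make the sampling step concrete, here is a minimal stratified-sampling sketch in Python. The `segment` field, the per-stratum batch size, and the example data are assumptions for illustration; in practice you would stratify on whatever dimensions matter for your application (intent, topic, score band, model version).

```python
import random
from collections import defaultdict

def stratified_sample(outputs, key, per_stratum, seed=42):
    """Draw up to `per_stratum` items from each stratum so that rare
    segments are represented instead of drowned out by the majority."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in outputs:
        strata[item[key]].append(item)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# Hypothetical production outputs tagged by use case.
outputs = [{"id": i, "segment": s, "text": "..."}
           for i, s in enumerate(["billing", "support", "onboarding"] * 40)]
review_batch = stratified_sample(outputs, key="segment", per_stratum=10)
```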
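For the validation step, Cohen's kappa corrects the observed agreement rate p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). A quick illustration with scikit-learn, using hypothetical pass/fail verdicts on ten sampled outputs:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical verdicts from the judge and an SME on the same ten outputs.
judge_labels = ["pass", "pass", "fail", "pass", "fail",
                "pass", "pass", "fail", "fail", "pass"]
sme_labels   = ["pass", "fail", "fail", "pass", "fail",
                "pass", "pass", "pass", "fail", "pass"]

kappa = cohen_kappa_score(judge_labels, sme_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect, 0.0 = chance-level
```

Here the judge and SME agree on 8 of 10 items (80% raw agreement), but kappa comes out to roughly 0.58 once chance agreement is discounted; that gap is exactly why kappa is the more trustworthy alignment signal.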
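Putting the pieces together, one calibration cycle could look like the following schematic. The `judge`, `collect_sme_labels`, and `anchor_store` interfaces are hypothetical placeholders for your evaluation stack, and the 0.7 kappa threshold is an assumed bar to tune per application, not a universal standard:

```python
from sklearn.metrics import cohen_kappa_score

KAPPA_THRESHOLD = 0.7  # assumed alignment bar; tune for your application

def calibration_cycle(outputs, judge, collect_sme_labels, anchor_store):
    # 1. Sample a representative batch (reusing stratified_sample above).
    batch = stratified_sample(outputs, key="segment", per_stratum=10)

    # 2. Score with the judge and collect SME verdicts on the same items.
    judge_labels = [judge(item) for item in batch]
    sme_labels = collect_sme_labels(batch)

    # 3. Turn each disagreement into a corrected anchor example.
    for item, j_label, h_label in zip(batch, judge_labels, sme_labels):
        if j_label != h_label:
            anchor_store.add(example=item, correct_label=h_label)

    # 4. Validate: re-score and check alignment before trusting the judge
    #    again (assumes the judge reads its anchors from anchor_store).
    kappa = cohen_kappa_score([judge(item) for item in batch], sme_labels)
    return kappa >= KAPPA_THRESHOLD, kappa
```

A cycle that falls below the threshold is a signal to run another SME pass over the anchors and rubric before the judge goes back into production.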