
What is an LLM-as-a-judge? When to use it (and when to use deterministic evals)

Blog post from Braintrust

Post Details
Company: Braintrust
Date Published: -
Author: -
Word Count: 3,008
Language: English
Hacker News Points: -
Summary

LLM-as-a-judge is an evaluation technique in which one large language model (LLM) scores the outputs of another against clearly defined, natural-language criteria such as relevance, factual accuracy, and tone, qualities that manual review or traditional metrics like BLEU or ROUGE might miss. The approach scales evaluation efficiently, producing structured scores or verdicts on subjective dimensions that rule-based checks cannot reliably measure. It still requires careful implementation: judges can produce unreliable factual verdicts when no reference answer is provided, and they are prone to position bias, verbosity bias, and self-enhancement bias. The post recommends using a calibration set, running adversarial tests, and maintaining ongoing human spot checks to keep results reliable. Braintrust offers a platform that supports this evaluation approach, helping teams create custom scorers, run evaluations across development and production, and maintain consistent quality standards through CI/CD integration.
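To make the technique concrete, here is a minimal sketch of an LLM-as-a-judge scorer along the lines the summary describes: a judge prompt with explicit criteria, a structured JSON verdict, and scores normalized to 0–1 so they compose with deterministic checks. The `call_llm` function is a hypothetical placeholder (it returns a canned verdict so the sketch runs offline); in practice you would swap in your model provider's chat-completion client. This is an illustrative sketch, not Braintrust's actual scorer API.

```python
import json

# Judge prompt with explicit, natural-language criteria (relevance,
# factual accuracy, tone) and a structured JSON output format.
JUDGE_PROMPT = """You are an impartial judge. Rate the RESPONSE to the QUESTION
on relevance, accuracy, and tone, each as an integer from 1 to 5.
Reply with JSON only: {{"relevance": int, "accuracy": int, "tone": int, "rationale": str}}

QUESTION: {question}
RESPONSE: {response}"""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real chat-completion API call.
    # Returns a canned verdict so this sketch runs without an API key.
    return json.dumps({"relevance": 5, "accuracy": 4, "tone": 5,
                       "rationale": "On-topic, mostly accurate, polite."})

def judge(question: str, response: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    verdict = json.loads(raw)
    # Normalize 1-5 scores to 0-1 so they can be combined with
    # deterministic, rule-based checks on the same scale.
    return {k: (v - 1) / 4 for k, v in verdict.items() if isinstance(v, int)}

scores = judge("What is BLEU?",
               "BLEU measures n-gram overlap between output and references.")
print(scores)  # e.g. {'relevance': 1.0, 'accuracy': 0.75, 'tone': 1.0}
```

Returning a rationale alongside the scores is what makes judge outputs auditable during the human spot checks the post recommends: reviewers can check whether the judge's stated reasoning actually supports its verdict.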