
What is an LLM-as-a-judge? When to use it (and when to use deterministic evals)

Blog post from Braintrust

Post Details
Company: Braintrust
Date Published: -
Author: -
Word Count: 3,008
Language: English
Hacker News Points: -
Summary

LLM-as-a-judge is an evaluation technique in which one large language model (LLM) scores the outputs of another against clearly defined, natural-language criteria such as relevance, factual accuracy, and tone, qualities that manual review or traditional metrics like BLEU or ROUGE might miss. The approach scales evaluation efficiently, producing structured scores or verdicts on subjective dimensions that rule-based checks cannot reliably measure. It still requires careful implementation: judges can produce unreliable factual verdicts when no reference answer is provided, and they are prone to position bias, verbosity bias, and self-enhancement bias. The post recommends using a calibration set, running adversarial tests, and maintaining ongoing human spot checks to keep results reliable. Braintrust offers a platform that supports this evaluation approach, helping teams create custom scorers, run evaluations across development and production, and maintain consistent quality standards through CI/CD integration.
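To make the technique concrete, here is a minimal sketch of an LLM-as-a-judge scorer along the lines the summary describes: a judge prompt with explicit criteria, a structured JSON verdict, and scores normalized to 0–1 so they compose with deterministic checks. The `call_llm` function is a hypothetical placeholder (it returns a canned verdict so the sketch runs offline); in practice you would swap in your model provider's chat-completion client. This is an illustrative sketch, not Braintrust's actual scorer API.

```python
import json

# Judge prompt with explicit, natural-language criteria (relevance,
# factual accuracy, tone) and a structured JSON output format.
JUDGE_PROMPT = """You are an impartial judge. Rate the RESPONSE to the QUESTION
on relevance, accuracy, and tone, each as an integer from 1 to 5.
Reply with JSON only: {{"relevance": int, "accuracy": int, "tone": int, "rationale": str}}

QUESTION: {question}
RESPONSE: {response}"""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real chat-completion API call.
    # Returns a canned verdict so this sketch runs without an API key.
    return json.dumps({"relevance": 5, "accuracy": 4, "tone": 5,
                       "rationale": "On-topic, mostly accurate, polite."})

def judge(question: str, response: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    verdict = json.loads(raw)
    # Normalize 1-5 scores to 0-1 so they can be combined with
    # deterministic, rule-based checks on the same scale.
    return {k: (v - 1) / 4 for k, v in verdict.items() if isinstance(v, int)}

scores = judge("What is BLEU?",
               "BLEU measures n-gram overlap between output and references.")
print(scores)  # e.g. {'relevance': 1.0, 'accuracy': 0.75, 'tone': 1.0}
```

Returning a rationale alongside the scores is what makes judge outputs auditable during the human spot checks the post recommends: reviewers can check whether the judge's stated reasoning actually supports its verdict.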