LLM as a Judge prompts: templates, rubrics, and best practices | Galtea Blog

Blog post from Galtea

Post Details
Company: Galtea
Date Published:
Author: -
Word Count: 4,027
Language: English
Hacker News Points: -
Summary

The post is an in-depth guide to creating and optimizing LLM-as-a-judge prompts: small programs that evaluate AI-generated content against specific criteria. A successful judge prompt has four essential parts: a criterion definition written in domain-specific vocabulary, a reasoning structure for claim-by-claim evaluation, a deterministic scoring rule, and explicit handling of edge cases. The guide stresses precise rubric design as the basis for accurate and reliable judgments, and cautions that vague language or overly complex rationale structures can produce biased or inconsistent results. It also covers common pitfalls in judge prompt design, such as an implicit preference for longer answers or mixing generator instructions with judge instructions, and suggests best practices for calibration, including versioning prompts alongside gold sets so that any alignment regression can be tracked and attributed. The post advises against writing a custom prompt when deterministic checks are sufficient or when a calibrated, published prompt is available, and underscores that judge prompts are hypotheses that require rigorous testing and refinement before deployment.
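The four-part structure and the versioning advice in the summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the templates from the Galtea post: `JudgePrompt`, `score_from_verdicts`, and `agreement` are hypothetical names, the "pass only if every claim is supported" rule is one possible deterministic scoring rule chosen for the example, and the call to an actual LLM judge is left out entirely.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgePrompt:
    """Hypothetical container for the four parts named in the summary."""
    criterion: str        # criterion definition in domain-specific vocabulary
    reasoning_steps: str  # claim-by-claim reasoning structure
    scoring_rule: str     # deterministic mapping from verdicts to a score
    edge_cases: str       # e.g. empty, refused, or off-topic answers

    def render(self, answer: str) -> str:
        """Assemble the text that would be sent to the judge model."""
        return "\n\n".join([
            f"CRITERION:\n{self.criterion}",
            f"REASONING:\n{self.reasoning_steps}",
            f"SCORING RULE:\n{self.scoring_rule}",
            f"EDGE CASES:\n{self.edge_cases}",
            f"ANSWER TO EVALUATE:\n{answer}",
        ])

    def version(self) -> str:
        """Short content hash, so gold-set results can be tied to exact prompt text."""
        parts = [self.criterion, self.reasoning_steps,
                 self.scoring_rule, self.edge_cases]
        digest = hashlib.sha256("\x1f".join(parts).encode("utf-8"))
        return digest.hexdigest()[:12]


def score_from_verdicts(verdicts: list[bool]) -> int:
    """Assumed rule for this sketch: 1 only if every extracted claim is supported."""
    if not verdicts:
        return 0  # edge case: no claims were extracted from the answer
    return 1 if all(verdicts) else 0


def agreement(judge_scores: list[int], gold_scores: list[int]) -> float:
    """Fraction of gold-set items where the judge matches the human label."""
    if not gold_scores or len(judge_scores) != len(gold_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(j == g for j, g in zip(judge_scores, gold_scores))
    return matches / len(gold_scores)
```

Storing the value of `version()` next to each gold-set agreement figure is one way to attribute a later drop in agreement to a specific prompt edit rather than to a change in the data or the judge model.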