G-Eval is a research-backed evaluation framework for creating custom LLM-as-a-judge metrics that can evaluate any natural language generation task simply by writing evaluation criteria in natural language. It leverages an automatic chain-of-thought (CoT) approach to decompose the criteria and evaluate LLM outputs through a three-step process: Evaluation Step Generation, Judging, and Scoring. G-Eval was first introduced in the paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" as a superior alternative to traditional reference-based metrics like BLEU and ROUGE, which struggle with subjective and open-ended tasks that require creativity, nuance, and an understanding of word semantics.

G-Eval makes for great LLM evaluation metrics because it is accurate, easily tunable, and surprisingly consistent across runs. It addresses common pitfalls of LLM-based evaluation such as inconsistent scoring, lack of fine-grained judgment, verbosity bias, narcissistic bias, and more.

G-Eval can be implemented in 5 lines of code using DeepEval, which provides a flexible way to define custom metrics tailored to your specific LLM application. It is well suited to subjective and open-ended tasks like tone, helpfulness, or persuasiveness, and it can also be integrated within a Deep Acyclic Graph (DAG) setup to combine the interpretability of decision trees with the nuance of G-Eval scoring. The most commonly used G-Eval metrics include Answer Correctness, Coherence, Tonality, Safety, and Custom RAG evaluation, and a minimal example of defining one is sketched below.
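As an illustration, here is a minimal sketch of a custom correctness metric built with DeepEval's `GEval` class. The metric name, criteria string, and placeholder test-case values are assumptions chosen for this example, and the snippet assumes a recent version of DeepEval with a configured LLM provider:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define a custom G-Eval metric by stating the evaluation criteria in plain language.
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

# Evaluate a single test case (placeholder values for illustration).
test_case = LLMTestCase(
    input="What is the boiling point of water at sea level?",
    actual_output="Water boils at 100 degrees Celsius at sea level.",
    expected_output="100 °C (212 °F) at standard atmospheric pressure.",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```

Under the hood, the judge model turns the `criteria` string into evaluation steps via CoT, scores the test case against them, and returns both a score and a reason, which is what makes the metric easy to tune without touching any prompt templates yourself.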