
Evaluating RAG Part II: How to Evaluate a Large Language Model (LLM)

Blog post from deepset

Post Details
Company: deepset
Date Published:
Author: Isabelle Nguyen
Word Count: 1,715
Language: English
Hacker News Points: -
Summary

Evaluating the output of large language models (LLMs) is challenging because their answers are open-ended and depend heavily on context. To address this, various evaluation methods have been proposed, including lexical metrics such as BLEU, ROUGE, and F1, which compare predicted and reference texts token by token, emphasizing precision, recall, or a combination of the two. These metrics have clear limitations, however: they do not recognize semantic similarity between differently worded answers, and they can be fooled by simple changes in word order. More promising are transformer-based metrics such as semantic answer similarity (SAS), which quantify how close the LLM's prediction is to the ground truth regardless of the vocabulary used. User feedback remains essential as well, providing valuable insight into real-world usage and pain points. An ideal LLM metric would adapt to different use cases, weighing factors such as helpfulness, brevity, and groundedness, and allowing users to compare models along these dimensions.
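
To make the contrast concrete, below is a minimal sketch (assuming the sentence-transformers library and an off-the-shelf embedding model, both choices of this illustration rather than anything specified in the post) of how a lexical token-overlap F1 and an embedding-based similarity score treat two answers that share meaning but little surface vocabulary.

```python
# Minimal sketch: lexical F1 vs. embedding-based semantic similarity.
# Assumes the sentence-transformers package; the model name below is an
# illustrative choice, not necessarily the one used by any SAS implementation.
from collections import Counter

from sentence_transformers import SentenceTransformer, util


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def semantic_similarity(prediction: str, ground_truth: str, model) -> float:
    """Cosine similarity of sentence embeddings, independent of shared vocabulary."""
    embeddings = model.encode([prediction, ground_truth])
    return float(util.cos_sim(embeddings[0], embeddings[1]))


model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
prediction = "Berlin is the capital of Germany."
ground_truth = "Germany's capital city is Berlin."

print(token_f1(prediction, ground_truth))                     # low: few exact token matches
print(semantic_similarity(prediction, ground_truth, model))   # high: same meaning
```

The cosine-similarity score here is a simplified stand-in for the idea behind SAS-style metrics, which rely on a trained transformer model to judge whether prediction and ground truth mean the same thing rather than whether they use the same words.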