This new metric for evaluating question answering systems is called Semantic Answer Similarity (SAS). SAS measures the semantic similarity between two answer strings rather than just their lexical overlap, which makes it a closer approximation of human judgment than existing metrics such as Exact Match (EM) and F1. Under the hood, SAS uses a cross-encoder architecture that leverages a pre-trained semantic textual similarity model to score a pair of strings, returning a value between zero and one, where higher scores indicate greater semantic similarity.

To use SAS in Haystack, users initialize the SAS model together with the EvalAnswers() node and run the pipeline to evaluate their question answering system.

SAS is not without limitations: because it rewards semantic similarity rather than correctness, it can award a high score to an answer that is semantically close to the gold answer but factually wrong. Nevertheless, SAS gives a more faithful picture of how well a question answering system is doing than EM and F1 alone.
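To see why a purely lexical metric under-credits correct answers, consider a minimal sketch of SQuAD-style token-level F1 (a standard formulation, not necessarily Haystack's exact implementation). Two semantically equivalent answers with no shared tokens score zero, whereas a SAS-style semantic model would rate them as highly similar:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1 between two answer strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Count tokens appearing in both answers (multiset intersection).
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Semantically equivalent answers, yet zero lexical overlap:
print(token_f1("the US capital", "Washington, D.C."))  # 0.0
```

This is the failure mode SAS addresses: the cross-encoder scores the pair as a whole instead of matching tokens, so paraphrased but correct answers are no longer penalized.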