Company
Date Published
Author
Hui Wen Goh, Jonas Mueller
Word count
1107
Language
English
Hacker News points
None

Summary

Ensuring factual accuracy in Large Language Models (LLMs) is crucial because of their tendency to produce confidently incorrect answers, known as hallucinations. OpenAI's SimpleQA benchmark, a dataset of over 4,000 fact-seeking questions, measures both how accurately LLMs such as GPT-4o answer and how reliably they abstain when unsure. Even though GPT-4o can abstain by answering "I don't know," it still answers 58.5% of SimpleQA queries incorrectly. The Trustworthy Language Model (TLM) improves reliability by scoring the trustworthiness of each LLM response, flagging low-confidence responses, and substituting a fallback answer for those that score below a threshold. TLM can also automatically improve responses, further reducing the rate of incorrect answers without changing the underlying LLM or its prompts. Applying a more stringent trustworthiness threshold lowers the error rate further, though it also reduces the fraction of queries that receive a direct (and correct) answer. Because these methods wrap the base model rather than modify it, they generalize to other LLMs, including GPT-4o mini.
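
The thresholded-fallback idea described above fits in a few lines of code. Below is a minimal sketch, assuming the cleanlab_studio Python client, whose TLM.prompt() returns a dict containing a "response" and a "trustworthiness_score" (per Cleanlab's documentation); the 0.8 threshold, the fallback string, and the placeholder API key are illustrative choices, not values from the article.

```python
# Sketch: use a trustworthiness score to decide when to return a fallback
# answer instead of a potentially hallucinated one.
from cleanlab_studio import Studio

studio = Studio("<your_api_key>")  # placeholder; use your own key
tlm = studio.TLM()

THRESHOLD = 0.8          # illustrative; stricter => fewer wrong answers, more abstentions
FALLBACK = "I don't know."

def answer(question: str) -> str:
    # TLM returns the LLM response along with a trustworthiness score in [0, 1].
    out = tlm.prompt(question)
    if out["trustworthiness_score"] < THRESHOLD:
        return FALLBACK  # abstain on low-confidence responses
    return out["response"]

print(answer("What year was the Eiffel Tower completed?"))
```

Raising the threshold trades coverage for accuracy: fewer incorrect answers are returned, but more queries receive the fallback, including some the model would have answered correctly.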