
HellaSwag: Understanding the LLM Benchmark for Commonsense Reasoning

Blog post from Deepgram

Post Details
- Author: Brad Nikkel
- Word Count: 835
- Language: English
Summary

HellaSwag is a large language model (LLM) benchmark introduced by Zellers et al. in 2019 to evaluate commonsense reasoning in LLMs. The dataset tests commonsense natural language inference (NLI) about physical situations: given a context, a model must choose the most plausible ending from several candidates. Adversarial filtering is used to generate deceptive, challenging incorrect endings for this multiple-choice setting. When HellaSwag was released, state-of-the-art models such as BERT showed poor commonsense reasoning: human accuracy soared above 95%, while these cutting-edge models mustered accuracies below 50%. Since its release, HellaSwag has pushed the field to evolve its benchmarks and to improve LLM performance.
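The multiple-choice setup described above can be sketched in a few lines: score each candidate ending, pick the highest-scoring one, and report accuracy against the labeled answer. This is a minimal illustration, not Deepgram's or the paper's code; the word-overlap scorer and the example data below are stand-ins (a real evaluation would use a language model's length-normalized log-likelihood per ending).

```python
# Minimal sketch of HellaSwag-style multiple-choice evaluation.
# The scoring function is a toy stand-in (word overlap with the context);
# real evaluations score each ending with an LM's log-likelihood.

def score_ending(context: str, ending: str) -> float:
    """Toy plausibility score: fraction of ending words seen in the context."""
    ctx_words = set(context.lower().split())
    end_words = ending.lower().split()
    if not end_words:
        return 0.0
    return sum(w in ctx_words for w in end_words) / len(end_words)

def evaluate(examples) -> float:
    """Pick the highest-scoring ending for each example; return accuracy."""
    correct = 0
    for ex in examples:
        scores = [score_ending(ex["ctx"], e) for e in ex["endings"]]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += pred == ex["label"]
    return correct / len(examples)

# Hypothetical example in HellaSwag's shape: a context, four candidate
# endings, and the index of the correct ending.
examples = [
    {
        "ctx": "A man pours flour into a bowl and cracks two eggs",
        "endings": [
            "then the man whisks the eggs and flour together in the bowl",
            "a dog runs across a frozen lake",
            "the crowd cheers at the stadium",
            "she parks the car in the garage",
        ],
        "label": 0,
    },
]

print(f"accuracy: {evaluate(examples):.2f}")
```

Even this crude scorer picks the in-context ending here; HellaSwag's adversarially filtered distractors are designed so that such shallow cues are not enough, which is why pre-2019 models stayed below 50%.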