
LLM Structured Output Benchmarks are Riddled with Mistakes

Blog post from Cleanlab

Post Details

Company: Cleanlab
Date Published: -
Author: Hui Wen Goh and Jonas Mueller
Word Count: 1,659
Language: English
Hacker News Points: -
Summary

A recent analysis of Structured Outputs from leading Large Language Models (LLMs) found significant errors in the ground-truth outputs of popular benchmark datasets, highlighting how difficult it is to produce accurate annotations, even for human annotators. To address this, four new Structured Outputs benchmarks with verified, high-quality ground-truth outputs have been introduced, covering diverse applications: Data Table Analysis, Insurance Claims Extraction, Financial Entities Extraction, and PII Extraction. These benchmarks provide clear, consistent annotations that enable more reliable evaluation of LLM capabilities. The study compared various LLMs, including OpenAI's GPT-5 and GPT-4.1-mini and Google's Gemini models, using metrics such as Field Accuracy and Output Accuracy. Despite its higher cost, GPT-5 performed best in Financial Entities Extraction, while Gemini-3-Pro excelled in Data Table Analysis. The analysis suggests that smaller models like GPT-4.1-mini and Gemini-2.5-Flash offer cost and latency advantages, and that OpenAI's models are generally a cost-effective choice for Structured Output tasks, though accuracy varies by use case.
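The summary mentions two evaluation metrics, Field Accuracy and Output Accuracy. The post itself defines them precisely; as a rough illustration of the distinction, a field-level metric credits each correctly extracted field, while an output-level metric credits an extraction only when every field matches the ground truth. A minimal sketch, assuming exact-match comparison on flat key-value outputs (the exact definitions and any normalization used in the benchmarks are assumptions here):

```python
from typing import Any


def field_accuracy(pred: dict[str, Any], truth: dict[str, Any]) -> float:
    """Fraction of ground-truth fields the prediction matches exactly."""
    if not truth:
        return 1.0
    correct = sum(1 for key, value in truth.items() if pred.get(key) == value)
    return correct / len(truth)


def output_accuracy(pred: dict[str, Any], truth: dict[str, Any]) -> bool:
    """Whole output counts as correct only if every field matches."""
    return pred == truth


# Hypothetical financial-entities example: two of three fields match.
truth = {"entity": "Acme Corp", "amount": 1200.0, "currency": "USD"}
pred = {"entity": "Acme Corp", "amount": 1200.0, "currency": "EUR"}

print(field_accuracy(pred, truth))   # partial credit per field
print(output_accuracy(pred, truth))  # strict all-or-nothing
```

A single wrong field (here, `currency`) leaves Field Accuracy high but drives Output Accuracy to zero, which is why the two metrics can rank models differently on the same benchmark.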