LLM Structured Output Benchmarks are Riddled with Mistakes
Blog post from Cleanlab
In a recent analysis of Structured Outputs from leading Large Language Models (LLMs), significant errors were found in the ground-truth outputs of popular benchmark datasets, highlighting how difficult it is to produce accurate annotations, even for human annotators.

To address this, four new Structured Outputs benchmarks with verified, high-quality ground-truth outputs have been introduced, covering diverse applications: Data Table Analysis, Insurance Claims Extraction, Financial Entities Extraction, and PII Extraction. These benchmarks are designed to enable more reliable evaluation of LLM capabilities by providing clear and consistent annotations.

The study compared the performance of various LLMs, including OpenAI's GPT-5 and GPT-4.1-mini and Google's Gemini models, using metrics such as Field Accuracy and Output Accuracy. Despite its higher cost, GPT-5 demonstrated superior performance in Financial Entities Extraction, while Gemini-3-Pro excelled in Data Table Analysis.

The analysis suggests that smaller models like GPT-4.1-mini and Gemini-2.5-Flash can offer cost and latency advantages, and that OpenAI's models are generally recommended for Structured Output tasks on cost-effectiveness grounds, though accuracy varies with the specific use case.
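The post names two metrics but does not spell out their formulas. A minimal sketch, assuming the common definitions: Field Accuracy is the fraction of individual ground-truth fields a model extracted correctly, and Output Accuracy is the stricter fraction of examples where the entire structured output matches the ground truth. The function names and dict-based representation here are illustrative assumptions, not the benchmark's actual code.

```python
from typing import Any

def field_accuracy(pred: dict[str, Any], truth: dict[str, Any]) -> float:
    """Fraction of ground-truth fields the prediction got exactly right.

    Assumed definition: each ground-truth key/value pair counts as one field.
    """
    if not truth:
        return 1.0
    correct = sum(1 for key, value in truth.items() if pred.get(key) == value)
    return correct / len(truth)

def benchmark_scores(preds: list[dict], truths: list[dict]) -> tuple[float, float]:
    """Return (mean Field Accuracy, Output Accuracy) over a dataset.

    Output Accuracy credits an example only when every field matches and
    the prediction introduces no extra fields (a strict exact-match rule).
    """
    field_accs = [field_accuracy(p, t) for p, t in zip(preds, truths)]
    exact = [
        fa == 1.0 and set(p) == set(t)
        for fa, p, t in zip(field_accs, preds, truths)
    ]
    mean_field = sum(field_accs) / len(field_accs)
    output_acc = sum(exact) / len(exact)
    return mean_field, output_acc
```

Under these definitions, a model that gets one of two fields wrong on an example still earns 0.5 Field Accuracy there but zero Output Accuracy credit, which is why the two metrics can diverge sharply on multi-field extraction tasks.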