LLM Structured Output Benchmarks are Riddled with Mistakes
Blog post from Cleanlab
In a recent analysis of Structured Outputs from leading Large Language Models (LLMs), significant errors were found in the ground-truth outputs of popular benchmark datasets, highlighting how difficult it is to produce accurate annotations, even for human annotators.

To address this, four new Structured Outputs benchmarks with verified, high-quality ground-truth outputs have been introduced, covering diverse applications: Data Table Analysis, Insurance Claims Extraction, Financial Entities Extraction, and PII Extraction. These benchmarks are designed to enable more reliable evaluation of LLM capabilities by providing clear and consistent annotations.

The study compared the performance of various LLMs, including OpenAI's GPT-5 and GPT-4.1-mini and Google's Gemini models, using metrics such as Field Accuracy and Output Accuracy. Despite its higher cost, GPT-5 demonstrated superior performance in Financial Entities Extraction, while Gemini-3-Pro excelled in Data Table Analysis.

The analysis suggests that smaller models like GPT-4.1-mini and Gemini-2.5-Flash can offer cost and latency advantages, and that OpenAI's models are generally recommended for Structured Output tasks on cost-effectiveness grounds, though accuracy varies with the specific use case.
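The post names two metrics but does not spell out their formulas. A minimal sketch, assuming the common definitions: Field Accuracy is the fraction of individual ground-truth fields a model extracted correctly, and Output Accuracy is the stricter fraction of examples where the entire structured output matches the ground truth. The function names and dict-based representation here are illustrative assumptions, not the benchmark's actual code.

```python
from typing import Any

def field_accuracy(pred: dict[str, Any], truth: dict[str, Any]) -> float:
    """Fraction of ground-truth fields the prediction got exactly right.

    Assumed definition: each ground-truth key/value pair counts as one field.
    """
    if not truth:
        return 1.0
    correct = sum(1 for key, value in truth.items() if pred.get(key) == value)
    return correct / len(truth)

def benchmark_scores(preds: list[dict], truths: list[dict]) -> tuple[float, float]:
    """Return (mean Field Accuracy, Output Accuracy) over a dataset.

    Output Accuracy credits an example only when every field matches and
    the prediction introduces no extra fields (a strict exact-match rule).
    """
    field_accs = [field_accuracy(p, t) for p, t in zip(preds, truths)]
    exact = [
        fa == 1.0 and set(p) == set(t)
        for fa, p, t in zip(field_accs, preds, truths)
    ]
    mean_field = sum(field_accs) / len(field_accs)
    output_acc = sum(exact) / len(exact)
    return mean_field, output_acc
```

Under these definitions, a model that gets one of two fields wrong on an example still earns 0.5 Field Accuracy there but zero Output Accuracy credit, which is why the two metrics can diverge sharply on multi-field extraction tasks.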