
We graded 19 LLMs on SQL. You graded us.

Blog post from Tinybird

Post Details

Company: Tinybird
Date Published: —
Author: Victor Ramirez Garcia
Word Count: 1,361
Language: English
Hacker News Points: —
Summary

A recent benchmark comparing 19 large language models and one human on generating analytical SQL queries drew a wide range of feedback from online communities. The main criticisms: the scoring method penalized models for producing correct but structurally different SQL; the benchmark queries were too simple to reflect real-world analytical challenges; and certain SQL function choices were unfairly penalized, prioritizing exactness over performance.

The benchmark used a controlled dataset to eliminate variables, but this also simplified the task compared to real-world scenarios, where schemas are often ambiguous. Suggested improvements included a second grading pass based on result equivalence, more complex and realistic queries, and community-sourced benchmarks that simulate more challenging data environments. The benchmark emphasized efficiency, considering both execution speed and data scanned, but this was not always clearly communicated.

Moving forward, the team plans to refine the benchmark by integrating community feedback, enhancing the scoring methodology, and accounting for real-world complexity to better evaluate how well language models write SQL.