
We graded 19 LLMs on SQL. You graded us.

Blog post from Tinybird

Post Details

Company: Tinybird
Date Published: —
Author: Victor Ramirez Garcia
Word Count: 1,361
Language: English
Hacker News Points: —
Summary

A recent benchmark comparing 19 large language models and one human on generating analytical SQL queries drew a wide range of feedback from online communities. The main criticisms: the scoring method penalized models for producing correct but structurally different SQL; the benchmark queries were too simple to reflect real-world analytical challenges; and certain SQL function choices were unfairly penalized, prioritizing exactness over performance.

The benchmark used a controlled dataset to eliminate variables, but this also simplified the task compared to real-world scenarios, where schemas are often ambiguous. Suggested improvements included a second grading pass based on result equivalence, more complex and realistic queries, and community-sourced benchmarks that simulate more challenging data environments. The benchmark emphasized efficiency, considering both execution speed and data scanned, but this was not always clearly communicated.

Moving forward, the team plans to refine the benchmark by integrating community feedback, enhancing the scoring methodology, and accounting for real-world complexity to better evaluate how well language models write SQL.