Which LLM writes the best analytical SQL?
Blog post from Tinybird
In a comprehensive evaluation, 19 large language models (LLMs) and a human engineer were benchmarked on their ability to generate analytical SQL queries against a 200 million-row dataset of GitHub events. The benchmark poses 50 analytical questions inspired by ClickHouse and measures success rate, efficiency, query latency, and other performance metrics against human-written queries. The dataset is stored in Tinybird, allowing query performance to be measured interactively.

Key findings indicate that while LLMs are competent at producing SQL queries, they often struggle to generate queries that are both accurate and efficient, particularly as prompt complexity increases. Human engineers still outperform LLMs on efficiency, especially in real-world analytics scenarios where semantic correctness and optimized resource usage are crucial.

The study highlights the ongoing challenge of adapting LLMs to precise data logic, emphasizing the continued need for human oversight and SQL proficiency. The benchmark serves as a tool for continuous evaluation, guiding improvements in LLM-generated SQL for interactive analytics experiences; the results are publicly accessible and open to contributions for testing further models.
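To make the task concrete, here is a minimal sketch of the kind of analytical question the benchmark asks, run against a hypothetical miniature of the GitHub events table. The schema (`repo_name`, `event_type`, `created_at`) and the sample rows are illustrative assumptions, not the benchmark's actual dataset, and sqlite3 stands in for Tinybird/ClickHouse:

```python
import sqlite3

# Hypothetical mini version of a GitHub-events table; the real benchmark
# runs against a 200M-row dataset hosted in Tinybird.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE github_events (
        repo_name  TEXT,
        event_type TEXT,
        created_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO github_events VALUES (?, ?, ?)",
    [
        ("org/alpha", "WatchEvent", "2024-01-01"),
        ("org/alpha", "WatchEvent", "2024-01-02"),
        ("org/beta",  "WatchEvent", "2024-01-02"),
        ("org/beta",  "PushEvent",  "2024-01-03"),
    ],
)

# An analytical question in the spirit of the benchmark:
# "Which repositories received the most stars (WatchEvents)?"
rows = conn.execute("""
    SELECT repo_name, COUNT(*) AS stars
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY repo_name
    ORDER BY stars DESC
""").fetchall()
print(rows)  # [('org/alpha', 2), ('org/beta', 1)]
```

A correct answer to such a question needs more than valid syntax: the model must pick the right filter and aggregation (semantic correctness) and, at 200 million rows, an execution plan that stays fast, which is where the benchmark separates the models.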