🧠 SQaLe: Enabling new Text-to-SQL models with our massive dataset

Post Details

Company

HuggingFace

Date Published

Nov. 19, 2025

Author

Cornelius Wolff, Daniel Gomm, and Madelon Hulsebos

Word Count

944

Company Posts That Month

49

Language

-

Hacker News Points

-

Source URL

huggingface.co/blog/cwolff/sqale

Summary

SQaLe is a large-scale text-to-SQL dataset designed to overcome the limitations of existing resources by providing a diverse and realistic foundation for training and evaluating models that convert natural language into SQL queries. It comprises more than 500,000 validated (schema, question, query) triples built on over 139,000 database schemas, reflects real-world schema complexity, and is available on the Hugging Face Hub for research and model fine-tuning. The dataset addresses a gap in current benchmarks: it offers the scale needed to train large language models (LLMs) and the realism of production database environments, and its SQL queries are validated for consistency with their natural-language questions. SQaLe was created by extending schemas sourced from SchemaPile and then generating diverse natural-language questions and SQL queries for them. The resulting resource supports training and evaluating text-to-SQL models, schema understanding, and benchmark testing in realistic database contexts.
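To make the idea of a validated (schema, question, query) triple concrete, here is a minimal sketch of one possible execution check. It assumes SQLite as the engine and a hypothetical helper named `validate_triple`; the summary does not say how SQaLe's own validation was performed, and the example schema, question, and queries are invented for illustration.

```python
import sqlite3


def validate_triple(schema_sql: str, question: str, query_sql: str) -> bool:
    """Return True if the query runs against an empty database built from the schema.

    Hypothetical helper: this checks executability only. It does not check
    that the query actually answers the question.
    """
    if not question.strip():
        return False
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)          # create the tables
        conn.execute(query_sql).fetchall()      # fails on unknown tables/columns
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Invented example data (not taken from SQaLe).
schema = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL
);
"""
question = "What is the total order value per customer?"
good_query = (
    "SELECT c.name, SUM(o.total) FROM customers c "
    "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
)
bad_query = "SELECT c.name, SUM(o.amount) FROM customers c JOIN orders o ON o.customer_id = c.id"

good_ok = validate_triple(schema, question, good_query)
bad_ok = validate_triple(schema, question, bad_query)  # orders has no 'amount' column
```

An execution check like this only filters out queries that do not run against the schema; confirming that a query matches the intent of its question would need an additional step, which is not sketched here.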

Trends Found in this Post

Trend	Post Mentions	Total Month Mentions	Posts	Companies	MoM
LLM	4	5,556	752	184	+14%
AI Model Fine-tuning	1	558	140	61	-27%