arxiv:2503.02240

OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale

Published on Mar 4, 2025

Abstract

Text-to-SQL, the task of translating natural language questions into SQL queries, plays a crucial role in enabling non-experts to interact with databases. While recent advancements in large language models (LLMs) have significantly enhanced text-to-SQL performance, existing approaches face notable limitations in real-world text-to-SQL applications. Prompting-based methods often depend on closed-source LLMs, which are expensive, raise privacy concerns, and lack customization. Fine-tuning-based methods, on the other hand, suffer from poor generalizability due to the limited coverage of publicly available training data. To overcome these challenges, we propose a novel and scalable text-to-SQL data synthesis framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention. Using this framework, we introduce SynSQL-2.5M, the first million-scale text-to-SQL dataset, containing 2.5 million samples spanning over 16,000 synthetic databases. Each sample includes a database, SQL query, natural language question, and chain-of-thought (CoT) solution. Leveraging SynSQL-2.5M, we develop OmniSQL, a powerful open-source text-to-SQL model available in three sizes: 7B, 14B, and 32B. Extensive evaluations across nine datasets demonstrate that OmniSQL achieves state-of-the-art performance, matching or surpassing leading closed-source and open-source LLMs, including GPT-4o and DeepSeek-V3, despite its smaller size. We release all code, datasets, and models to support further research.

Community

Hi everyone,

We are thrilled to introduce SynSQL-2.5M, a high-quality synthetic text-to-SQL dataset featuring:

  • 2,544,390 diverse and complex text-to-SQL samples, each consisting of a <database, question, SQL query, chain-of-thought solution> quad.
  • Coverage of 16,583 synthetic databases from realistic scenarios.
  • A wide range of SQL complexity levels (simple, moderate, complex, and highly complex), from single-table queries to advanced multi-table joins, functions, and common table expressions.
  • A variety of linguistic styles in natural language questions: formal, colloquial, imperative, interrogative, descriptive, concise, vague, metaphorical, and conversational.
  • Chain-of-thought (CoT) solutions provided for all data samples.
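
To make the <database, question, SQL query, chain-of-thought solution> quad concrete, here is a minimal Python sketch of how one sample could be represented and sanity-checked. The class name, field names, helper function, and toy sample are all hypothetical and are not taken from the released dataset; the sketch also assumes the SQL is written in the SQLite dialect and that the database is given as CREATE TABLE statements.

```python
import sqlite3
from dataclasses import dataclass

# Labels copied from the list above; everything else here is illustrative.
COMPLEXITY_LEVELS = ("simple", "moderate", "complex", "highly complex")
QUESTION_STYLES = (
    "formal", "colloquial", "imperative", "interrogative", "descriptive",
    "concise", "vague", "metaphorical", "conversational",
)


@dataclass
class TextToSQLSample:
    """Hypothetical container for one <database, question, SQL, CoT> quad."""
    schema_ddl: str   # CREATE TABLE statements describing the database
    question: str     # natural language question
    sql: str          # SQL query answering the question
    cot: str          # chain-of-thought solution
    complexity: str   # one of COMPLEXITY_LEVELS
    style: str        # one of QUESTION_STYLES


def sql_runs(sample: TextToSQLSample) -> bool:
    """Return True if the labels are known and the SQL executes on the schema."""
    if sample.complexity not in COMPLEXITY_LEVELS:
        return False
    if sample.style not in QUESTION_STYLES:
        return False
    conn = sqlite3.connect(":memory:")  # empty throwaway database
    try:
        conn.executescript(sample.schema_ddl)
        conn.execute(sample.sql).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# A made-up sample using a common table expression and a join.
example = TextToSQLSample(
    schema_ddl="""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            total REAL
        );
    """,
    question="Who are our three biggest spenders?",
    sql="""
        WITH totals AS (
            SELECT customer_id, SUM(total) AS spent
            FROM orders
            GROUP BY customer_id
        )
        SELECT c.name, t.spent
        FROM customers AS c
        JOIN totals AS t ON t.customer_id = c.id
        ORDER BY t.spent DESC
        LIMIT 3
    """,
    cot="Sum order totals per customer, join to names, keep the top three.",
    complexity="complex",
    style="colloquial",
)

if __name__ == "__main__":
    print(sql_runs(example))
```

A check like this only confirms that a query parses and runs against the schema; it says nothing about whether the query actually answers the question.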

As of March 2025, SynSQL-2.5M is the largest and most diverse synthetic text-to-SQL dataset to date. It represents a significant milestone in the text-to-SQL community. We encourage researchers, practitioners, and data enthusiasts to explore and build models using this dataset.

Let's dive in!
