It stands as the largest and most diverse synthetic Text-to-SQL dataset available to date.
The dataset includes:
- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct domains/verticals
- Comprehensive array of SQL tasks: data definition, retrieval, manipulation, analytics & reporting
- Wide range of SQL complexity levels, including subqueries, single joins, multiple joins, aggregations, window functions, set operations
- Database context, including table and view create statements
- Natural language explanations of what the SQL query is doing
- Contextual tags to optimize model training