---
license: apache-2.0
datasets:
- Clinton/Text-to-sql-v1
- b-mc2/sql-create-context
- gretelai/synthetic_text_to_sql
- knowrohit07/know_sql
metrics:
- rouge
- bleu
- fuzzy_match
- exact_match
base_model:
- google/flan-t5-base
pipeline_tag: text2text-generation
library_name: transformers
language:
- en
tags:
- text2sql
- transformers
- flan-t5
- seq2seq
- qlora
- peft
- fine-tuning
---
# Model Card

This model is a fine-tuned version of [Flan-T5 Base](https://huggingface.co/google/flan-t5-base) optimized to convert natural language queries into SQL statements. It was fine-tuned with **QLoRA (Quantized Low-Rank Adaptation)** and PEFT for parameter-efficient adaptation and trained on a concatenation of several high-quality text-to-SQL datasets. A live demo is available, and users can clone the repository and run inference directly from Hugging Face.

## Model Details

### Model Description

This model is designed to generate SQL queries based on a provided natural language context and query. 
It has been fine-tuned using QLoRA with 4-bit quantization and PEFT on a diverse text-to-SQL dataset. 
The model demonstrates significant improvements over the original base model, making it highly suitable for practical text-to-SQL applications.

- **Developed by:** Aarohan Verma
- **Model type:** Seq2Seq / Text-to-Text Generation (SQL Generation)
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model:** [google/flan-t5-base](https://huggingface.co/google/flan-t5-base)


### Model Sources

- **Repository:** [https://huggingface.co/aarohanverma/text2sql-flan-t5-base-qlora-finetuned](https://huggingface.co/aarohanverma/text2sql-flan-t5-base-qlora-finetuned)
- **Demo:** [Gradio Demo](https://huggingface.co/spaces/aarohanverma/text2sql-demo)

## Uses

### Direct Use

This model can be used directly for generating SQL queries from natural language inputs. 
It is particularly useful for applications in database querying and natural language interfaces for relational databases.

### Downstream Use 

The model can be further integrated into applications such as chatbots, data analytics platforms, and business intelligence tools to automate query generation.

### Out-of-Scope Use

This model is not designed for tasks outside text-to-SQL generation. 
It may not perform well for non-SQL language generation or queries outside the domain of structured data retrieval.

## Risks and Limitations

- **Risks:** Inaccurate SQL generation may lead to unexpected query behavior, especially in safety-critical environments.
- **Limitations:** The model may struggle with highly complex schemas and queries due to limited training data and inherent model capability constraints, particularly for tasks that require deep, domain-specific knowledge.

### Recommendations

Users should validate the generated SQL queries before deployment in production systems. 
Consider incorporating human-in-the-loop review for critical applications.
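
One lightweight way to apply this recommendation is to dry-run each generated query against the schema before it touches real data. The sketch below is illustrative only (it uses Python's built-in `sqlite3` and assumes a SQLite-compatible schema); production systems will need database-specific validation:

```python
import sqlite3

def is_plannable(generated_sql: str, schema_ddl: str) -> bool:
    """Illustrative guard: True if the generated query parses and plans against the given schema."""
    conn = sqlite3.connect(":memory:")  # Throwaway in-memory database
    try:
        conn.executescript(schema_ddl)                       # Recreate the schema only, no real data
        conn.execute(f"EXPLAIN QUERY PLAN {generated_sql}")  # Plan the query without executing it
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

schema = "CREATE TABLE students (id INT PRIMARY KEY, name VARCHAR(100), age INT, grade CHAR(1));"
print(is_plannable("SELECT name FROM students WHERE age = 15;", schema))  # True
print(is_plannable("SELECT nmae FROM students;", schema))                 # False (unknown column)
```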

## How to Get Started with the Model

To get started, load the model directly from Hugging Face (or clone the repository) and run the provided example code. A minimal quick-start is shown below; the full beam-search inference example appears later in this card, and the live demo is linked under Model Sources.
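
For a quick sanity check, the model can also be loaded through the `transformers` pipeline API (a minimal sketch; default generation settings may produce slightly different output than the beam-search example below):

```python
from transformers import pipeline

# Minimal quick-start; see the full beam-search example in the Inference section below.
generator = pipeline(
    "text2text-generation",
    model="aarohanverma/text2sql-flan-t5-base-qlora-finetuned",
)

prompt = (
    "Context:\n"
    "CREATE TABLE students (id INT PRIMARY KEY, name VARCHAR(100), age INT, grade CHAR(1));\n\n"
    "Query:\nRetrieve the names of students who are 15 years old.\n\n"
    "Response:\n"
)

print(generator(prompt, max_new_tokens=100)[0]["generated_text"])
```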

## Training Details

### Training Data

The model was fine-tuned on a concatenation of several publicly available text-to-SQL datasets:
1. **[Clinton/Text-to-SQL v1](https://huggingface.co/datasets/Clinton/Text-to-sql-v1)**
2. **[b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)**
3. **[gretelai/synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql)**
4. **[knowrohit07/know_sql](https://huggingface.co/datasets/knowrohit07/know_sql)**

**Data Split:**

| **Split**            | **Percentage**     | **Number of Samples**    |
|----------------------|--------------------|--------------------------|
| **Training**         | 85%                | **338,708**              |
| **Validation**       | 5%                 | **19,925**               |
| **Testing**          | 10%                | **39,848**               |
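
The 85/5/10 split above can be reproduced along these lines (an illustrative sketch; `combined` is assumed to already hold the cleaned, column-standardized union of the four datasets):

```python
from datasets import Dataset

def split_85_5_10(combined: Dataset, seed: int = 42):
    """Split into 85% train, 5% validation, 10% test (fractions as in the table above)."""
    first = combined.train_test_split(test_size=0.15, seed=seed)           # 85% train / 15% held out
    held_out = first["test"].train_test_split(test_size=2 / 3, seed=seed)  # 1/3 -> validation (5%), 2/3 -> test (10%)
    return first["train"], held_out["train"], held_out["test"]
```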


### Training Procedure

#### Preprocessing

The raw data was preprocessed as follows:
- **Cleaning:** Removal of extra whitespaces/newlines and standardization of columns (renaming to `query`, `context`, and `response`).
- **Filtering:** Dropping examples with missing values and duplicates; retaining only rows where the prompt is ≤ 500 tokens and the response is ≤ 250 tokens.
- **Tokenization:**

Prompts are constructed in the format:
```
Context:
{context}

Query:
{query}

Response:
```
and tokenized with a maximum length of 512 for inputs and 256 for responses using [google/flan-t5-base](https://huggingface.co/google/flan-t5-base)'s tokenizer.
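
A minimal sketch of this preprocessing step is shown below (illustrative only; the helper names are not from the original training script, but the prompt format and length limits match the description above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

def build_prompt(context: str, query: str) -> str:
    """Assemble the Context / Query / Response prompt used for training."""
    return f"Context:\n{context}\n\nQuery:\n{query}\n\nResponse:\n"

def preprocess(example: dict) -> dict:
    """Tokenize one cleaned example with keys 'context', 'query', and 'response'."""
    prompt = build_prompt(example["context"].strip(), example["query"].strip())
    model_inputs = tokenizer(prompt, max_length=512, truncation=True)
    labels = tokenizer(text_target=example["response"].strip(), max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```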

#### Training Hyperparameters

- **Epochs:** 6  
- **Batch Sizes:**  
  Training: 64 per device  
  Evaluation: 64 per device  
- **Gradient Accumulation:** 2 steps  
- **Learning Rate:** 2e-4  
- **Optimizer:** `adamw_bnb_8bit` (memory-efficient variant of AdamW)  
- **LR Scheduler:** Cosine scheduler with a warmup ratio of 10%  
- **Quantization:** 4-bit NF4 (with double quantization) using `torch.bfloat16`  
- **LoRA Parameters:**  
  **Rank (r):** 32  
  **Alpha:** 64  
  **Dropout:** 0.1  
  **Target Modules:** `["q", "v"]`  
- **Checkpointing:**  
  Model saved at the end of every epoch  
  Early stopping with a patience of 2 epochs based on evaluation loss  
- **Reproducibility:** Random seeds are set across Python, NumPy, and PyTorch (seed = 42)
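
For reference, the quantization and LoRA settings above correspond roughly to the following configuration (a hedged sketch using `bitsandbytes` and `peft`; the original training script may use slightly different arguments):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig, TaskType

# Illustrative configuration mirroring the hyperparameters listed above:
# 4-bit NF4 quantization with double quantization and bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on the attention query/value projections of Flan-T5.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q", "v"],
)
```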

## Evaluation

The model was evaluated on 39,848 test samples, representing 10% of the dataset.

### Metrics

Evaluation metrics used:
- **ROUGE:** Measures n-gram overlap between generated and reference SQL.
- **BLEU:** Assesses the quality of translation from natural language to SQL.
- **Fuzzy Match Score:** Uses token-set similarity to provide a soft match percentage.
- **Exact Match Accuracy:** Percentage of queries that exactly match the reference SQL.
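
For illustration, the fuzzy and exact match scores can be computed along these lines (a minimal sketch using `rapidfuzz`; the actual evaluation script may normalize queries differently):

```python
from rapidfuzz import fuzz

def exact_match(pred: str, ref: str) -> bool:
    """Case- and whitespace-insensitive exact match between generated and reference SQL."""
    return " ".join(pred.lower().split()) == " ".join(ref.lower().split())

def fuzzy_match(pred: str, ref: str) -> float:
    """Token-set similarity (0-100) between generated and reference SQL."""
    return fuzz.token_set_ratio(pred, ref)

print(exact_match("SELECT name FROM students;", "select name  FROM students;"))  # True after normalization
print(fuzzy_match("SELECT name FROM students WHERE age = 15;",
                  "SELECT name FROM students WHERE age = 15 ORDER BY name;"))    # High score despite the extra clause
```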

### Results

The table below summarizes the evaluation metrics comparing the original base model with the fine-tuned model:

| **Metric**                | **Original Model**            | **Fine-Tuned Model**    | **Comments**                                                                         |
|---------------------------|-------------------------------|-------------------------|--------------------------------------------------------------------------------------|
| **ROUGE-1**               | 0.05647                       | **0.75388**             | Over 13× increase; indicates much better content capture.                            |
| **ROUGE-2**               | 0.01563                       | **0.61039**             | Nearly 39× improvement; higher n-gram quality.                                       |
| **ROUGE-L**               | 0.05031                       | **0.72628**             | More than 14× increase; improved sequence similarity.                                |
| **BLEU Score**            | 0.00314                       | **0.47198**             | Approximately 150× increase; demonstrates significant fluency gains.                 |
| **Fuzzy Match Score**     | 13.98%                        | **85.62%**              | Substantial improvement; generated SQL aligns much closer with human responses.      |
| **Exact Match Accuracy**  | 0.00%                         | **18.29%**              | Non-zero accuracy achieved; critical for production-readiness.                       |

#### Summary

The fine-tuned model shows dramatic improvements across all evaluation metrics, demonstrating its effectiveness in generating accurate and relevant SQL queries from natural language inputs.

## 🔍 Inference & Example Usage

### Inference Code
Below is the recommended Python code for running inference on the fine-tuned model:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

# Ensure device is set (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned model and tokenizer
model_name = "aarohanverma/text2sql-flan-t5-base-qlora-finetuned"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Ensure decoder start token is set
if model.config.decoder_start_token_id is None:
    model.config.decoder_start_token_id = tokenizer.pad_token_id

def run_inference(prompt_text: str) -> str:
    """
    Runs inference on the fine-tuned model using beam search with fixes for repetition.
    """
    inputs = tokenizer(prompt_text, return_tensors="pt", truncation=True, max_length=512).to(device)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        decoder_start_token_id=model.config.decoder_start_token_id,
        max_new_tokens=100,        # SQL statements here are short; cap generation length
        num_beams=5,               # Deterministic beam search (no sampling, so no temperature is needed)
        repetition_penalty=1.2,    # Discourage repeated tokens/clauses
        early_stopping=True,
    )

    generated_sql = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

    generated_sql = generated_sql.split(";")[0] + ";"  # Keep only the first valid SQL query

    return generated_sql

# Example usage:
context = (
    "CREATE TABLE students (id INT PRIMARY KEY, name VARCHAR(100), age INT, grade CHAR(1)); "
    "INSERT INTO students (id, name, age, grade) VALUES "
    "(1, 'Alice', 14, 'A'), (2, 'Bob', 15, 'B'), "
    "(3, 'Charlie', 14, 'A'), (4, 'David', 16, 'C'), (5, 'Eve', 15, 'B');"
)

query = "Retrieve the names of students who are 15 years old."


# Construct the prompt
sample_prompt = f"""Context:
{context}

Query:
{query}

Response:
"""

logger.info("Running inference with beam search decoding.")
generated_sql = run_inference(sample_prompt)

print("Prompt:")
print("Context:")
print(context)
print("\nQuery:")
print(query)
print("\nResponse:")
print(generated_sql)

# EXPECTED RESPONSE: SELECT name FROM students WHERE age = 15; 
```

## Citation 

**BibTeX:**

```bibtex
@misc {aarohan_verma_2025,
	author       = { {Aarohan Verma} },
	title        = { text2sql-flan-t5-base-qlora-finetuned },
	year         = 2025,
	url          = { https://huggingface.co/aarohanverma/text2sql-flan-t5-base-qlora-finetuned },
	doi          = { 10.57967/hf/4901 },
	publisher    = { Hugging Face }
}
```

## Model Card Contact

For inquiries or further information, please contact:

LinkedIn: https://www.linkedin.com/in/aarohanverma/

Email: [email protected]