aarohanverma committed · Commit fb8a3fb · verified · 1 parent: c7efa92

Added README.md

Files changed (1)
  1. README.md +300 -3
README.md CHANGED
@@ -1,3 +1,300 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - Clinton/Text-to-sql-v1
5
+ - b-mc2/sql-create-context
6
+ - gretelai/synthetic_text_to_sql
7
+ - knowrohit07/know_sql
8
+ metrics:
9
+ - rouge
10
+ - bleu
11
+ - fuzzy_match
12
+ - exact_match
13
+ base_model:
14
+ - google/flan-t5-base
15
+ pipeline_tag: text2text-generation
16
+ library_name: transformers
17
+ language:
18
+ - en
19
+ tags:
20
+ - text2sql
21
+ - transformers
22
+ - flan-t5
23
+ - seq2seq
24
+ - qlora
25
+ - peft
26
+ - fine-tuning
27
+ ---
28
+ # Model Card for text2sql-flan-t5-base-qlora-finetuned
29
+
30
+ <!-- Provide a quick summary of what the model is/does. -->
31
+
32
+ This model is a fine-tuned version of [Flan-T5 Base](https://huggingface.co/google/flan-t5-base) that converts natural language queries into SQL statements. It was adapted with **QLoRA (Quantized Low-Rank Adaptation)** through PEFT and trained on a concatenation of four public text-to-SQL datasets. A live Gradio demo is available, and the model can be loaded directly from Hugging Face for inference.
33
+
34
+ ## Model Details
35
+
36
+ ### Model Description
37
+
38
+ <!-- Provide a longer summary of what this model is. -->
39
+
40
+ This model generates SQL queries from a provided context (such as the database schema given as `CREATE TABLE` statements) and a natural language query.
41
+ It has been fine-tuned using QLoRA with 4-bit quantization and PEFT on a combination of four text-to-SQL datasets.
42
+ In the evaluation reported below, it improves on the original base model across every metric (for example, ROUGE-1 rises from 0.034 to 0.691), although exact-match accuracy is 16.39%, so generated queries should be reviewed before use.
43
+
44
+ - **Developed by:** Aarohan Verma
45
+ - **Model type:** Seq2Seq / Text-to-Text Generation (SQL Generation)
46
+ - **Language(s) (NLP):** English
47
+ - **License:** Apache-2.0
48
+ - **Finetuned from model:** [google/flan-t5-base](https://huggingface.co/google/flan-t5-base)
49
+
50
+
51
+ ### Model Sources
52
+
53
+ <!-- Provide the basic links for the model. -->
54
+
55
+ - **Repository:** [https://huggingface.co/aarohanverma/text2sql-flan-t5-base-qlora-finetuned](https://huggingface.co/aarohanverma/text2sql-flan-t5-base-qlora-finetuned)
56
+ - **Demo:** [Gradio Demo](https://huggingface.co/spaces/aarohanverma/text2sql-demo)
57
+
58
+ ## Uses
59
+
60
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
61
+
62
+ ### Direct Use
63
+
64
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
65
+
66
+ This model can be used directly for generating SQL queries from natural language inputs.
67
+ It is particularly useful for applications in database querying and natural language interfaces for relational databases.
68
+
69
+ ### Downstream Use
70
+
71
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
72
+
73
+ The model can be further integrated into applications such as chatbots, data analytics platforms, and business intelligence tools to automate query generation.
74
+
75
+ ### Out-of-Scope Use
76
+
77
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
78
+
79
+ This model is not designed for tasks outside text-to-SQL generation.
80
+ It may not perform well for non-SQL language generation or queries outside the domain of structured data retrieval.
81
+
82
+ ## Bias, Risks, and Limitations
83
+
84
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
85
+
86
+ - **Bias:** The model's performance is influenced by the quality and diversity of the training data. It may underperform on SQL queries that deviate significantly from the training examples.
87
+ - **Risks:** Inaccurate SQL generation may lead to unexpected query behavior, especially in safety-critical environments.
88
+ - **Limitations:** The model may not generalize to complex SQL tasks that require deep domain knowledge beyond the training data.
89
+
90
+ ### Recommendations
91
+
92
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
93
+
94
+ Users should validate the generated SQL queries before deployment in production systems.
95
+ Consider incorporating human-in-the-loop review for critical applications.
96
+
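One lightweight way to follow the recommendation above is to check that a generated query at least compiles against the target schema before it is run. The sketch below is an illustration only, not part of this model's tooling: the helper name `compiles_against_schema` is hypothetical, and it assumes the schema and query are compatible with SQLite's dialect. It uses Python's standard `sqlite3` module and `EXPLAIN`, which compiles a statement without executing it.

```python
import sqlite3


def compiles_against_schema(schema_sql: str, candidate_sql: str) -> bool:
    """Return True if candidate_sql compiles against schema_sql in SQLite.

    Hypothetical helper: this only catches syntax errors and references to
    unknown tables or columns. It does not check that the query answers
    the user's question.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Build an empty in-memory copy of the schema.
        conn.executescript(schema_sql)
        # EXPLAIN compiles the statement but does not run it.
        conn.execute("EXPLAIN " + candidate_sql)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


schema = (
    "CREATE TABLE customers (id INT PRIMARY KEY, name VARCHAR(100), country VARCHAR(50)); "
    "CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, total_amount DECIMAL(10,2), "
    "order_date DATE, FOREIGN KEY (customer_id) REFERENCES customers(id));"
)
good = (
    "SELECT customers.name, SUM(orders.total_amount) AS total_amount FROM customers "
    "INNER JOIN orders ON customers.id = orders.customer_id "
    "WHERE customers.country = 'USA' GROUP BY customers.name ORDER BY total_amount DESC;"
)
bad = "SELECT missing_column FROM customers;"

print(compiles_against_schema(schema, good))
print(compiles_against_schema(schema, bad))
```

A check like this complements human review rather than replacing it, since a query can compile and still return the wrong rows.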
97
+ ## How to Get Started with the Model
98
+
99
+ To get started, clone the repository or download the model from Hugging Face, then run the example code in the "Inference & Example Usage" section of this model card.
100
+ A live demo is linked under "Model Sources".
101
+
102
+ ## Training Details
103
+
104
+ ### Training Data
105
+
106
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
107
+
108
+ The model was fine-tuned on a concatenation of several publicly available text-to-SQL datasets:
109
+ 1. **[Clinton/Text-to-SQL v1](https://huggingface.co/datasets/Clinton/Text-to-sql-v1)**
110
+ 2. **[b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)**
111
+ 3. **[gretelai/synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql)**
112
+ 4. **[knowrohit07/know_sql](https://huggingface.co/datasets/knowrohit07/know_sql)**
113
+
114
+ **Data Split:**
115
+ - **Training:** 85%
116
+ - **Validation:** 5%
117
+ - **Testing:** 10%
118
+
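The 85/5/10 split can be sketched in plain Python as follows. This is an illustrative sketch, not the author's code: the function name `split_dataset` is hypothetical, and reusing seed 42 here is an assumption borrowed from the reproducibility note in the hyperparameters.

```python
import random


def split_dataset(rows, train_frac=0.85, val_frac=0.05, seed=42):
    """Shuffle rows deterministically and split into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, about 10%
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```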
119
+ ### Training Procedure
120
+
121
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
122
+
123
+ #### Preprocessing
124
+
125
+ The raw data was preprocessed as follows:
126
+ - **Cleaning:** Removal of extra whitespace and newlines, and standardization of column names (renamed to `query`, `context`, and `response`).
127
+ - **Filtering:** Dropping examples with missing values and duplicates; retaining only rows where the prompt is ≤ 500 tokens and the response is ≤ 250 tokens.
128
+ - **Tokenization:**
129
+
130
+ Prompts are constructed in the format:
131
+ ```
132
+ Context:
133
+ {context}
134
+
135
+ Query:
136
+ {query}
137
+
138
+ Response:
139
+ ```
140
+ and tokenized with a maximum length of 512 for inputs and 256 for responses using [google/flan-t5-base](https://huggingface.co/google/flan-t5-base)'s tokenizer.
141
+
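The cleaning and prompt-construction steps above can be sketched as a pair of small helpers. The names `normalize_whitespace` and `build_prompt` are hypothetical, and collapsing all whitespace runs into single spaces is an assumption about what "removal of extra whitespace" means here. The token-length filter (≤ 500 and ≤ 250 tokens) is omitted because it depends on the Flan-T5 tokenizer.

```python
import re

# Same layout as the prompt format shown above.
PROMPT_TEMPLATE = "Context:\n{context}\n\nQuery:\n{query}\n\nResponse:\n"


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def build_prompt(context: str, query: str) -> str:
    """Fill the prompt template with a cleaned context and query."""
    return PROMPT_TEMPLATE.format(
        context=normalize_whitespace(context),
        query=normalize_whitespace(query),
    )


prompt = build_prompt("CREATE TABLE t (id INT);\n", "  How many   rows are in t? ")
print(prompt)
```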
142
+ #### Training Hyperparameters
143
+
144
+ - **Epochs:** 6
145
+ - **Batch Sizes:**
146
+   - Training: 64 per device
147
+   - Evaluation: 64 per device
148
+ - **Gradient Accumulation:** 2 steps
149
+ - **Learning Rate:** 2e-4
150
+ - **Optimizer:** `adamw_bnb_8bit` (8-bit AdamW from bitsandbytes, which reduces optimizer-state memory)
151
+ - **LR Scheduler:** Cosine scheduler with a warmup ratio of 10%
152
+ - **Quantization:** 4-bit NF4 (with double quantization) using `torch.bfloat16`
153
+ - **LoRA Parameters:**
154
+   - **Rank (r):** 32
155
+   - **Alpha:** 64
156
+   - **Dropout:** 0.1
157
+   - **Target Modules:** `["q", "v"]`
158
+ - **Checkpointing:**
159
+   - Model saved at the end of every epoch
160
+   - Early stopping with a patience of 2 epochs, based on evaluation loss
161
+ - **Reproducibility:** Random seeds are set across Python, NumPy, and PyTorch (seed = 42)
162
+
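A few quantities follow directly from the hyperparameters above. The sketch below works them out in plain Python: the effective batch size per device, the standard LoRA scaling factor `alpha / r`, and the shape of a cosine schedule with linear warmup. The function `lr_at_step` and the total of 1000 optimizer steps are hypothetical and chosen only for illustration; the actual step count depends on dataset size and device count, which this card does not state.

```python
import math

# Values taken from the hyperparameter list above.
PER_DEVICE_BATCH = 64
GRAD_ACCUM_STEPS = 2
PEAK_LR = 2e-4
WARMUP_RATIO = 0.10
LORA_R, LORA_ALPHA = 32, 64

# Sequences contributing to each optimizer step, per device.
effective_batch_per_device = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # 128

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = LORA_ALPHA / LORA_R  # 2.0


def lr_at_step(step: int, total_steps: int) -> float:
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# total_steps=1000 is an assumed value for illustration only.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```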
163
+ ## Evaluation
164
+
165
+ <!-- This section describes the evaluation protocols and provides the results. -->
166
+
167
+ #### Metrics
168
+
169
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
170
+
171
+ Evaluation metrics used:
172
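+ - **ROUGE:** Measures n-gram overlap (ROUGE-1, ROUGE-2) and longest-common-subsequence overlap (ROUGE-L) between generated and reference SQL.
173
+ - **BLEU:** Measures modified n-gram precision of the generated SQL against the reference SQL, with a brevity penalty for short outputs.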
174
+ - **Fuzzy Match Score:** Uses token-set similarity to provide a soft match percentage.
175
+ - **Exact Match Accuracy:** Percentage of queries that exactly match the reference SQL.
176
+
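To make the last two metrics concrete, here is a minimal sketch of an exact-match check and a token-set similarity score. It is not the evaluation code behind the reported numbers: the normalization (lowercasing, whitespace collapsing, dropping a trailing semicolon) and the use of Jaccard similarity as the token-set score are both assumptions, and the function names are hypothetical.

```python
def normalize_sql(sql: str) -> str:
    """Lowercase, drop a trailing semicolon, and collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize_sql(prediction) == normalize_sql(reference)


def token_set_similarity(prediction: str, reference: str) -> float:
    """Jaccard similarity of the two token sets, as a percentage."""
    pred_tokens = set(normalize_sql(prediction).split())
    ref_tokens = set(normalize_sql(reference).split())
    if not pred_tokens and not ref_tokens:
        return 100.0
    return 100.0 * len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)


print(exact_match("SELECT * FROM t;", "select *  from t"))
print(token_set_similarity("SELECT a FROM t", "SELECT b FROM t"))
```

Exact match is strict by design: two queries that return the same rows but differ in aliasing or clause order still count as a miss, which is one reason a soft score is reported alongside it.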
177
+ ### Results
178
+
179
+ The table below summarizes the evaluation metrics comparing the original base model with the fine-tuned model:
180
+
181
+ | **Metric** | **Original Model** | **Fine-Tuned Model** | **Improvement Commentary** |
182
+ |---------------------------|-------------------------------|-------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------|
183
+ | **ROUGE-1** | 0.03369 | **0.69143** | Over 20× increase; indicates much better content capture. |
184
+ | **ROUGE-2** | 0.00817 | **0.54533** | Nearly 67× improvement; higher n-gram quality. |
185
+ | **ROUGE-L** | 0.03056 | **0.66429** | More than 21× increase; improved sequence similarity. |
186
+ | **BLEU Score** | 0.00367 | **0.31698** | Approximately 86× increase; much higher n-gram precision against the reference SQL. |
187
+ | **Fuzzy Match Score** | 11.31% | **81.98%** | Gain of about 70.7 percentage points; generated SQL aligns much more closely with the reference queries. |
188
+ | **Exact Match Accuracy** | 0.00% | **16.39%** | Up from zero; roughly one in six outputs matches the reference exactly, so outputs still need validation. |
189
+
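The multipliers quoted in the table can be checked with a few lines of arithmetic. The snippet below simply divides each fine-tuned score by the corresponding base score, using the values from the table; the dictionary name `results` is only for illustration.

```python
# (original model, fine-tuned model) scores copied from the table above.
results = {
    "ROUGE-1": (0.03369, 0.69143),
    "ROUGE-2": (0.00817, 0.54533),
    "ROUGE-L": (0.03056, 0.66429),
    "BLEU": (0.00367, 0.31698),
}

# Relative improvement of the fine-tuned model over the base model.
ratios = {name: tuned / base for name, (base, tuned) in results.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```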
190
+
191
+ #### Summary
192
+
193
+ The fine-tuned model improves on the base model across all reported metrics, with the largest relative gains in BLEU and ROUGE-2. Exact-match accuracy is 16.39%, so generated SQL should still be validated before use.
194
+
195
+ ## 🔍 Inference & Example Usage
196
+
197
+ ### Inference Code
198
+ Below is the recommended Python code for running inference on the fine-tuned model:
199
+
200
+ ```python
201
+ import torch
202
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
203
+ import logging
204
+
205
+ # Set up logging
206
+ logging.basicConfig(
207
+     level=logging.INFO,
208
+     format="%(asctime)s - %(levelname)s - %(message)s",
209
+ )
210
+ logger = logging.getLogger(__name__)
211
+
212
+ # Set device (GPU if available)
213
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
214
+
215
+ # Load the fine-tuned model and tokenizer
216
+ model_name = "aarohanverma/text2sql-flan-t5-base-qlora-finetuned"
217
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
218
+ tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
219
+
220
+ def run_inference(prompt_text: str) -> str:
221
+     """
222
+     Runs inference using deterministic beam search decoding.
223
+     """
224
+     inputs = tokenizer(prompt_text, return_tensors="pt").to(device)
225
+     generated_ids = model.generate(
226
+         **inputs,  # passes input_ids together with the attention mask
227
+         max_new_tokens=250,
228
+         do_sample=False,  # sampling off; temperature only applies when sampling
229
+         num_beams=3,
230
+         early_stopping=True,
231
+     )
232
+     return tokenizer.decode(generated_ids[0], skip_special_tokens=True)
233
+
234
+ # Example usage:
235
+ context = (
236
+ "CREATE TABLE customers (id INT PRIMARY KEY, name VARCHAR(100), country VARCHAR(50)); "
237
+ "CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id INT, total_amount DECIMAL(10,2), "
238
+ "order_date DATE, FOREIGN KEY (customer_id) REFERENCES customers(id)); "
239
+ "INSERT INTO customers (id, name, country) VALUES (1, 'Alice', 'USA'), (2, 'Bob', 'UK'), "
240
+ "(3, 'Charlie', 'Canada'), (4, 'David', 'USA'); "
241
+ "INSERT INTO orders (order_id, customer_id, total_amount, order_date) VALUES "
242
+ "(101, 1, 500, '2024-01-15'), (102, 2, 300, '2024-01-20'), "
243
+ "(103, 1, 700, '2024-02-10'), (104, 3, 450, '2024-02-15'), "
244
+ "(105, 4, 900, '2024-03-05');"
245
+ )
246
+ query = (
247
+ "Retrieve the total order amount for each customer, showing only customers from the USA, "
248
+ "and sort the result by total order amount in descending order."
249
+ )
250
+
251
+ # Construct the prompt
252
+ sample_prompt = f"""Context:
253
+ {context}
254
+
255
+ Query:
256
+ {query}
257
+
258
+ Response:
259
+ """
260
+
261
+ logger.info("Running inference with beam search decoding.")
262
+ generated_sql = run_inference(sample_prompt)
263
+
264
+ print("Prompt:")
265
+ print("Context:")
266
+ print(context)
267
+ print("\nQuery:")
268
+ print(query)
269
+ print("\nResponse:")
270
+ print(generated_sql)
271
+
272
+ # Expected Output:
273
+ # SELECT customers.name, SUM(orders.total_amount) as total_amount FROM customers
274
+ # INNER JOIN orders ON customers.id = orders.customer_id
275
+ # WHERE customers.country = 'USA'
276
+ # GROUP BY customers.name
277
+ # ORDER BY total_amount DESC;
278
+ ```
279
+
280
+ ## Citation
281
+
282
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
283
+
284
+ **BibTeX:**
285
+
286
+ ```bibtex
287
+ @misc{aarohanverma_text2sql_2025,
288
+ title={Text-to-SQL Fine-Tuned Model (Flan-T5 Base)},
289
+ author={Aarohan Verma},
290
+ year={2025},
291
+ url={https://huggingface.co/aarohanverma/text2sql-flan-t5-base-qlora-finetuned}
292
+ }
293
+ ```
294
+
295
+ ## Model Card Contact
296
+
297
+ For inquiries or further information, please contact:
298
+
299
+ LinkedIn: https://www.linkedin.com/in/aarohanverma/
300