---
tags:
- text2sql
- transformers
- natural-language-to-sql
- t5
- spider-dataset
license: apache-2.0
---
# Model Card for Fine-Tuned T5 for Text-to-SQL
## Model Details
### Model Description
This is a fine-tuned T5-small model for generating SQL queries from natural language. It was trained on the [Spider dataset](https://huggingface.co/datasets/spider), a benchmark dataset for text-to-SQL tasks.
- **Developed by:** OSLLM.ai
- **Shared by:** OSLLM.ai
- **Model type:** Text-to-SQL (Sequence-to-Sequence)
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from:** [t5-small](https://huggingface.co/t5-small)
## Uses
### Direct Use
This model can be used to generate SQL queries from natural language questions. It is particularly useful for developers building natural language interfaces to databases.
### Downstream Use
The model can be fine-tuned further on domain-specific question-SQL pairs for improved performance (the sketch under Training Procedure below shows a comparable setup).
### Out-of-Scope Use
This model is not suitable for generating SQL queries for databases with highly specialized schemas or non-standard SQL dialects.
## Bias, Risks, and Limitations
The model may generate incorrect or unsafe SQL queries if the input question is ambiguous or outside the scope of the training data. Always validate the generated SQL before executing it on a production database.
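One lightweight guard is to dry-run candidate queries against a read-only connection before anything touches a production database. The sketch below assumes a SQLite database; the helper names and `db_path` are illustrative, not part of this model:

```python
import sqlite3

def is_single_select(sql: str) -> bool:
    # Reject anything that is not a single SELECT statement.
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped

def validate_sql(sql: str, db_path: str) -> bool:
    # Dry-run with EXPLAIN QUERY PLAN on a read-only connection,
    # so the query is parsed and planned but never executed.
    if not is_single_select(sql):
        return False
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```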
## How to Get Started with the Model
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the fine-tuned model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("osllmai/text-to-sql")
tokenizer = T5Tokenizer.from_pretrained("osllmai/text-to-sql")

# Generate a SQL query from a natural language question
def generate_sql_query(question):
    # Use the same task prefix the model was trained with
    input_text = f"translate English to SQL: {question}"
    # Pass the attention mask along with input_ids so padding is handled correctly
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
question = "Find all the customers who live in New York."
sql_query = generate_sql_query(question)
print(sql_query)
```
## Training Details
### Training Data
The model was trained on the [Spider dataset](https://huggingface.co/datasets/spider), which contains 10,181 questions and 5,693 unique complex SQL queries across 200 databases.
### Training Procedure
- **Preprocessing:** Questions were prefixed with "translate English to SQL:" and tokenized using the T5 tokenizer.
- **Training Hyperparameters:**
- Learning Rate: 2e-5
- Batch Size: 8
- Epochs: 3
- Mixed Precision: FP16
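
The exact training script is not published; the following is a minimal sketch of a run consistent with the listed hyperparameters, using the `transformers` `Seq2SeqTrainer` (the `output_dir` is illustrative):

```python
from datasets import load_dataset
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, T5ForConditionalGeneration,
                          T5Tokenizer)

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
spider = load_dataset("spider")

def preprocess(batch):
    # Prefix each question with the task instruction; the gold SQL becomes the label.
    inputs = ["translate English to SQL: " + q for q in batch["question"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["query"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = spider.map(preprocess, batched=True,
                       remove_columns=spider["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t5-text-to-sql",   # illustrative
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    fp16=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```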
## Evaluation
The model was evaluated on the Spider validation set. Exact match accuracy and execution accuracy are the standard metrics for text-to-SQL performance; a rough exact-match check is sketched below.
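
As a rough reference, exact match can be approximated with a normalized string comparison. Note that the official Spider evaluation performs component-wise matching and is stricter; `generate_sql_query` is the helper from the snippet above:

```python
from datasets import load_dataset

def normalize(sql: str) -> str:
    # Lowercase, drop semicolons, collapse whitespace.
    return " ".join(sql.lower().replace(";", " ").split())

validation = load_dataset("spider", split="validation")
matches = sum(
    normalize(generate_sql_query(ex["question"])) == normalize(ex["query"])
    for ex in validation
)
print(f"Exact match: {matches / len(validation):.2%}")
```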
## Environmental Impact
- **Hardware:** 1x NVIDIA T4 GPU (Google Colab)
- **Hours Used:** ~3 hours
- **Carbon Emitted:** Not measured; it can be estimated with the [ML CO2 Impact Calculator](https://mlco2.github.io/impact).
## Model Card Authors
Fateme Ghasemi