---
frameworks:
- Pytorch
license: Apache License 2.0
tasks:
- text-generation

#model-type:
## e.g. gpt, phi, llama, chatglm, baichuan
#- gpt

#domain:
## e.g. nlp, cv, audio, multi-modal
#- nlp

#language:
## list of language codes: https://help.aliyun.com/document_detail/215387.html?spm=a2c4g.11186623.0.0.9f8d7467kni6Aa
#- cn 

#metrics:
## e.g. CIDEr, BLEU, ROUGE
#- CIDEr

#tags:
## custom tags, including training approaches such as pretrained, fine-tuned, instruction-tuned, RL-tuned, and others
#- pretrained

#tools:
## e.g. vllm, fastchat, llamacpp, AdaSeq
#- vllm
---
### Important Links

🤖[Github](https://github.com/XGenerationLab/XiYanSQL-QwenCoder) |
📖[XiYan-SQL](https://github.com/XGenerationLab/XiYan-SQL) |
🌕[析言GBI](https://bailian.console.aliyun.com/xiyan) |
🤗[Modelscope Space](https://www.modelscope.cn/studios/XGenerationLab/XiYanSQL-QwenCoder-32B)


## Introduction
We are excited to open-source the XiYanSQL-QwenCoder series of models, dedicated to advancing LLMs in the text-to-SQL domain. XiYanSQL-QwenCoder currently covers four mainstream model sizes, 3B, 7B, 14B, and 32B parameters, to meet the needs of different developers.
- XiYanSQL-QwenCoder demonstrates strong performance in SQL generation: XiYanSQL-QwenCoder-32B achieves a 69.03% EX (execution accuracy) score on the BIRD test set, a new SOTA for a single fine-tuned model, and the other models in the series also lead at their respective sizes.
- The XiYanSQL-QwenCoder models support multiple SQL dialects, such as SQLite, PostgreSQL, and MySQL.
- They can be used directly for text-to-SQL tasks, or serve as a stronger starting point for fine-tuning SQL models.


## Model Downloads


| **Model** | **Download Latest** |
|-----------|------------------|
|XiYanSQL-QwenCoder-3B  |[🤗 Modelscope](https://www.modelscope.cn/models/XGenerationLab/XiYanSQL-QwenCoder-3B-2502)|
|XiYanSQL-QwenCoder-7B  |[🤗 Modelscope](https://www.modelscope.cn/models/XGenerationLab/XiYanSQL-QwenCoder-7B-2502)|
|XiYanSQL-QwenCoder-14B |[🤗 Modelscope](https://www.modelscope.cn/models/XGenerationLab/XiYanSQL-QwenCoder-14B-2502)|
|XiYanSQL-QwenCoder-32B |[🤗 Modelscope](https://www.modelscope.cn/models/XGenerationLab/XiYanSQL-QwenCoder-32B-2412)|
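
All of the checkpoints above are hosted on ModelScope. If you want to pre-fetch the weights instead of downloading them at load time, ModelScope's `snapshot_download` can cache a checkpoint locally. A minimal sketch, using the 32B model ID from the table above:

```python
# Minimal sketch: pre-download a checkpoint from ModelScope.
from modelscope import snapshot_download

# Model ID as listed in the table above; swap in another size as needed.
model_dir = snapshot_download("XGenerationLab/XiYanSQL-QwenCoder-32B-2412")
print(model_dir)  # local directory where the weights are cached
```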



## Performance
The XiYanSQL-QwenCoder models, as multi-dialect SQL base models, demonstrate robust SQL generation capabilities. The following presents the evaluation results at the time of release. We conducted a comprehensive evaluation of the models' performance under two schema formats, M-Schema and original DDL, using the BIRD and Spider benchmarks in the text-to-SQL domain.

| Model name|BIRD Dev@M-Schema |BIRD Dev@DDL|Spider Test@M-Schema|Spider Test@DDL|
|-----------|:------------------:|:---------------:|:-------------------:|:---------------:|
|Codellama-34b              | 33.05%     | -          | 67.74%      | -           |
|Deepseek-coder-33b         | 47.52%     | 44.72%     | 72.39%      | -           |
|TableGPT2                  | 46.35%     | 47.07%     | 74.76%      | 77.28%      |
|Codestral 22b              | 50.52%     | 47.00%     | 78.45%      | 75.47%      |
|GLM-4-plus                 | 54.37%     | -          | 79.40%      | -           |
|Claude35_sonnet-1022       | 53.32%     | 50.46%     | 76.27%      | 73.04%      |
|Deepseek(v2.5-1210)        | 55.74%     | 55.61%     | 82.08%      | 80.57%      |
|Gemini-1.5-pro             | 61.34%     | 57.89%     | 85.11%      | 84.00%      |
|GPT-4o-0806                | 58.47%     | 54.82%     | 82.89%      | 78.45%      |
|XiYanSQL-QwenCoder-3B      | 54.11%     | 53.19%     | 82.69%      | 78.85%      |
|XiYanSQL-QwenCoder-7B      | 59.78%     | 56.58%     | 84.86%      | 80.31%      |
|XiYanSQL-QwenCoder-14B     | 63.10%     | 60.37%     | 85.76%      | 82.79%      |
|XiYanSQL-QwenCoder-32B     | 67.01%     | 63.04%     | 88.39%      | 85.46%      |
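
For context, the EX (execution accuracy) numbers above count a prediction as correct when executing it returns the same result set as the gold SQL. A minimal, unofficial sketch of that check for SQLite follows; the official BIRD/Spider evaluators handle many more edge cases:

```python
import sqlite3

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Return True if pred_sql and gold_sql yield the same rows (order-insensitive)."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    finally:
        conn.close()
    # Compare as multisets, since row order is usually unspecified.
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
```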


## Requirements

`transformers >= 4.37.0`

## Quickstart

Here is a simple code snippet showing how to use the **XiYanSQL-QwenCoder** model. We provide a Chinese version of the prompt; you only need to fill in the placeholders for `question`, `db_schema`, and `evidence` to get started. We recommend our [M-Schema](https://github.com/XGenerationLab/M-Schema) format for the schema; other formats such as DDL are also accepted but may affect performance.
Currently, we mainly support mainstream dialects such as SQLite, PostgreSQL, and MySQL. An illustrative way of filling the placeholders follows the snippet below.

````python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Official Chinese prompt template (kept verbatim, since the model is trained on it).
# Rough translation: "You are a {dialect} expert. Read and understand the
# [database schema] description below, along with any [reference information] that
# may be used, and apply your {dialect} knowledge to generate a SQL statement that
# answers the [user question]." The template deliberately repeats the user question
# and ends with an open ```sql fence so the model completes with a SQL statement.
nl2sqlite_template_cn = """你是一名{dialect}专家,现在需要阅读并理解下面的【数据库schema】描述,以及可能用到的【参考信息】,并运用{dialect}知识生成sql语句回答【用户问题】。
【用户问题】
{question}

【数据库schema】
{db_schema}

【参考信息】
{evidence}

【用户问题】
{question}

```sql"""

model_name = "XGenerationLab/XiYanSQL-QwenCoder-32B-2412"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Supported dialects: 'SQLite', 'PostgreSQL', 'MySQL'.
# Fill in the placeholders with your dialect, schema, question, and evidence.
prompt = nl2sqlite_template_cn.format(dialect="", db_schema="", question="", evidence="")
message = [{'role': 'user', 'content': prompt}]

text = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=1024,
    temperature=0.1,
    top_p=0.8,
    do_sample=True,
)
# Strip the prompt tokens, keeping only the newly generated completion.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
````
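
To make the placeholders concrete, here is a hypothetical continuation of the snippet above. The `db_schema` string below only gestures at the M-Schema layout and is not authoritative; consult the [M-Schema](https://github.com/XGenerationLab/M-Schema) repository for the exact format.

```python
# Hypothetical example values for the placeholders (illustrative only).
# The schema string merely approximates M-Schema; see the M-Schema repo
# for the authoritative format.
db_schema = """【DB_ID】 school
# Table: students
[
(id:INTEGER, Primary Key),
(name:TEXT),
(score:REAL)
]"""

prompt = nl2sqlite_template_cn.format(
    dialect="SQLite",
    db_schema=db_schema,
    question="List the names of students with a score above 90.",
    evidence="",
)

# The template ends with an open ```sql fence, so the completion is the SQL
# itself, sometimes followed by a closing fence; trim it before executing.
def extract_sql(response: str) -> str:
    return response.split("```")[0].strip()
```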


## Acknowledgments
If you find our work useful, please cite it or give it a like, so that we can contribute even more to the open-source community!