princepride committed on
Commit b5f4828 · verified · 1 Parent(s): 6303ac3

Upload 4 files

Files changed (4)
  1. README.md +12 -199
  2. model.py +416 -0
  3. pinyin.txt +408 -0
  4. support_language.json +210 -0
README.md CHANGED
@@ -1,199 +1,12 @@
- ---
- library_name: transformers
- tags: []
- ---
-
- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
-
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
+ # Qwen2-7B-Instruct-Full-Finetune
+ Qwen2-7B-Instruct-Full-Finetune is a cutting-edge model designed for multilingual instruction-following tasks. Developed to bridge gaps in multilingual conversational AI, Qwen2-7B focuses on understanding and generating contextually accurate responses across diverse languages and dialects. This model empowers developers, researchers, and businesses to integrate advanced language processing capabilities into their applications, offering natural and fluent interactions across a global spectrum of languages.
+
+ ## Introduction
+ Qwen2-7B-Instruct-Full-Finetune is an initiative to expand conversational AI's capabilities in a way that ensures inclusivity, flexibility, and ease of integration. Built with a deep focus on multilingual understanding, it leverages large-scale fine-tuning and instruction-based training techniques to generate nuanced and context-aware responses. This project aims to support language diversity and inclusivity, making it accessible for use cases ranging from customer support and education to content creation and translation.
+
+ ## Features
+ - Multilingual Understanding: Qwen2-7B-Instruct-Full-Finetune supports a wide range of languages, including those often underrepresented in standard AI models, providing accurate and culturally aware responses across diverse linguistic contexts.
+ - Contextual and Instruction-Based Responses: Trained to follow complex instructions and maintain conversational flow, it delivers responses that are contextually relevant and engaging, making it ideal for a variety of instructional tasks.
+ - Scalability and Integration: Designed with integration in mind, Qwen2-7B can be easily deployed in applications like chatbots, virtual assistants, and customer support systems, extending the potential for multilingual interaction in various digital products.
+ - Open Source and Community Driven: Aligned with the principles of open science, Qwen2-7B-Instruct-Full-Finetune is available for the developer community to use, adapt, and enhance, fostering collaboration and furthering innovation in conversational AI.
+ This README is a guide for those looking to leverage Qwen2-7B-Instruct-Full-Finetune to build inclusive, multilingual, and instruction-following AI applications.
model.py ADDED
@@ -0,0 +1,416 @@
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from modules.file import ExcelFileWriter
import os

from abc import ABC, abstractmethod
from typing import List
import re

class FilterPipeline():
    def __init__(self, filter_list):
        self._filter_list: List[Filter] = filter_list

    def append(self, filter):
        self._filter_list.append(filter)

    def batch_encoder(self, inputs):
        for filter in self._filter_list:
            inputs = filter.encoder(inputs)
        return inputs

    def batch_decoder(self, inputs):
        for filter in reversed(self._filter_list):
            inputs = filter.decoder(inputs)
        return inputs

class Filter(ABC):
    # Abstract base class defining the filter interface
    def __init__(self):
        self.name = 'filter'  # name of the filter
        self.code = []        # stores the filtered/encoded information
    @abstractmethod
    def encoder(self, inputs):
        # Abstract method: encode or filter the inputs
        pass

    @abstractmethod
    def decoder(self, inputs):
        # Abstract method: decode or restore the inputs
        pass

class SpecialTokenFilter(Filter):
    # Special-token filter: removes strings that consist only of special characters
    def __init__(self):
        self.name = 'special token filter'
        self.code = []
        self.special_tokens = ['!', '!', '-']  # set of special characters

    def encoder(self, inputs):
        # Filter out strings made up entirely of special characters
        filtered_inputs = []
        self.code = []
        for i, input_str in enumerate(inputs):
            if not all(char in self.special_tokens for char in input_str):
                filtered_inputs.append(input_str)
            else:
                self.code.append([i, input_str])  # remember the position and content of the removed string
        return filtered_inputs

    def decoder(self, inputs):
        # Re-insert the removed special-character strings at their original positions
        original_inputs = inputs.copy()
        for removed_indice in self.code:
            original_inputs.insert(removed_indice[0], removed_indice[1])
        return original_inputs

class SperSignFilter(Filter):
    # Percent-sign filter: handles strings containing '%s'
    def __init__(self):
        self.name = 's percentage sign filter'
        self.code = []

    def encoder(self, inputs):
        # Replace '%s' with '*'
        encoded_inputs = []
        self.code = []
        for i, input_str in enumerate(inputs):
            if '%s' in input_str:
                encoded_str = input_str.replace('%s', '*')
                self.code.append(i)  # remember which strings contained '%s'
            else:
                encoded_str = input_str
            encoded_inputs.append(encoded_str)
        return encoded_inputs

    def decoder(self, inputs):
        # Restore '*' back to '%s'
        decoded_inputs = inputs.copy()
        for i in self.code:
            decoded_inputs[i] = decoded_inputs[i].replace('*', '%s')
        return decoded_inputs

class ParenSParenFilter(Filter):
    # '(s)' filter: handles strings containing '(s)'
    def __init__(self):
        self.name = 'Paren s paren filter'
        self.code = []

    def encoder(self, inputs):
        # Replace '(s)' with '$'
        encoded_inputs = []
        self.code = []
        for i, input_str in enumerate(inputs):
            if '(s)' in input_str:
                encoded_str = input_str.replace('(s)', '$')
                self.code.append(i)  # remember which strings contained '(s)'
            else:
                encoded_str = input_str
            encoded_inputs.append(encoded_str)
        return encoded_inputs

    def decoder(self, inputs):
        # Restore '$' back to '(s)'
        decoded_inputs = inputs.copy()
        for i in self.code:
            decoded_inputs[i] = decoded_inputs[i].replace('$', '(s)')
        return decoded_inputs

class ChevronsFilter(Filter):
    # Chevrons filter: handles strings containing '<...>' markup
    def __init__(self):
        self.name = 'chevrons filter'
        self.code = []

    def encoder(self, inputs):
        # Replace the contents inside chevrons with '#'
        encoded_inputs = []
        self.code = []
        pattern = re.compile(r'<.*?>')
        for i, input_str in enumerate(inputs):
            if pattern.search(input_str):
                matches = pattern.findall(input_str)
                encoded_str = pattern.sub('#', input_str)
                self.code.append((i, matches))  # remember the position and matched contents
            else:
                encoded_str = input_str
            encoded_inputs.append(encoded_str)
        return encoded_inputs

    def decoder(self, inputs):
        # Restore each '#' back to the original chevron content
        decoded_inputs = inputs.copy()
        for i, matches in self.code:
            for match in matches:
                decoded_inputs[i] = decoded_inputs[i].replace('#', match, 1)
        return decoded_inputs

class SimilarFilter(Filter):
    # Similar-string filter: handles consecutive strings that differ only in digits
    def __init__(self):
        self.name = 'similar filter'
        self.code = []

    def is_similar(self, str1, str2):
        # Two strings are similar if they are identical once digits are removed
        pattern = re.compile(r'\d+')
        return pattern.sub('', str1) == pattern.sub('', str2)

    def encoder(self, inputs):
        # Detect runs of consecutive similar strings and record their index and contents
        encoded_inputs = []
        self.code = []
        i = 0
        while i < len(inputs):
            encoded_inputs.append(inputs[i])
            similar_strs = [inputs[i]]
            j = i + 1
            while j < len(inputs) and self.is_similar(inputs[i], inputs[j]):
                similar_strs.append(inputs[j])
                j += 1
            if len(similar_strs) > 1:
                self.code.append((i, similar_strs))
            i = j
        return encoded_inputs

    def decoder(self, inputs):
        # Re-insert the detected similar strings at their original positions
        decoded_inputs = inputs
        for i, similar_strs in self.code:
            pattern = re.compile(r'\d+')
            for j in range(len(similar_strs)):
                if pattern.search(similar_strs[j]):
                    number = re.findall(r'\d+', similar_strs[j])[0]
                    new_str = pattern.sub(number, inputs[i])
                else:
                    new_str = inputs[i]
                if j > 0:
                    decoded_inputs.insert(i + j, new_str)
        return decoded_inputs

class ChineseFilter:
    # Chinese pinyin filter: detects and removes words that are pinyin transliterations
    def __init__(self, pinyin_lib_file='pinyin.txt'):
        self.name = 'chinese filter'
        self.code = []
        self.pinyin_lib = self.load_pinyin_lib(pinyin_lib_file)  # load the pinyin library

    def load_pinyin_lib(self, file_path):
        # Load the pinyin library file into memory
        with open(os.path.join(script_dir, file_path), 'r', encoding='utf-8') as f:
            return set(line.strip().lower() for line in f)

    def is_valid_chinese(self, word):
        # A word qualifies if it is a single token whose first letter is uppercase
        if len(word.split()) == 1 and word[0].isupper():
            return self.is_pinyin(word.lower())
        return False

    def encoder(self, inputs):
        # Detect and remove words that look like pinyin
        encoded_inputs = []
        self.code = []
        for i, word in enumerate(inputs):
            if self.is_valid_chinese(word):
                self.code.append((i, word))  # remember the pinyin word and its index
            else:
                encoded_inputs.append(word)
        return encoded_inputs

    def decoder(self, inputs):
        # Re-insert the pinyin words at their original positions
        decoded_inputs = inputs.copy()
        for i, word in self.code:
            decoded_inputs.insert(i, word)
        return decoded_inputs

    def is_pinyin(self, string):
        # Check whether the string can be segmented entirely into pinyin syllables (or English-like chunks)
        string = string.lower()
        stringlen = len(string)
        max_len = 6
        result = []
        n = 0
        while n < stringlen:
            matched = 0
            temp_result = []
            for i in range(max_len, 0, -1):
                s = string[0:i]
                if s in self.pinyin_lib:
                    temp_result.append(string[:i])
                    matched = i
                    break
                if i == 1 and len(temp_result) == 0:
                    return False
            result.extend(temp_result)
            string = string[matched:]
            n += matched
        return True

# Directory of this script, used to locate the pinyin library file
script_dir = os.path.dirname(os.path.abspath(__file__))
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(script_dir)))


class Model():
    def __init__(self, modelname, selected_lora_model, selected_gpu):
        def get_gpu_index(gpu_info, target_gpu_name):
            """
            Get the index of the target GPU from the GPU info list.
            Args:
                gpu_info (list): list of GPU names
                target_gpu_name (str): name of the target GPU

            Returns:
                int: index of the target GPU, or -1 if not found
            """
            for i, name in enumerate(gpu_info):
                if target_gpu_name.lower() in name.lower():
                    return i
            return -1
        if selected_gpu != "cpu":
            gpu_count = torch.cuda.device_count()
            gpu_info = [torch.cuda.get_device_name(i) for i in range(gpu_count)]
            selected_gpu_index = get_gpu_index(gpu_info, selected_gpu)
            self.device_name = f"cuda:{selected_gpu_index}"
        else:
            self.device_name = "cpu"
        print("device_name", self.device_name)
        self.model = AutoModelForCausalLM.from_pretrained(modelname, torch_dtype="auto").to(self.device_name)
        self.tokenizer = AutoTokenizer.from_pretrained(modelname)
        # self.translator = pipeline('translation', model=self.original_model, tokenizer=self.tokenizer, src_lang=original_language, tgt_lang=target_language, device=device)

    def generate(self, inputs, original_language, target_languages, max_batch_size):
        filter_list = [SpecialTokenFilter(), ChevronsFilter(), SimilarFilter(), ChineseFilter()]
        filter_pipeline = FilterPipeline(filter_list)
        def process_gpu_translate_result(temp_outputs):
            outputs = []
            for temp_output in temp_outputs:
                length = len(temp_output[0]["generated_translation"])
                for i in range(length):
                    temp = []
                    for trans in temp_output:
                        temp.append({
                            "target_language": trans["target_language"],
                            "generated_translation": trans['generated_translation'][i],
                        })
                    outputs.append(temp)
            excel_writer = ExcelFileWriter()
            excel_writer.write_text(os.path.join(parent_dir, r"temp/empty.xlsx"), outputs, 'A', 1, len(outputs))
        if self.device_name == "cpu":
            # Tokenize input
            input_ids = self.tokenizer(inputs, return_tensors="pt", padding=True, max_length=128).to(self.device_name)
            output = []
            for target_language in target_languages:
                # Get language code for the target language
                # NOTE: language_mapping is not defined in this file; this branch assumes a
                # translation-style tokenizer that exposes lang_code_to_id.
                target_lang_code = self.tokenizer.lang_code_to_id[language_mapping(target_language)]
                # Generate translation
                generated_tokens = self.model.generate(
                    **input_ids,
                    forced_bos_token_id=target_lang_code,
                    max_length=128
                )
                generated_translation = self.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
                # Append result to output
                output.append({
                    "target_language": target_language,
                    "generated_translation": generated_translation,
                })
            outputs = []
            length = len(output[0]["generated_translation"])
            for i in range(length):
                temp = []
                for trans in output:
                    temp.append({
                        "target_language": trans["target_language"],
                        "generated_translation": trans['generated_translation'][i],
                    })
                outputs.append(temp)
            return outputs
        else:
            # Maximum batch size = available GPU memory in bytes / 4 / (tensor size + trainable parameters)
            # max_batch_size = 10
            # Ensure batch size is within model limits:
            print("length of inputs: ", len(inputs))
            batch_size = min(len(inputs), int(max_batch_size))
            batches = [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]
            print("length of batches size: ", len(batches))
            temp_outputs = []
            processed_num = 0
            for index, batch in enumerate(batches):
                # Tokenize input
                print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
                print(len(batch))
                print(batch)
                batch = filter_pipeline.batch_encoder(batch)
                print(batch)
                temp = []
                if len(batch) > 0:
                    for target_language in target_languages:
                        # System prompt (Chinese): "You are an ERP-system Chinese-to-English translation expert.
                        # Your task is to take markdown-formatted text, keep its formatting, translate it from
                        # {original_language} into {target_language}, and add nothing extra."
                        batch_messages = [[
                            {"role": "system", "content": f"你是一个ERP系统中译英专家,你任务是把markdown格式的文本,保留其格式并从{original_language}翻译成{target_language},不要添加多余的内容。"},
                            {"role": "user", "content": input},
                        ] for input in batch]
                        batch_texts = [self.tokenizer.apply_chat_template(
                            messages,
                            tokenize=False,
                            add_generation_prompt=True
                        ) for messages in batch_messages]
                        self.tokenizer.padding_side = "left"
                        model_inputs = self.tokenizer(
                            batch_texts,
                            return_tensors="pt",
                            padding="longest",
                            truncation=True,
                        ).to(self.device_name)
                        generated_ids = self.model.generate(
                            max_new_tokens=512,
                            **model_inputs
                        )
                        # Calculate the length of new tokens generated for each sequence
                        new_tokens = [
                            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
                        ]
                        generated_translation = self.tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
                        # Append result to output
                        temp.append({
                            "target_language": target_language,
                            "generated_translation": generated_translation,
                        })
                        model_inputs.to('cpu')
                        del model_inputs
                else:
                    for target_language in target_languages:
                        generated_translation = filter_pipeline.batch_decoder(batch)
                        print(generated_translation)
                        print(len(generated_translation))
                        # Append result to output
                        temp.append({
                            "target_language": target_language,
                            "generated_translation": generated_translation,
                        })
                temp_outputs.append(temp)
                processed_num += len(batch)
                if (index + 1) * max_batch_size // 1000 - index * max_batch_size // 1000 == 1:
                    print("Already processed number: ", len(temp_outputs))
                    process_gpu_translate_result(temp_outputs)
            outputs = []
            for temp_output in temp_outputs:
                length = len(temp_output[0]["generated_translation"])
                for i in range(length):
                    temp = []
                    for trans in temp_output:
                        temp.append({
                            "target_language": trans["target_language"],
                            "generated_translation": trans['generated_translation'][i],
                        })
                    outputs.append(temp)
            return outputs
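The FilterPipeline above wraps the translator with reversible pre/post-processing: each filter's encoder strips or masks strings the model should not translate (special-character noise, chevron markup, near-duplicate numbered strings, pinyin words), and the decoders restore them in reverse order. A hypothetical round-trip sketch, assuming model.py is importable (it depends on an external `modules.file` package providing ExcelFileWriter) and that pinyin.txt sits next to it as uploaded here; the segment strings are made-up examples:

```python
# Hypothetical usage sketch of the filter pipeline defined in model.py.
from model import FilterPipeline, SpecialTokenFilter, ChevronsFilter, SimilarFilter, ChineseFilter

pipeline = FilterPipeline([SpecialTokenFilter(), ChevronsFilter(), SimilarFilter(), ChineseFilter()])

segments = ["保存<b>文件</b>", "!!", "Beijing"]          # "save <b>file</b>", noise, a pinyin word
encoded = pipeline.batch_encoder(segments)                # ['保存#文件#'] — markup masked, noise and pinyin removed
translated = ["Save #file#"]                              # stand-in for the model's translation of the encoded batch
restored = pipeline.batch_decoder(translated)             # ['Save <b>file</b>', '!!', 'Beijing']
print(restored)
```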
pinyin.txt ADDED
@@ -0,0 +1,408 @@
a
ai
an
ang
ao
ba
bai
ban
bang
bao
bei
ben
beng
bi
bian
biao
bie
bin
bing
bo
bu
ca
cai
can
cang
cao
ce
cen
ceng
cha
chai
chan
chang
chao
che
chen
cheng
chi
chong
chou
chu
chua
chuai
chuan
chuang
chui
chun
chuo
ci
cong
cou
cu
cuan
cui
cun
cuo
da
dai
dan
dang
dao
de
dei
den
deng
di
dia
dian
diao
die
ding
diu
dong
dou
du
duan
dui
dun
duo
e
ei
en
eng
er
fa
fan
fang
fei
fen
feng
fo
fou
fu
ga
gai
gan
gang
gao
ge
gei
gen
geng
gong
gou
gu
gua
guai
guan
guang
gui
gun
guo
ha
hai
han
hang
hao
he
hei
hen
heng
hong
hou
hu
hua
huai
huan
huang
hui
hun
huo
ji
jia
jian
jiang
jiao
jie
jin
jing
jiong
jiu
ju
juan
jue
jun
ka
kai
kan
kang
kao
ke
ken
keng
kong
kou
ku
kua
kuai
kuan
kuang
kui
kun
kuo
la
lai
lan
lang
lao
le
lei
leng
li
lia
lian
liang
liao
lie
lin
ling
liu
long
lou
lu
luan

lüe
lun
luo
ma
mai
man
mang
mao
me
mei
men
meng
mi
mian
miao
mie
min
ming
miu
mo
mou
mu
na
nai
nan
nang
nao
ne
nei
nen
neng
ni
nian
niang
niao
nie
nin
ning
niu
nong
nou
nu

nuan
nüe
nuo
nun
o
ou
pa
pai
pan
pang
pao
pei
pen
peng
pi
pian
piao
pie
pin
ping
po
pou
pu
qi
qia
qian
qiang
qiao
qie
qin
qing
qiong
qiu
qu
quan
que
qun
ran
rang
rao
re
ren
reng
ri
rong
rou
ru
ruan
rui
run
ruo
sa
sai
san
sang
sao
se
sen
seng
sha
shai
shan
shang
shao
she
shei
shen
sheng
shi
shou
shu
shua
shuai
shuan
shuang
shui
shun
shuo
si
song
sou
su
suan
sui
sun
suo
ta
tai
tan
tang
tao
te
teng
ti
tian
tiao
tie
ting
tong
tou
tu
tuan
tui
tun
tuo
wa
wai
wan
wang
wei
wen
weng
wo
wu
xi
xia
xian
xiang
xiao
xie
xin
xing
xiong
xiu
xu
xuan
xue
xun
ya
yan
yang
yao
ye
yi
yin
ying
yo
yong
you
yu
yuan
yue
yun
za
zai
zan
zang
zao
ze
zei
zen
zeng
zha
zhai
zhan
zhang
zhao
zhe
zhei
zhen
zheng
zhi
zhong
zhou
zhu
zhua
zhuai
zhuan
zhuang
zhui
zhun
zhuo
zi
zong
zou
zu
zuan
zui
zun
zuo
support_language.json ADDED
@@ -0,0 +1,210 @@
{
    "original_language": [
        "Achinese (Arabic script)",
        "Achinese (Latin script)",
        "Afrikaans",
        "Akan",
        "Amharic",
        "Arabic",
        "Armenian",
        "Assamese",
        "Asturian",
        "Awadhi",
        "Balinese",
        "Bambara",
        "Banjar (Arabic script)",
        "Banjar (Latin script)",
        "Bashkir",
        "Basque",
        "Belarusian",
        "Bemba",
        "Bengali",
        "Bhojpuri",
        "Bosnian",
        "Buginese",
        "Bulgarian",
        "Catalan",
        "Cebuano",
        "Central Aymara",
        "Central Kurdish",
        "Chhattisgarhi",
        "Chinese",
        "Chokwe",
        "Crimean Tatar",
        "Croatian",
        "Czech",
        "Danish",
        "Dinka",
        "Dutch",
        "Dzongkha",
        "Egyptian Arabic",
        "English",
        "Esperanto",
        "Estonian",
        "Ewe",
        "Faroese",
        "Fijian",
        "Finnish",
        "Fon",
        "French",
        "Friulian",
        "Galician",
        "German",
        "Greek",
        "Guarani",
        "Gujarati",
        "Haitian Creole",
        "Hausa",
        "Hebrew",
        "Hindi",
        "Hungarian",
        "Icelandic",
        "Igbo",
        "Iloko",
        "Indonesian",
        "Irish",
        "Italian",
        "Japanese",
        "Javanese",
        "Jula",
        "Kabyle",
        "Kachin",
        "Kazakh",
        "Khmer",
        "Korean",
        "Lithuanian",
        "Malayalam",
        "Marathi",
        "Mesopotamian Arabic",
        "Moroccan Arabic",
        "Najdi Arabic",
        "Nepali",
        "Nigerian Fulfulde",
        "North Azerbaijani",
        "North Levantine Arabic",
        "Persian",
        "Polish",
        "Portuguese",
        "Russian",
        "Scottish Gaelic",
        "Sinhala",
        "South Azerbaijani",
        "South Levantine Arabic",
        "Spanish",
        "Standard Arabic",
        "Ta'izzi-Adeni Arabic",
        "Tamil",
        "Thai",
        "Tibetan",
        "Tunisian Arabic",
        "Turkish",
        "Ukrainian",
        "Urdu",
        "Vietnamese",
        "Welsh"
    ],
    "target_language": [
        "Achinese (Arabic script)",
        "Achinese (Latin script)",
        "Afrikaans",
        "Akan",
        "Amharic",
        "Arabic",
        "Armenian",
        "Assamese",
        "Asturian",
        "Awadhi",
        "Balinese",
        "Bambara",
        "Banjar (Arabic script)",
        "Banjar (Latin script)",
        "Bashkir",
        "Basque",
        "Belarusian",
        "Bemba",
        "Bengali",
        "Bhojpuri",
        "Bosnian",
        "Buginese",
        "Bulgarian",
        "Catalan",
        "Cebuano",
        "Central Aymara",
        "Central Kurdish",
        "Chhattisgarhi",
        "Chinese",
        "Chokwe",
        "Crimean Tatar",
        "Croatian",
        "Czech",
        "Danish",
        "Dinka",
        "Dutch",
        "Dzongkha",
        "Egyptian Arabic",
        "English",
        "Esperanto",
        "Estonian",
        "Ewe",
        "Faroese",
        "Fijian",
        "Finnish",
        "Fon",
        "French",
        "Friulian",
        "Galician",
        "German",
        "Greek",
        "Guarani",
        "Gujarati",
        "Haitian Creole",
        "Hausa",
        "Hebrew",
        "Hindi",
        "Hungarian",
        "Icelandic",
        "Igbo",
        "Iloko",
        "Indonesian",
        "Irish",
        "Italian",
        "Japanese",
        "Javanese",
        "Jula",
        "Kabyle",
        "Kachin",
        "Kazakh",
        "Khmer",
        "Korean",
        "Lithuanian",
        "Malayalam",
        "Marathi",
        "Mesopotamian Arabic",
        "Moroccan Arabic",
        "Najdi Arabic",
        "Nepali",
        "Nigerian Fulfulde",
        "North Azerbaijani",
        "North Levantine Arabic",
        "Persian",
        "Polish",
        "Portuguese",
        "Russian",
        "Scottish Gaelic",
        "Sinhala",
        "South Azerbaijani",
        "South Levantine Arabic",
        "Spanish",
        "Standard Arabic",
        "Ta'izzi-Adeni Arabic",
        "Tamil",
        "Thai",
        "Tibetan",
        "Tunisian Arabic",
        "Turkish",
        "Ukrainian",
        "Urdu",
        "Vietnamese",
        "Welsh"
    ]
}
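support_language.json simply enumerates the source and target languages the surrounding application offers, and its keys appear to correspond to the original_language / target_languages arguments of Model.generate. A small sketch of reading it, with the path taken relative to a checkout of this repository:

```python
# Sketch: load the supported-language lists shipped in this commit.
import json

with open("support_language.json", encoding="utf-8") as f:
    supported = json.load(f)

print(len(supported["original_language"]), "source languages")
print(len(supported["target_language"]), "target languages")
```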