Upload 4 files

Files changed:
- README.md (+12 -199)
- model.py (+416 -0)
- pinyin.txt (+408 -0)
- support_language.json (+210 -0)

README.md CHANGED
@@ -1,199 +1,12 @@
-### Model Description
-
-<!-- Provide a longer summary of what this model is. -->
-
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-
-### Model Sources [optional]
-
-<!-- Provide the basic links for the model. -->
-
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-
-## Uses
-
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
-### Direct Use
-
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
-[More Information Needed]
-
-### Downstream Use [optional]
-
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
-[More Information Needed]
-
-### Out-of-Scope Use
-
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
-[More Information Needed]
-
-## Bias, Risks, and Limitations
-
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
-[More Information Needed]
-
-### Recommendations
-
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
-## How to Get Started with the Model
-
-Use the code below to get started with the model.
-
-[More Information Needed]
-
-## Training Details
-
-### Training Data
-
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
-[More Information Needed]
-
-### Training Procedure
-
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
-#### Preprocessing [optional]
-
-[More Information Needed]
-
-#### Training Hyperparameters
-
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
-#### Speeds, Sizes, Times [optional]
-
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
-[More Information Needed]
-
-## Evaluation
-
-<!-- This section describes the evaluation protocols and provides the results. -->
-
-### Testing Data, Factors & Metrics
-
-#### Testing Data
-
-<!-- This should link to a Dataset Card if possible. -->
-
-[More Information Needed]
-
-#### Factors
-
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
-[More Information Needed]
-
-#### Metrics
-
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
-[More Information Needed]
-
-### Results
-
-[More Information Needed]
-
-#### Summary
-
-## Model Examination [optional]
-
-<!-- Relevant interpretability work for the model goes here -->
-
-[More Information Needed]
-
-## Environmental Impact
-
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-
-## Technical Specifications [optional]
-
-### Model Architecture and Objective
-
-[More Information Needed]
-
-### Compute Infrastructure
-
-[More Information Needed]
-
-#### Hardware
-
-[More Information Needed]
-
-#### Software
-
-[More Information Needed]
-
-## Citation [optional]
-
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
-**BibTeX:**
-
-[More Information Needed]
-
-**APA:**
-
-[More Information Needed]
-
-## Glossary [optional]
-
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
-[More Information Needed]
-
-## More Information [optional]
-
-[More Information Needed]
-
-## Model Card Authors [optional]
-
-[More Information Needed]
-
-## Model Card Contact
-
-[More Information Needed]
+# Qwen2-7B-Instruct-Full-Finetune
+
+Qwen2-7B-Instruct-Full-Finetune is a model for multilingual instruction-following tasks. Developed to bridge gaps in multilingual conversational AI, it focuses on understanding and generating contextually accurate responses across diverse languages and dialects, so that developers, researchers, and businesses can add natural, fluent multilingual interaction to their applications.
+
+## Introduction
+
+Qwen2-7B-Instruct-Full-Finetune is an effort to expand the capabilities of conversational AI while staying inclusive, flexible, and easy to integrate. Built with a focus on multilingual understanding, it relies on large-scale fine-tuning and instruction-based training to produce nuanced, context-aware responses. The project supports language diversity and inclusivity, with use cases ranging from customer support and education to content creation and translation.
+
+## Features
+
+- **Multilingual Understanding:** supports a wide range of languages, including many that are underrepresented in standard AI models, with the goal of accurate and culturally aware responses across diverse linguistic contexts.
+- **Contextual and Instruction-Based Responses:** trained to follow complex instructions and maintain conversational flow, so responses stay relevant and engaging across a variety of instructional tasks.
+- **Scalability and Integration:** designed for straightforward deployment in chatbots, virtual assistants, customer support systems, and other digital products that need multilingual interaction.
+- **Open Source and Community Driven:** available for the developer community to use, adapt, and extend, in line with open-science principles.
+
+This README is a guide for building inclusive, multilingual, instruction-following AI applications on top of Qwen2-7B-Instruct-Full-Finetune; a minimal loading sketch follows below.
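Since the new card does not yet include a quick-start snippet, here is a minimal, illustrative sketch of loading an instruction-tuned Qwen2 checkpoint with 🤗 Transformers and running one chat turn. The repo id below is a placeholder for wherever this fine-tune is hosted, and the prompt and generation settings are arbitrary.

```python
# Minimal loading sketch. Assumption: the checkpoint follows the standard Qwen2 chat template;
# the repo id below is a placeholder, not the actual location of this fine-tune.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Qwen/Qwen2-7B-Instruct"  # placeholder: substitute the repo id of this fine-tune

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to(device)

messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "Translate to French: The invoice has been approved."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens (everything after the prompt).
reply = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(reply)
```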
model.py ADDED
@@ -0,0 +1,416 @@
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from modules.file import ExcelFileWriter
import os

from abc import ABC, abstractmethod
from typing import List
import re

class FilterPipeline():
    def __init__(self, filter_list):
        self._filter_list:List[Filter] = filter_list

    def append(self, filter):
        self._filter_list.append(filter)

    def batch_encoder(self, inputs):
        for filter in self._filter_list:
            inputs = filter.encoder(inputs)
        return inputs

    def batch_decoder(self, inputs):
        for filter in reversed(self._filter_list):
            inputs = filter.decoder(inputs)
        return inputs

class Filter(ABC):
    # Abstract base class defining the basic filter interface
    def __init__(self):
        self.name = 'filter'  # name of the filter
        self.code = []        # stores the filtered/encoded information

    @abstractmethod
    def encoder(self, inputs):
        # Abstract method: encode or filter the inputs
        pass

    @abstractmethod
    def decoder(self, inputs):
        # Abstract method: decode or restore the inputs
        pass

class SpecialTokenFilter(Filter):
    # Special-character filter: removes strings made up only of special characters
    def __init__(self):
        self.name = 'special token filter'
        self.code = []
        self.special_tokens = ['!', '!', '-']  # set of special characters

    def encoder(self, inputs):
        # Filter out strings that consist solely of special characters
        filtered_inputs = []
        self.code = []
        for i, input_str in enumerate(inputs):
            if not all(char in self.special_tokens for char in input_str):
                filtered_inputs.append(input_str)
            else:
                self.code.append([i, input_str])  # remember the position and content of the removed string
        return filtered_inputs

    def decoder(self, inputs):
        # Restore the filtered special-character strings
        original_inputs = inputs.copy()
        for removed_indice in self.code:
            original_inputs.insert(removed_indice[0], removed_indice[1])  # re-insert at the original position
        return original_inputs

class SperSignFilter(Filter):
    # Percent-sign filter: handles strings containing '%s'
    def __init__(self):
        self.name = 's percentage sign filter'
        self.code = []

    def encoder(self, inputs):
        # Replace '%s' with '*'
        encoded_inputs = []
        self.code = []
        for i, input_str in enumerate(inputs):
            if '%s' in input_str:
                encoded_str = input_str.replace('%s', '*')
                self.code.append(i)  # remember positions of strings containing '%s'
            else:
                encoded_str = input_str
            encoded_inputs.append(encoded_str)
        return encoded_inputs

    def decoder(self, inputs):
        # Restore '*' back to '%s'
        decoded_inputs = inputs.copy()
        for i in self.code:
            decoded_inputs[i] = decoded_inputs[i].replace('*', '%s')
        return decoded_inputs

class ParenSParenFilter(Filter):
    # '(s)' filter: handles strings containing '(s)'
    def __init__(self):
        self.name = 'Paren s paren filter'
        self.code = []

    def encoder(self, inputs):
        # Replace '(s)' with '$'
        encoded_inputs = []
        self.code = []
        for i, input_str in enumerate(inputs):
            if '(s)' in input_str:
                encoded_str = input_str.replace('(s)', '$')
                self.code.append(i)  # remember positions of strings containing '(s)'
            else:
                encoded_str = input_str
            encoded_inputs.append(encoded_str)
        return encoded_inputs

    def decoder(self, inputs):
        # Restore '$' back to '(s)'
        decoded_inputs = inputs.copy()
        for i in self.code:
            decoded_inputs[i] = decoded_inputs[i].replace('$', '(s)')
        return decoded_inputs

class ChevronsFilter(Filter):
    # Chevrons filter: handles strings containing '<...>' segments
    def __init__(self):
        self.name = 'chevrons filter'
        self.code = []

    def encoder(self, inputs):
        # Replace the content inside chevrons with '#'
        encoded_inputs = []
        self.code = []
        pattern = re.compile(r'<.*?>')
        for i, input_str in enumerate(inputs):
            if pattern.search(input_str):
                matches = pattern.findall(input_str)
                encoded_str = pattern.sub('#', input_str)
                self.code.append((i, matches))  # remember the positions and the matched content
            else:
                encoded_str = input_str
            encoded_inputs.append(encoded_str)
        return encoded_inputs

    def decoder(self, inputs):
        # Restore each '#' back to the original chevron content
        decoded_inputs = inputs.copy()
        for i, matches in self.code:
            for match in matches:
                decoded_inputs[i] = decoded_inputs[i].replace('#', match, 1)
        return decoded_inputs

class SimilarFilter(Filter):
    # Similar-string filter: handles consecutive strings that differ only in digits
    def __init__(self):
        self.name = 'similar filter'
        self.code = []

    def is_similar(self, str1, str2):
        # Two strings are "similar" if they are equal after removing digits
        pattern = re.compile(r'\d+')
        return pattern.sub('', str1) == pattern.sub('', str2)

    def encoder(self, inputs):
        # Detect runs of consecutive similar strings and record their index and content
        encoded_inputs = []
        self.code = []
        i = 0
        while i < len(inputs):
            encoded_inputs.append(inputs[i])
            similar_strs = [inputs[i]]
            j = i + 1
            while j < len(inputs) and self.is_similar(inputs[i], inputs[j]):
                similar_strs.append(inputs[j])
                j += 1
            if len(similar_strs) > 1:
                self.code.append((i, similar_strs))
            i = j
        return encoded_inputs

    def decoder(self, inputs):
        # Re-insert the detected similar strings at their original positions
        decoded_inputs = inputs
        for i, similar_strs in self.code:
            pattern = re.compile(r'\d+')
            for j in range(len(similar_strs)):
                if pattern.search(similar_strs[j]):
                    number = re.findall(r'\d+', similar_strs[j])[0]
                    new_str = pattern.sub(number, inputs[i])
                else:
                    new_str = inputs[i]
                if j > 0:
                    decoded_inputs.insert(i + j, new_str)
        return decoded_inputs

class ChineseFilter:
    # Chinese pinyin filter: detects and filters words that are pure pinyin.
    # Duck-typed to match the Filter interface (encoder/decoder) without inheriting from it.
    def __init__(self, pinyin_lib_file='pinyin.txt'):
        self.name = 'chinese filter'
        self.code = []
        self.pinyin_lib = self.load_pinyin_lib(pinyin_lib_file)  # load the pinyin library

    def load_pinyin_lib(self, file_path):
        # Load the pinyin library file into memory
        with open(os.path.join(script_dir, file_path), 'r', encoding='utf-8') as f:
            return set(line.strip().lower() for line in f)

    def is_valid_chinese(self, word):
        # A word qualifies if it is a single word and starts with an uppercase letter
        if len(word.split()) == 1 and word[0].isupper():
            return self.is_pinyin(word.lower())
        return False

    def encoder(self, inputs):
        # Detect and filter out words that look like pinyin
        encoded_inputs = []
        self.code = []
        for i, word in enumerate(inputs):
            if self.is_valid_chinese(word):
                self.code.append((i, word))  # remember the pinyin word and its index
            else:
                encoded_inputs.append(word)
        return encoded_inputs

    def decoder(self, inputs):
        # Restore the pinyin words to their original positions
        decoded_inputs = inputs.copy()
        for i, word in self.code:
            decoded_inputs.insert(i, word)
        return decoded_inputs

    def is_pinyin(self, string):
        # Check whether a string can be segmented entirely into pinyin (or English) syllables
        string = string.lower()
        stringlen = len(string)
        max_len = 6
        result = []
        n = 0
        while n < stringlen:
            matched = 0
            temp_result = []
            for i in range(max_len, 0, -1):
                s = string[0:i]
                if s in self.pinyin_lib:
                    temp_result.append(string[:i])
                    matched = i
                    break
                if i == 1 and len(temp_result) == 0:
                    return False
            result.extend(temp_result)
            string = string[matched:]
            n += matched
        return True

# Path of this script's directory, used when loading the pinyin file
script_dir = os.path.dirname(os.path.abspath(__file__))
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(script_dir)))


class Model():
    def __init__(self, modelname, selected_lora_model, selected_gpu):
        def get_gpu_index(gpu_info, target_gpu_name):
            """
            Get the index of the target GPU from the GPU info list.

            Args:
                gpu_info (list): list of GPU names
                target_gpu_name (str): name of the target GPU

            Returns:
                int: index of the target GPU, or -1 if not found
            """
            for i, name in enumerate(gpu_info):
                if target_gpu_name.lower() in name.lower():
                    return i
            return -1
        if selected_gpu != "cpu":
            gpu_count = torch.cuda.device_count()
            gpu_info = [torch.cuda.get_device_name(i) for i in range(gpu_count)]
            selected_gpu_index = get_gpu_index(gpu_info, selected_gpu)
            self.device_name = f"cuda:{selected_gpu_index}"
        else:
            self.device_name = "cpu"
        print("device_name", self.device_name)
        self.model = AutoModelForCausalLM.from_pretrained(modelname, torch_dtype="auto").to(self.device_name)
        self.tokenizer = AutoTokenizer.from_pretrained(modelname)
        # self.translator = pipeline('translation', model=self.original_model, tokenizer=self.tokenizer, src_lang=original_language, tgt_lang=target_language, device=device)

    def generate(self, inputs, original_language, target_languages, max_batch_size):
        filter_list = [SpecialTokenFilter(), ChevronsFilter(), SimilarFilter(), ChineseFilter()]
        filter_pipeline = FilterPipeline(filter_list)
        def process_gpu_translate_result(temp_outputs):
            outputs = []
            for temp_output in temp_outputs:
                length = len(temp_output[0]["generated_translation"])
                for i in range(length):
                    temp = []
                    for trans in temp_output:
                        temp.append({
                            "target_language": trans["target_language"],
                            "generated_translation": trans['generated_translation'][i],
                        })
                    outputs.append(temp)
            excel_writer = ExcelFileWriter()
            excel_writer.write_text(os.path.join(parent_dir, r"temp/empty.xlsx"), outputs, 'A', 1, len(outputs))
        if self.device_name == "cpu":
            # Note: this branch relies on tokenizer.lang_code_to_id / forced_bos_token_id, i.e. a
            # seq2seq translation tokenizer; language_mapping is expected to be defined elsewhere
            # in the project (it is not defined or imported in this file).
            # Tokenize input
            input_ids = self.tokenizer(inputs, return_tensors="pt", padding=True, max_length=128).to(self.device_name)
            output = []
            for target_language in target_languages:
                # Get language code for the target language
                target_lang_code = self.tokenizer.lang_code_to_id[language_mapping(target_language)]
                # Generate translation
                generated_tokens = self.model.generate(
                    **input_ids,
                    forced_bos_token_id=target_lang_code,
                    max_length=128
                )
                generated_translation = self.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
                # Append result to output
                output.append({
                    "target_language": target_language,
                    "generated_translation": generated_translation,
                })
            outputs = []
            length = len(output[0]["generated_translation"])
            for i in range(length):
                temp = []
                for trans in output:
                    temp.append({
                        "target_language": trans["target_language"],
                        "generated_translation": trans['generated_translation'][i],
                    })
                outputs.append(temp)
            return outputs
        else:
            # Max batch size = available GPU memory in bytes / 4 / (tensor size + trainable parameters)
            # max_batch_size = 10
            # Ensure batch size is within model limits:
            print("length of inputs: ", len(inputs))
            batch_size = min(len(inputs), int(max_batch_size))
            batches = [inputs[i:i + batch_size] for i in range(0, len(inputs), batch_size)]
            print("length of batches size: ", len(batches))
            temp_outputs = []
            processed_num = 0
            for index, batch in enumerate(batches):
                # Tokenize input
                print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
                print(len(batch))
                print(batch)
                batch = filter_pipeline.batch_encoder(batch)
                print(batch)
                temp = []
                if len(batch) > 0:
                    for target_language in target_languages:
                        # System prompt (in Chinese): "You are an ERP-system Chinese-to-English translation
                        # expert; translate the markdown text from {original_language} to {target_language},
                        # keep its format and do not add extra content."
                        batch_messages = [[
                            {"role": "system", "content": f"你是一个ERP系统中译英专家,你任务是把markdown格式的文本,保留其格式并从{original_language}翻译成{target_language},不要添加多余的内容。"},
                            {"role": "user", "content": input},
                        ] for input in batch]
                        batch_texts = [self.tokenizer.apply_chat_template(
                            messages,
                            tokenize=False,
                            add_generation_prompt=True
                        ) for messages in batch_messages]
                        self.tokenizer.padding_side = "left"
                        model_inputs = self.tokenizer(
                            batch_texts,
                            return_tensors="pt",
                            padding="longest",
                            truncation=True,
                        ).to(self.device_name)
                        generated_ids = self.model.generate(
                            max_new_tokens=512,
                            **model_inputs
                        )
                        # Keep only the newly generated tokens for each sequence
                        new_tokens = [
                            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
                        ]
                        generated_translation = self.tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
                        # Append result to output
                        temp.append({
                            "target_language": target_language,
                            "generated_translation": generated_translation,
                        })
                        model_inputs.to('cpu')
                        del model_inputs
                else:
                    # Every string in this batch was filtered out; restore the originals instead of translating
                    for target_language in target_languages:
                        generated_translation = filter_pipeline.batch_decoder(batch)
                        print(generated_translation)
                        print(len(generated_translation))
                        # Append result to output
                        temp.append({
                            "target_language": target_language,
                            "generated_translation": generated_translation,
                        })
                temp_outputs.append(temp)
                processed_num += len(batch)
                if (index + 1) * max_batch_size // 1000 - index * max_batch_size // 1000 == 1:
                    print("Already processed number: ", len(temp_outputs))
                    process_gpu_translate_result(temp_outputs)
            outputs = []
            for temp_output in temp_outputs:
                length = len(temp_output[0]["generated_translation"])
                for i in range(length):
                    temp = []
                    for trans in temp_output:
                        temp.append({
                            "target_language": trans["target_language"],
                            "generated_translation": trans['generated_translation'][i],
                        })
                    outputs.append(temp)
            return outputs
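The filters in model.py are meant to be reversible placeholders: `batch_encoder` masks or removes fragments that could confuse the translator, and `batch_decoder` undoes the transformations in reverse order on the translated output. Below is a small, self-contained sketch of that round trip; the two classes are simplified stand-ins inspired by ChevronsFilter and FilterPipeline, not the classes from model.py, so the snippet runs without the rest of the repository.

```python
# Self-contained sketch of the encode -> translate -> decode round trip used by FilterPipeline.
import re

class TagMaskFilter:
    """Masks <...> segments with '#' before translation and restores them afterwards."""
    def __init__(self):
        self.code = []
        self.pattern = re.compile(r'<.*?>')

    def encoder(self, inputs):
        self.code = []
        out = []
        for i, s in enumerate(inputs):
            matches = self.pattern.findall(s)
            if matches:
                self.code.append((i, matches))
                s = self.pattern.sub('#', s)
            out.append(s)
        return out

    def decoder(self, inputs):
        out = inputs.copy()
        for i, matches in self.code:
            for m in matches:
                out[i] = out[i].replace('#', m, 1)
        return out

class Pipeline:
    def __init__(self, filters):
        self.filters = filters

    def batch_encoder(self, inputs):
        for f in self.filters:
            inputs = f.encoder(inputs)
        return inputs

    def batch_decoder(self, inputs):
        for f in reversed(self.filters):  # undo the transformations in reverse order
            inputs = f.decoder(inputs)
        return inputs

pipe = Pipeline([TagMaskFilter()])
batch = ["Open the <b>orders</b> page", "Totals"]
masked = pipe.batch_encoder(batch)        # ['Open the #orders# page', 'Totals'] is what the model would see
translated = [s.upper() for s in masked]  # pretend "translation"
print(pipe.batch_decoder(translated))     # the <b>...</b> tags are restored at their original spots
```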
pinyin.txt ADDED
@@ -0,0 +1,408 @@
a
ai
an
ang
ao
ba
bai
ban
bang
bao
bei
ben
beng
bi
bian
biao
bie
bin
bing
bo
bu
ca
cai
can
cang
cao
ce
cen
ceng
cha
chai
chan
chang
chao
che
chen
cheng
chi
chong
chou
chu
chua
chuai
chuan
chuang
chui
chun
chuo
ci
cong
cou
cu
cuan
cui
cun
cuo
da
dai
dan
dang
dao
de
dei
den
deng
di
dia
dian
diao
die
ding
diu
dong
dou
du
duan
dui
dun
duo
e
ei
en
eng
er
fa
fan
fang
fei
fen
feng
fo
fou
fu
ga
gai
gan
gang
gao
ge
gei
gen
geng
gong
gou
gu
gua
guai
guan
guang
gui
gun
guo
ha
hai
han
hang
hao
he
hei
hen
heng
hong
hou
hu
hua
huai
huan
huang
hui
hun
huo
ji
jia
jian
jiang
jiao
jie
jin
jing
jiong
jiu
ju
juan
jue
jun
ka
kai
kan
kang
kao
ke
ken
keng
kong
kou
ku
kua
kuai
kuan
kuang
kui
kun
kuo
la
lai
lan
lang
lao
le
lei
leng
li
lia
lian
liang
liao
lie
lin
ling
liu
long
lou
lu
luan
lü
lüe
lun
luo
ma
mai
man
mang
mao
me
mei
men
meng
mi
mian
miao
mie
min
ming
miu
mo
mou
mu
na
nai
nan
nang
nao
ne
nei
nen
neng
ni
nian
niang
niao
nie
nin
ning
niu
nong
nou
nu
nü
nuan
nüe
nuo
nun
o
ou
pa
pai
pan
pang
pao
pei
pen
peng
pi
pian
piao
pie
pin
ping
po
pou
pu
qi
qia
qian
qiang
qiao
qie
qin
qing
qiong
qiu
qu
quan
que
qun
ran
rang
rao
re
ren
reng
ri
rong
rou
ru
ruan
rui
run
ruo
sa
sai
san
sang
sao
se
sen
seng
sha
shai
shan
shang
shao
she
shei
shen
sheng
shi
shou
shu
shua
shuai
shuan
shuang
shui
shun
shuo
si
song
sou
su
suan
sui
sun
suo
ta
tai
tan
tang
tao
te
teng
ti
tian
tiao
tie
ting
tong
tou
tu
tuan
tui
tun
tuo
wa
wai
wan
wang
wei
wen
weng
wo
wu
xi
xia
xian
xiang
xiao
xie
xin
xing
xiong
xiu
xu
xuan
xue
xun
ya
yan
yang
yao
ye
yi
yin
ying
yo
yong
you
yu
yuan
yue
yun
za
zai
zan
zang
zao
ze
zei
zen
zeng
zha
zhai
zhan
zhang
zhao
zhe
zhei
zhen
zheng
zhi
zhong
zhou
zhu
zhua
zhuai
zhuan
zhuang
zhui
zhun
zhuo
zi
zong
zou
zu
zuan
zui
zun
zuo
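For context, model.py loads this file into a set (one syllable per line) and uses greedy longest-prefix matching, up to 6 characters at a time, to decide whether a capitalized word is untranslated pinyin. The snippet below is a small, self-contained sketch of that check; it inlines only a handful of syllables from the list above rather than reading the file, and it simplifies the bookkeeping in ChineseFilter.is_pinyin.

```python
# Sketch of the greedy pinyin segmentation performed by ChineseFilter.is_pinyin.
# In the real code the syllable set is read from pinyin.txt; a small subset is inlined here.
PINYIN = {"bei", "jing", "shang", "hai", "zhong", "guo"}

def is_pinyin(word: str, max_len: int = 6) -> bool:
    """Return True if `word` can be split entirely into syllables from PINYIN."""
    word = word.lower()
    while word:
        for i in range(min(max_len, len(word)), 0, -1):  # try the longest prefix first
            if word[:i] in PINYIN:
                word = word[i:]
                break
        else:
            return False  # no prefix matched, so this is not pure pinyin
    return True

print(is_pinyin("Beijing"))  # True  -> the word would be held back from translation
print(is_pinyin("Invoice"))  # False -> the word goes through the translator
```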
support_language.json ADDED
@@ -0,0 +1,210 @@
{
    "original_language":[
        "Achinese (Arabic script)",
        "Achinese (Latin script)",
        "Afrikaans",
        "Akan",
        "Amharic",
        "Arabic",
        "Armenian",
        "Assamese",
        "Asturian",
        "Awadhi",
        "Balinese",
        "Bambara",
        "Banjar (Arabic script)",
        "Banjar (Latin script)",
        "Bashkir",
        "Basque",
        "Belarusian",
        "Bemba",
        "Bengali",
        "Bhojpuri",
        "Bosnian",
        "Buginese",
        "Bulgarian",
        "Catalan",
        "Cebuano",
        "Central Aymara",
        "Central Kurdish",
        "Chhattisgarhi",
        "Chinese",
        "Chokwe",
        "Crimean Tatar",
        "Croatian",
        "Czech",
        "Danish",
        "Dinka",
        "Dutch",
        "Dzongkha",
        "Egyptian Arabic",
        "English",
        "Esperanto",
        "Estonian",
        "Ewe",
        "Faroese",
        "Fijian",
        "Finnish",
        "Fon",
        "French",
        "Friulian",
        "Galician",
        "German",
        "Greek",
        "Guarani",
        "Gujarati",
        "Haitian Creole",
        "Hausa",
        "Hebrew",
        "Hindi",
        "Hungarian",
        "Icelandic",
        "Igbo",
        "Iloko",
        "Indonesian",
        "Irish",
        "Italian",
        "Japanese",
        "Javanese",
        "Jula",
        "Kabyle",
        "Kachin",
        "Kazakh",
        "Khmer",
        "Korean",
        "Lithuanian",
        "Malayalam",
        "Marathi",
        "Mesopotamian Arabic",
        "Moroccan Arabic",
        "Najdi Arabic",
        "Nepali",
        "Nigerian Fulfulde",
        "North Azerbaijani",
        "North Levantine Arabic",
        "Persian",
        "Polish",
        "Portuguese",
        "Russian",
        "Scottish Gaelic",
        "Sinhala",
        "South Azerbaijani",
        "South Levantine Arabic",
        "Spanish",
        "Standard Arabic",
        "Ta'izzi-Adeni Arabic",
        "Tamil",
        "Thai",
        "Tibetan",
        "Tunisian Arabic",
        "Turkish",
        "Ukrainian",
        "Urdu",
        "Vietnamese",
        "Welsh"
    ],
    "target_language":[
        "Achinese (Arabic script)",
        "Achinese (Latin script)",
        "Afrikaans",
        "Akan",
        "Amharic",
        "Arabic",
        "Armenian",
        "Assamese",
        "Asturian",
        "Awadhi",
        "Balinese",
        "Bambara",
        "Banjar (Arabic script)",
        "Banjar (Latin script)",
        "Bashkir",
        "Basque",
        "Belarusian",
        "Bemba",
        "Bengali",
        "Bhojpuri",
        "Bosnian",
        "Buginese",
        "Bulgarian",
        "Catalan",
        "Cebuano",
        "Central Aymara",
        "Central Kurdish",
        "Chhattisgarhi",
        "Chinese",
        "Chokwe",
        "Crimean Tatar",
        "Croatian",
        "Czech",
        "Danish",
        "Dinka",
        "Dutch",
        "Dzongkha",
        "Egyptian Arabic",
        "English",
        "Esperanto",
        "Estonian",
        "Ewe",
        "Faroese",
        "Fijian",
        "Finnish",
        "Fon",
        "French",
        "Friulian",
        "Galician",
        "German",
        "Greek",
        "Guarani",
        "Gujarati",
        "Haitian Creole",
        "Hausa",
        "Hebrew",
        "Hindi",
        "Hungarian",
        "Icelandic",
        "Igbo",
        "Iloko",
        "Indonesian",
        "Irish",
        "Italian",
        "Japanese",
        "Javanese",
        "Jula",
        "Kabyle",
        "Kachin",
        "Kazakh",
        "Khmer",
        "Korean",
        "Lithuanian",
        "Malayalam",
        "Marathi",
        "Mesopotamian Arabic",
        "Moroccan Arabic",
        "Najdi Arabic",
        "Nepali",
        "Nigerian Fulfulde",
        "North Azerbaijani",
        "North Levantine Arabic",
        "Persian",
        "Polish",
        "Portuguese",
        "Russian",
        "Scottish Gaelic",
        "Sinhala",
        "South Azerbaijani",
        "South Levantine Arabic",
        "Spanish",
        "Standard Arabic",
        "Ta'izzi-Adeni Arabic",
        "Tamil",
        "Thai",
        "Tibetan",
        "Tunisian Arabic",
        "Turkish",
        "Ukrainian",
        "Urdu",
        "Vietnamese",
        "Welsh"
    ]
}
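model.py does not read this file in the code shown above; presumably the surrounding application uses it to populate the source/target language pickers and to validate requests before calling Model.generate. A small illustrative sketch of that kind of check (the relative path mirrors how pinyin.txt is resolved next to the script):

```python
# Illustrative sketch: load support_language.json and validate a translation request.
import json
import os

script_dir = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(script_dir, "support_language.json"), encoding="utf-8") as f:
    supported = json.load(f)

def check_request(original_language: str, target_languages: list) -> None:
    """Raise ValueError if the requested languages are not in support_language.json."""
    if original_language not in supported["original_language"]:
        raise ValueError(f"Unsupported source language: {original_language}")
    for lang in target_languages:
        if lang not in supported["target_language"]:
            raise ValueError(f"Unsupported target language: {lang}")

check_request("Chinese", ["English", "French"])  # passes silently for supported languages
```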