danielsteinigen
commited on
Commit
•
6a58313
1
Parent(s):
127e6ed
add model files
Browse files- README.md +284 -3
- config.json +37 -0
- generation_config.json +8 -0
- gptx_tokenizer.py +435 -0
- model-00001-of-00004.safetensors +3 -0
- model-00002-of-00004.safetensors +3 -0
- model-00003-of-00004.safetensors +3 -0
- model-00004-of-00004.safetensors +3 -0
- model.safetensors.index.json +297 -0
- special_tokens_map.json +264 -0
- tokenizer.model +3 -0
- tokenizer_config.json +292 -0
README.md
CHANGED
@@ -1,3 +1,284 @@
|
|
1 |
-
---
|
2 |
-
|
3 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
language:
|
3 |
+
- de
|
4 |
+
- bg
|
5 |
+
- cs
|
6 |
+
- da
|
7 |
+
- el
|
8 |
+
- en
|
9 |
+
- es
|
10 |
+
- et
|
11 |
+
- fi
|
12 |
+
- fr
|
13 |
+
- ga
|
14 |
+
- hr
|
15 |
+
- hu
|
16 |
+
- it
|
17 |
+
- lt
|
18 |
+
- lv
|
19 |
+
- mt
|
20 |
+
- nl
|
21 |
+
- pl
|
22 |
+
- pt
|
23 |
+
- ro
|
24 |
+
- sl
|
25 |
+
- sv
|
26 |
+
- sk
|
27 |
+
metrics:
|
28 |
+
- accuracy
|
29 |
+
- bleu
|
30 |
+
pipeline_tag: text-generation
|
31 |
+
library_name: transformers
|
32 |
+
base_model:
|
33 |
+
- openGPT-X/Teuken-7B-base-v0.4
|
34 |
+
---
|
35 |
+
# Model Card for Teuken-7B-instruct-v0.4
|
36 |
+
|
37 |
+
|
38 |
+
[Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) is a 7B parameter multilingual large language model (LLM) pre-trained with 4T tokens within the research project OpenGPT-X.
|
39 |
+
Teuken-7B-instruct-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4).
|
40 |
+
|
41 |
+
|
42 |
+
### Model Description
|
43 |
+
|
44 |
+
<!-- Provide a longer summary of what this model is. -->
|
45 |
+
|
46 |
+
- **Developed by:** Fraunhofer, Forschungszentrum Jülich, TU Dresden, DFKI
|
47 |
+
- **Funded by:** German Federal Ministry of Economics and Climate Protection (BMWK) in the context of the OpenGPT-X project
|
48 |
+
- **Model type:** Transformer based decoder-only model
|
49 |
+
- **Language(s) (NLP):** bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
|
50 |
+
- **Shared by:** OpenGPT-X
|
51 |
+
|
52 |
+
## Uses
|
53 |
+
|
54 |
+
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
55 |
+
Teuken-7B-instruct-v0.4 is intended for commercial and research use in all official 24 European languages. Since Teuken-7B-instruct-v0.4 focuses on covering all 24 EU languages, it renders more stable results across these languages and better reflects European values in its answers than English-centric models. It is therefore specialized for use in multilingual tasks.
|
56 |
+
|
57 |
+
## Disclaimer Toxic Content:
|
58 |
+
|
59 |
+
This Language Model (LLM) may generate content that is inappropriate, offensive, or harmful. While the dataset has been heavily filtered to minimize such outputs, the model may still produce text that is biased or toxic due to the large scale and diverse nature of the data.
|
60 |
+
|
61 |
+
|
62 |
+
### Out-of-Scope Use
|
63 |
+
|
64 |
+
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
|
65 |
+
|
66 |
+
The model is not intended for use in math and coding tasks.
|
67 |
+
|
68 |
+
## Bias, Risks, and Limitations
|
69 |
+
|
70 |
+
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
|
71 |
+
|
72 |
+
Teuken-7B-instruct-v0.4 is an instruction-tuned version of [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) that is not completely free from biases and hallucinations.
|
73 |
+
|
74 |
+
## How to Get Started with the Model
|
75 |
+
|
76 |
+
## Usage
|
77 |
+
The model requires transformers, sentencepiece, and the torch library.
|
78 |
+
After installation, here's an example of how to use the model:
|
79 |
+
|
80 |
+
The prompt template for the fine-tuned model is defined as follows:
|
81 |
+
```python
|
82 |
+
user="Hi!"
|
83 |
+
lang_code = "DE"
|
84 |
+
system_messages={
|
85 |
+
"EN": "A chat between a human and an artificial intelligence assistant."
|
86 |
+
" The assistant gives helpful and polite answers to the human's questions.",
|
87 |
+
"DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
|
88 |
+
" Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
|
89 |
+
}
|
90 |
+
|
91 |
+
prompt = f"System: {system_messages[lang_code]}\nUser: {user}\nAssistant:<s>"
|
92 |
+
```
|
93 |
+
|
94 |
+
```python
|
95 |
+
import torch
|
96 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
97 |
+
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
98 |
+
model_name = "openGPT-X/Teuken-7B-instruct-v0.4"
|
99 |
+
model = AutoModelForCausalLM.from_pretrained(
|
100 |
+
model_name,
|
101 |
+
trust_remote_code=True,
|
102 |
+
torch_dtype=torch.bfloat16,
|
103 |
+
attn_implementation="flash_attention_2",
|
104 |
+
)
|
105 |
+
model = model.to(device).eval()
|
106 |
+
tokenizer = AutoTokenizer.from_pretrained(
|
107 |
+
model_name,
|
108 |
+
use_fast=False,
|
109 |
+
trust_remote_code=True,
|
110 |
+
)
|
111 |
+
messages = [{"role": "User", "content": "Wer bist du?"}]
|
112 |
+
prompt_ids = tokenizer.apply_chat_template(messages, chat_template="DE", tokenize=True, add_generation_prompt=True, return_tensors="pt")
|
113 |
+
prediction = model.generate(
|
114 |
+
prompt_ids.to(model.device),
|
115 |
+
max_length=512,
|
116 |
+
do_sample=True,
|
117 |
+
top_k=50,
|
118 |
+
top_p=0.95,
|
119 |
+
temperature=0.7,
|
120 |
+
num_return_sequences=1,
|
121 |
+
)
|
122 |
+
prediction_text = tokenizer.decode(prediction[0])
|
123 |
+
print(prediction_text)
|
124 |
+
```
|
125 |
+
|
126 |
+
This example demonstrates how to load the model and tokenizer, prepare input, generate text, and print the result.
|
127 |
+
|
128 |
+
## Training Details
|
129 |
+
|
130 |
+
### Pre-Training Data
|
131 |
+
|
132 |
+
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
|
133 |
+
|
134 |
+
[Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4) was pre-trained on 4 trillion tokens of data from publicly available sources.
|
135 |
+
The pretraining data has a cutoff of September 2023.
|
136 |
+
More information are available in our [preprint](http://arxiv.org/abs/2410.08800).
|
137 |
+
|
138 |
+
|
139 |
+
### Instruction-Tuning Data
|
140 |
+
|
141 |
+
### English
|
142 |
+
|
143 |
+
| Dataset file | Sample Count |
|
144 |
+
| ----------------------------------------------------- | ------------ |
|
145 |
+
| en/bactrianx_EN_fastchat.jsonl | 66985 |
|
146 |
+
| en/code_alpaca_fastchat.jsonl | 19990 |
|
147 |
+
| en/evol_instruct_143k_fastchat.jsonl | 142968 |
|
148 |
+
| en/evol_instruct_70k_fastchat.jsonl | 69968 |
|
149 |
+
| en/lmsys_chat_1m_high_quality_train_en_fastchat.jsonl | 18651 |
|
150 |
+
| en/open_orca_fastchat_aa.jsonl | 599968 |
|
151 |
+
| en/open_orca_fastchat_ab.jsonl | 599968 |
|
152 |
+
| en/open_orca_fastchat_ac.jsonl | 599968 |
|
153 |
+
| en/open_orca_fastchat_ad.jsonl | 599968 |
|
154 |
+
| en/open_orca_fastchat_ag.jsonl | 599968 |
|
155 |
+
| en/open_orca_fastchat_ah.jsonl | 33891 |
|
156 |
+
| en/sharegpt_v3_unfiltered_fastchat.jsonl | 93880 |
|
157 |
+
| en/ultrachat_200k_fastchat.jsonl | 11525 |
|
158 |
+
| **total** | **3457698** |
|
159 |
+
|
160 |
+
### German
|
161 |
+
|
162 |
+
| Dataset file | Sample Count |
|
163 |
+
| ----------------------------------------------------------- | ------------ |
|
164 |
+
| de/bactrianx_DE_fastchat.jsonl | 67017 |
|
165 |
+
| de/freedomintelligence_alpaca_gpt4_deutsch_fastchat.jsonl | 49969 |
|
166 |
+
| de/freedomintelligence_evol_instruct_deutsch_fastchat.jsonl | 59022 |
|
167 |
+
| de/freedomintelligence_sharegpt_deutsch_fastchat.jsonl | 6101 |
|
168 |
+
| de/german_poems_fastchat.jsonl | 400 |
|
169 |
+
| de/german_songs_fastchat.jsonl | 1000 |
|
170 |
+
| de/ultrachat_de_1k_fastchat.jsonl | 959 |
|
171 |
+
| **total** | **184468** |
|
172 |
+
|
173 |
+
### Training Procedure
|
174 |
+
|
175 |
+
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
|
176 |
+
Instruction fined tuned version of [Teuken-7B-base-v0.4](https://huggingface.co/openGPT-X/Teuken-7B-base-v0.4).
|
177 |
+
|
178 |
+
|
179 |
+
#### Training Hyperparameters
|
180 |
+
|
181 |
+
- **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, , bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
|
182 |
+
|
183 |
+
## Evaluation
|
184 |
+
|
185 |
+
<!-- This section describes the evaluation protocols and provides the results. -->
|
186 |
+
|
187 |
+
### Testing Data, Factors & Metrics
|
188 |
+
|
189 |
+
#### Testing Data
|
190 |
+
|
191 |
+
<!-- This should link to a Dataset Card if possible. -->
|
192 |
+
|
193 |
+
The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation and MMLU. Results can be seen in the [European LLM Leaderboard](https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard).
|
194 |
+
|
195 |
+
## Technical Specifications
|
196 |
+
|
197 |
+
### Model Architecture and Objective
|
198 |
+
|
199 |
+
| Hyper-Parameter | Value |
|
200 |
+
|----------------------------|----------|
|
201 |
+
| Training Objective | CLM |
|
202 |
+
| Activation Function | SwiGLU |
|
203 |
+
| Seq Length | 4096 |
|
204 |
+
| Position Embeddings | Rotary |
|
205 |
+
| Num Layers | 32 |
|
206 |
+
| Hidden Size | 4096 |
|
207 |
+
| FFN Hidden Size | 13440 |
|
208 |
+
| Num Attention Heads | 32 |
|
209 |
+
| Head Dim | 128 |
|
210 |
+
| Group Query Attention | yes |
|
211 |
+
| Num Query Groups | 2 |
|
212 |
+
| Normalization | RMSNorm |
|
213 |
+
| Learning rate | 3e-4 |
|
214 |
+
| Min learning rate | 3e-5 |
|
215 |
+
| Disable bias in linear | yes |
|
216 |
+
| Hidden dropout | 0.0 |
|
217 |
+
| Attention dropout | 0.0 |
|
218 |
+
| Optimizer | AdamW |
|
219 |
+
| Beta1 | 0.9 |
|
220 |
+
| Beta2 | 0.95 |
|
221 |
+
| Sequence-parallelism
|
222 |
+
| Data-type | bf16 |
|
223 |
+
| Recompute-activations | yes |
|
224 |
+
| Distributed-optimizers | yes |
|
225 |
+
| Model Initialization | |
|
226 |
+
|
227 |
+
### Compute Infrastructure
|
228 |
+
|
229 |
+
We trained our models on JUWELS Booster which consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs. The GPUs are hosted by AMD EPYC Rome CPUs. The compute nodes are connected with HDR-200 InfiniBand in a DragonFly+ topology.
|
230 |
+
|
231 |
+
#### Hardware
|
232 |
+
|
233 |
+
The configuration of JUWELS Booster compute nodes is the following:
|
234 |
+
|
235 |
+
CPU: AMD EPYC 7402 processor; 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 1 configuration
|
236 |
+
Memory: 512 GB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
|
237 |
+
GPU: 4 × NVIDIA A100 Tensor Core GPU with 40 GB; connected via NVLink3 to each other
|
238 |
+
Network: 4 × Mellanox HDR200 InfiniBand ConnectX 6 (200 Gbit/s each), HCA
|
239 |
+
Periphery: CPU, GPU, and network adapter are connected via 2 PCIe Gen 4 switches with 16 PCIe lanes going to each device (CPU socket: 2×16 lanes). PCIe switches are configured in synthetic mode.
|
240 |
+
#### Software
|
241 |
+
|
242 |
+
[Megatron-LM](https://github.com/OpenGPTX/Megatron-LM)
|
243 |
+
|
244 |
+
**BibTeX:**
|
245 |
+
|
246 |
+
If you find our model useful in your research, please consider citing our [preprint](https://arxiv.org/abs/2410.03730):
|
247 |
+
```
|
248 |
+
@misc{ali2024teuken7bbaseteuken7binstructeuropean,
|
249 |
+
title={Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs},
|
250 |
+
author={Mehdi Ali and Michael Fromm and Klaudia Thellmann and Jan Ebert and Alexander Arno Weber and Richard Rutmann and Charvi Jain and Max Lübbering and Daniel Steinigen and Johannes Leveling and Katrin Klug and Jasper Schulze Buschhoff and Lena Jurkschat and Hammam Abdelwahab and Benny Jörg Stein and Karl-Heinz Sylla and Pavel Denisov and Nicolo' Brandizzi and Qasid Saleem and Anirban Bhowmick and Lennard Helmer and Chelsea John and Pedro Ortiz Suarez and Malte Ostendorff and Alex Jude and Lalith Manjunath and Samuel Weinbach and Carolin Penke and Oleg Filatov and Shima Asaadi and Fabio Barth and Rafet Sifa and Fabian Küch and Andreas Herten and René Jäkel and Georg Rehm and Stefan Kesselheim and Joachim Köhler and Nicolas Flores-Herr},
|
251 |
+
year={2024},
|
252 |
+
eprint={2410.03730},
|
253 |
+
archivePrefix={arXiv},
|
254 |
+
primaryClass={cs.CL},
|
255 |
+
url={https://arxiv.org/abs/2410.03730},
|
256 |
+
}
|
257 |
+
```
|
258 |
+
|
259 |
+
# Team
|
260 |
+
## Data Team
|
261 |
+
Anirban Bhowmick (IAIS), Nicolo Brandizzi (IAIS), Lennard Helmer (IAIS), Benny Jörg Stein (IAIS), Karl-Heinz Sylla (IAIS), Pavel Denisov (IAIS), Qasid Saleem (IAIS), Johannes Leveling (IAIS), Hammam Abdelwahab (IAIS), Luzian Hahn (IIS), Farzad Naderi (IIS), Md Saiful Islam (IIS), Alexander Schwirjow (IIS), Pedro Ortiz Suarez (ex. DFKI), Malte Ostendorff (ex. DFKI)
|
262 |
+
## Model-Training Team
|
263 |
+
### Core contributors
|
264 |
+
Mehdi Ali (IAIS), Michael Fromm (IAIS), Jan Ebert (FZJ), Chelsea John (FZJ), Lena Jurkschat (TUD), Alexander Weber (IAIS)
|
265 |
+
### Contributors:
|
266 |
+
Richard Rutmann (IAIS), Daniel Steinigen (IAIS), Lalith Manjunath (TUD), Carolin Penke (FZJ)
|
267 |
+
## Evaluation Team
|
268 |
+
### Core contributors
|
269 |
+
Klaudia Thellmann (TUD), Alex Jude (IAIS), Jasper Buschhoff (IAIS)
|
270 |
+
### Contributors:
|
271 |
+
Shima Assadi (IIS), Fabio Barth (DFKI)
|
272 |
+
## Management
|
273 |
+
Joachim Köhler (IAIS), Nicolas Flores-Herr (IAIS), Stefan Kesselheim (FZJ), Andreas Herten (FZJ), Georg Rehm (DFKI), René Jäkel (TUD), Fabian Küch (IIS), Nicole Hildebrandt (IAIS), Ines Wendler (IAIS)
|
274 |
+
|
275 |
+
<div class="hf-card">
|
276 |
+
<h2>Contact Information</h2>
|
277 |
+
<p>You can reach out to the following model card contact:</p>
|
278 |
+
<ul>
|
279 |
+
<li>
|
280 |
+
<a href="https://huggingface.co/openGPT-X" target="_blank">OpenGPT-X</a>
|
281 |
+
- <a href="[email protected]">[email protected]</a>
|
282 |
+
</li>
|
283 |
+
</ul>
|
284 |
+
</div>
|
config.json
ADDED
@@ -0,0 +1,37 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"_name_or_path": "/home/Shared/align/models/teuken-instruct/7B_EU24_4T_EU24/train_run_24EU_teuken_commercial_EU24_lr_1e-5_2024_10_05_18_54_27/opengptx_checkpoints/checkpoint-405",
|
3 |
+
"architectures": [
|
4 |
+
"LlamaForCausalLM"
|
5 |
+
],
|
6 |
+
"attention_bias": false,
|
7 |
+
"attention_dropout": 0.0,
|
8 |
+
"auto_map": {
|
9 |
+
"AutoTokenizer": [
|
10 |
+
"gptx_tokenizer.SPTokenizer",
|
11 |
+
null
|
12 |
+
]
|
13 |
+
},
|
14 |
+
"bos_token_id": 1,
|
15 |
+
"eos_token_id": 2,
|
16 |
+
"hidden_act": "silu",
|
17 |
+
"hidden_size": 4096,
|
18 |
+
"initializer_range": 0.0158,
|
19 |
+
"intermediate_size": 13440,
|
20 |
+
"max_position_embeddings": 4096,
|
21 |
+
"mlp_bias": false,
|
22 |
+
"model_type": "llama",
|
23 |
+
"num_attention_heads": 32,
|
24 |
+
"num_hidden_layers": 32,
|
25 |
+
"num_key_value_heads": 2,
|
26 |
+
"pad_token_id": 3,
|
27 |
+
"pretraining_tp": 1,
|
28 |
+
"rms_norm_eps": 1e-05,
|
29 |
+
"rope_scaling": null,
|
30 |
+
"rope_theta": 10000.0,
|
31 |
+
"tie_word_embeddings": true,
|
32 |
+
"tokenizer_class": "SPTokenizer",
|
33 |
+
"torch_dtype": "bfloat16",
|
34 |
+
"transformers_version": "4.43.2",
|
35 |
+
"use_cache": false,
|
36 |
+
"vocab_size": 250680
|
37 |
+
}
|
generation_config.json
ADDED
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"_from_model_config": true,
|
3 |
+
"bos_token_id": 1,
|
4 |
+
"eos_token_id": 2,
|
5 |
+
"pad_token_id": 3,
|
6 |
+
"transformers_version": "4.43.2",
|
7 |
+
"use_cache": false
|
8 |
+
}
|
gptx_tokenizer.py
ADDED
@@ -0,0 +1,435 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
from __future__ import annotations
|
2 |
+
|
3 |
+
import json
|
4 |
+
import os
|
5 |
+
import warnings
|
6 |
+
from pathlib import Path
|
7 |
+
from typing import Any, Dict, List, Mapping, Optional, Tuple, Union
|
8 |
+
|
9 |
+
import sentencepiece as spm
|
10 |
+
from huggingface_hub import hf_hub_download, list_repo_files
|
11 |
+
from transformers.tokenization_utils import PreTrainedTokenizer
|
12 |
+
from transformers.tokenization_utils_base import TOKENIZER_CONFIG_FILE
|
13 |
+
|
14 |
+
# Define special tokens used in the tokenizer
|
15 |
+
EOD_TOKEN = "<eod>"
|
16 |
+
PAD_TOKEN = "<pad>"
|
17 |
+
BOS_TOKEN = "<s>"
|
18 |
+
EOS_TOKEN = "</s>"
|
19 |
+
UNK_TOKEN = "<unk>"
|
20 |
+
REPO_ID = "openGPT-X/Teuken-7B-instruct-commercial-v0.4"
|
21 |
+
|
22 |
+
class HFGPTXTokenizer(PreTrainedTokenizer):
|
23 |
+
"""
|
24 |
+
A custom tokenizer class that extends Hugging Face's PreTrainedTokenizer.
|
25 |
+
It is specifically designed to work with SentencePiece models and integrates
|
26 |
+
with Hugging Face's tokenizer utilities.
|
27 |
+
"""
|
28 |
+
|
29 |
+
model_file_glob = "*tokenizer.json"
|
30 |
+
vocab_files_names = {"tokenizer_file": "tokenizer.json"}
|
31 |
+
decode_kwargs: List[str] = []
|
32 |
+
|
33 |
+
def _encode(self, text: str, return_tokens: bool = False, is_continuation: bool = False):
|
34 |
+
"""
|
35 |
+
Encode a given text using the tokenizer.
|
36 |
+
|
37 |
+
Args:
|
38 |
+
text (str): The text to encode.
|
39 |
+
return_tokens (bool): If True, returns token strings instead of token IDs.
|
40 |
+
is_continuation (bool): If True, uses a continuation tokenizer (if available).
|
41 |
+
Returns:
|
42 |
+
List[int] or List[str]: Encoded text as a list of token IDs or token strings.
|
43 |
+
"""
|
44 |
+
assert self.tok is not None, "No tokenizer is currently loaded"
|
45 |
+
|
46 |
+
# Variant with additional sp processor:
|
47 |
+
tokenizer = self.continuation_tokenizer if is_continuation else self.tok
|
48 |
+
|
49 |
+
if return_tokens:
|
50 |
+
return tokenizer.encode_as_pieces(text)
|
51 |
+
else:
|
52 |
+
return tokenizer.encode(text)
|
53 |
+
|
54 |
+
def create_list_of_special_tokens(self) -> List[str]:
|
55 |
+
"""
|
56 |
+
Create a list of special tokens, including the BOS, EOS, PAD, EOD tokens,
|
57 |
+
and 256 additional placeholder tokens.
|
58 |
+
Returns:
|
59 |
+
List[str]: List of special tokens.
|
60 |
+
"""
|
61 |
+
return [self.bos_token, self.eos_token, self.pad_token, self.eod_token] + [
|
62 |
+
f"<placeholder_tok_{i}>" for i in range(256)
|
63 |
+
]
|
64 |
+
|
65 |
+
def find_tokenizer_config(self, config_path: Path, repo_id: str = None) -> Path:
|
66 |
+
if repo_id is None:
|
67 |
+
raise ValueError("repo_id must be provided if config_path is not a local file")
|
68 |
+
|
69 |
+
try:
|
70 |
+
# List all files in the repo
|
71 |
+
repo_files = list_repo_files(repo_id)
|
72 |
+
|
73 |
+
# Find the tokenizer config file
|
74 |
+
tokenizer_files = [f for f in repo_files if f.endswith('tokenizer_config.json')]
|
75 |
+
if not tokenizer_files:
|
76 |
+
raise FileNotFoundError(f"No tokenizer_config.json file found in repository {repo_id}")
|
77 |
+
|
78 |
+
# Use the first tokenizer_config.json file found
|
79 |
+
tokenizer_config_file = tokenizer_files[0]
|
80 |
+
print(f"Found tokenizer config file: {tokenizer_config_file}")
|
81 |
+
|
82 |
+
# Download the file
|
83 |
+
tokenizer_config_file_or_name = hf_hub_download(repo_id=repo_id, filename=tokenizer_config_file)
|
84 |
+
print(f"Downloaded tokenizer config file to: {tokenizer_config_file_or_name}")
|
85 |
+
return tokenizer_config_file_or_name
|
86 |
+
except Exception as e:
|
87 |
+
raise OSError(f"Failed to download tokenizer model: {str(e)}")
|
88 |
+
|
89 |
+
def instantiate_from_file_or_name(self, model_file_or_name: str, repo_id: str = None):
|
90 |
+
"""
|
91 |
+
Load the tokenizer model from a file or download it from a repository.
|
92 |
+
Args:
|
93 |
+
model_file_or_name (str): Path to the model file or the model name.
|
94 |
+
repo_id (str, optional): Repository ID from which to download the model file.
|
95 |
+
Returns:
|
96 |
+
spm.SentencePieceProcessor: Loaded SentencePieceProcessor instance.
|
97 |
+
Raises:
|
98 |
+
ValueError: If repo_id is not provided when model_file_or_name is not a file.
|
99 |
+
OSError: If the model file cannot be loaded or downloaded.
|
100 |
+
"""
|
101 |
+
if not os.path.isfile(model_file_or_name):
|
102 |
+
if repo_id is None:
|
103 |
+
raise ValueError("repo_id must be provided if model_file_or_name is not a local file")
|
104 |
+
|
105 |
+
try:
|
106 |
+
# List all files in the repo
|
107 |
+
repo_files = list_repo_files(repo_id)
|
108 |
+
|
109 |
+
# Find the tokenizer model file
|
110 |
+
tokenizer_files = [f for f in repo_files if f.endswith('.model')]
|
111 |
+
if not tokenizer_files:
|
112 |
+
raise FileNotFoundError(f"No .model file found in repository {repo_id}")
|
113 |
+
|
114 |
+
# Use the first .model file found
|
115 |
+
model_file = tokenizer_files[0]
|
116 |
+
print(f"Found tokenizer model file: {model_file}")
|
117 |
+
|
118 |
+
# Download the file
|
119 |
+
model_file_or_name = hf_hub_download(repo_id=repo_id, filename=model_file)
|
120 |
+
print(f"Downloaded tokenizer model to: {model_file_or_name}")
|
121 |
+
except Exception as e:
|
122 |
+
raise OSError(f"Failed to download tokenizer model: {str(e)}")
|
123 |
+
|
124 |
+
try:
|
125 |
+
return spm.SentencePieceProcessor(model_file=model_file_or_name)
|
126 |
+
except Exception as e:
|
127 |
+
raise OSError(f"Failed to load tokenizer model: {str(e)}")
|
128 |
+
|
129 |
+
def __init__(
|
130 |
+
self,
|
131 |
+
model_path: Optional[str] = None,
|
132 |
+
config_path: Optional[str] = None,
|
133 |
+
**kwargs: Any,
|
134 |
+
) -> None:
|
135 |
+
"""
|
136 |
+
Initialize the tokenizer.
|
137 |
+
Args:
|
138 |
+
model_path (Optional[str]): Path to the tokenizer model file.
|
139 |
+
config_path (Optional[str]): Path to the tokenizer configuration file.
|
140 |
+
**kwargs: Additional keyword arguments passed to the superclass.
|
141 |
+
This method also ensures backward compatibility by setting
|
142 |
+
`clean_up_tokenization_spaces` to False by default.
|
143 |
+
"""
|
144 |
+
# Prevent cleanup of tokenization spaces to maintain backward compatibility
|
145 |
+
self.clean_up_tokenization_spaces = kwargs.setdefault("clean_up_tokenization_spaces", False)
|
146 |
+
self.vocab = None
|
147 |
+
cp_path = kwargs.get("name_or_path", ".")
|
148 |
+
if model_path is None:
|
149 |
+
model_path = str(Path(cp_path) / self.vocab_files_names["tokenizer_file"])
|
150 |
+
self.tok = self.instantiate_from_file_or_name(model_path, repo_id=REPO_ID)
|
151 |
+
|
152 |
+
super().__init__(**kwargs)
|
153 |
+
|
154 |
+
# Specify special tokens which we know the value of.
|
155 |
+
# EOD from `tok` is used as what is called EOS in HuggingFace.
|
156 |
+
# Since there is no corresponding mapping for EOS from `tok` in
|
157 |
+
# HuggingFace, it is treated as an additional special token.
|
158 |
+
# Same for all other special tokens.
|
159 |
+
self.eos_token = EOD_TOKEN
|
160 |
+
self.bos_token = BOS_TOKEN
|
161 |
+
self.pad_token = PAD_TOKEN
|
162 |
+
|
163 |
+
if not self.additional_special_tokens:
|
164 |
+
self.additional_special_tokens = [
|
165 |
+
token
|
166 |
+
for token in self.create_list_of_special_tokens()
|
167 |
+
# Filter out the special tokens we added manually.
|
168 |
+
if token
|
169 |
+
not in [
|
170 |
+
self.eos_token,
|
171 |
+
self.bos_token,
|
172 |
+
self.pad_token,
|
173 |
+
]
|
174 |
+
]
|
175 |
+
if config_path is None:
|
176 |
+
config_path = str(Path(cp_path) / TOKENIZER_CONFIG_FILE)
|
177 |
+
|
178 |
+
if os.path.isfile(config_path):
|
179 |
+
self.tokenizer_config = self.load_json(Path(config_path))
|
180 |
+
else: # Load from repo
|
181 |
+
self.tokenizer_config = self.load_json(Path(self.find_tokenizer_config(Path(config_path), repo_id=REPO_ID)))
|
182 |
+
|
183 |
+
@property
|
184 |
+
def vocab_size(self) -> int:
|
185 |
+
"""
|
186 |
+
Get the size of the tokenizer vocabulary.
|
187 |
+
Returns:
|
188 |
+
int: The size of the vocabulary.
|
189 |
+
"""
|
190 |
+
return self.tok.GetPieceSize()
|
191 |
+
|
192 |
+
def get_vocab(self) -> Dict[str, int]:
|
193 |
+
"""
|
194 |
+
Get the vocabulary as a dictionary mapping token strings to their IDs.
|
195 |
+
Returns:
|
196 |
+
Dict[str, int]: Vocabulary mapping.
|
197 |
+
"""
|
198 |
+
if self.vocab is None:
|
199 |
+
self.vocab = {self.tok.IdToPiece(i): i for i in range(self.vocab_size)}
|
200 |
+
return self.vocab
|
201 |
+
|
202 |
+
def _tokenize(self, text: str, **kwargs) -> List[int]:
|
203 |
+
"""
|
204 |
+
Tokenize the input text.
|
205 |
+
Args:
|
206 |
+
text (str): Text to tokenize.
|
207 |
+
**kwargs: Additional keyword arguments.
|
208 |
+
Returns:
|
209 |
+
List[int]: List of token IDs.
|
210 |
+
"""
|
211 |
+
return_tokens = kwargs.pop("return_tokens", True)
|
212 |
+
return self._encode(text, return_tokens=return_tokens, **kwargs)
|
213 |
+
|
214 |
+
def _convert_token_to_id(self, token: str) -> int:
|
215 |
+
"""
|
216 |
+
Convert a token string to its corresponding ID.
|
217 |
+
Args:
|
218 |
+
token (str): The token to convert.
|
219 |
+
Returns:
|
220 |
+
int: The token's ID.
|
221 |
+
Raises:
|
222 |
+
ValueError: If the token is unknown and cannot be encoded to a single ID.
|
223 |
+
"""
|
224 |
+
return self.tok.PieceToId(token)
|
225 |
+
|
226 |
+
|
227 |
+
def decode(
|
228 |
+
self,
|
229 |
+
token_ids: Union[List[int], List[List[int]]],
|
230 |
+
num_threads: Optional[int] = None,
|
231 |
+
) -> str:
|
232 |
+
"""
|
233 |
+
Decode a list of token IDs into a string.
|
234 |
+
Args:
|
235 |
+
token_ids (Union[List[int], List[List[int]]]): List of token IDs or lists of token IDs.
|
236 |
+
num_threads (Optional[int]): Number of threads to use for decoding.
|
237 |
+
Returns:
|
238 |
+
str: Decoded string.
|
239 |
+
"""
|
240 |
+
return self.tok.decode(input=token_ids, num_threads=num_threads)
|
241 |
+
|
242 |
+
def _convert_id_to_token(self, index: int) -> str:
|
243 |
+
"""
|
244 |
+
Convert a token ID to its corresponding token string.
|
245 |
+
Args:
|
246 |
+
index (int): Token ID.
|
247 |
+
Returns:
|
248 |
+
str: Corresponding token string.
|
249 |
+
"""
|
250 |
+
return self.tok.IdToPiece(index)
|
251 |
+
|
252 |
+
def convert_tokens_to_string(self, tokens: List[str]) -> str:
|
253 |
+
"""
|
254 |
+
Convert a list of tokens into a single string.
|
255 |
+
Args:
|
256 |
+
tokens (List[str]): List of token strings.
|
257 |
+
Returns:
|
258 |
+
str: Concatenated string of tokens.
|
259 |
+
"""
|
260 |
+
return self.tok.DecodePieces(tokens)
|
261 |
+
|
262 |
+
def _tok_decode(self, token_ids: List[int], **kwargs: Any) -> str:
|
263 |
+
"""
|
264 |
+
Internal method to decode token IDs with additional arguments.
|
265 |
+
Args:
|
266 |
+
token_ids (List[int]): List of token IDs.
|
267 |
+
**kwargs: Additional arguments to pass to the decode method.
|
268 |
+
Returns:
|
269 |
+
str: Decoded string.
|
270 |
+
This method also issues a warning if unsupported arguments are provided.
|
271 |
+
"""
|
272 |
+
passed_kwargs = {key: value for (key, value) in kwargs.items() if key in self.decode_kwargs}
|
273 |
+
if len(passed_kwargs) != len(kwargs):
|
274 |
+
warnings.warn("silently ignoring some arguments to `decode` due to missing " "support from the tokenizer.")
|
275 |
+
text = self.decode(token_ids, **passed_kwargs)
|
276 |
+
return text
|
277 |
+
|
278 |
+
def save_tokenizer(self, save_dir: str) -> None:
|
279 |
+
if not os.path.isdir(save_dir):
|
280 |
+
print(f"Vocabulary path ({save_dir}) should be a directory")
|
281 |
+
return
|
282 |
+
out_vocab_file = os.path.join(save_dir, "tokenizer.model")
|
283 |
+
|
284 |
+
# if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
|
285 |
+
# copyfile(self.vocab_file, out_vocab_file)
|
286 |
+
# elif not os.path.isfile(self.vocab_file):
|
287 |
+
with open(out_vocab_file, "wb") as f:
|
288 |
+
content_spiece_model = self.tok.serialized_model_proto()
|
289 |
+
f.write(content_spiece_model)
|
290 |
+
|
291 |
+
return (out_vocab_file,)
|
292 |
+
|
293 |
+
def _decode(
|
294 |
+
self,
|
295 |
+
token_ids: List[int],
|
296 |
+
skip_special_tokens: bool = False,
|
297 |
+
clean_up_tokenization_spaces: bool = None,
|
298 |
+
spaces_between_special_tokens: bool = True,
|
299 |
+
**kwargs: Any,
|
300 |
+
) -> str:
|
301 |
+
text = self._tok_decode(
|
302 |
+
token_ids,
|
303 |
+
skip_special_tokens=skip_special_tokens,
|
304 |
+
spaces_between_special_tokens=spaces_between_special_tokens,
|
305 |
+
**kwargs,
|
306 |
+
)
|
307 |
+
|
308 |
+
clean_up_tokenization_spaces = (
|
309 |
+
clean_up_tokenization_spaces
|
310 |
+
if clean_up_tokenization_spaces is not None
|
311 |
+
else self.clean_up_tokenization_spaces
|
312 |
+
)
|
313 |
+
if clean_up_tokenization_spaces:
|
314 |
+
warnings.warn(
|
315 |
+
"when cleaning up tokenization spaces, this will not behave "
|
316 |
+
"like the original `GPTXTokenizer`., Please supply "
|
317 |
+
"`clean_up_tokenization_spaces=False` for decoding."
|
318 |
+
)
|
319 |
+
clean_text = self.clean_up_tokenization(text)
|
320 |
+
return clean_text
|
321 |
+
else:
|
322 |
+
return text
|
323 |
+
|
324 |
+
def save_vocabulary(
|
325 |
+
self,
|
326 |
+
save_directory: str,
|
327 |
+
filename_prefix: Optional[str] = None,
|
328 |
+
) -> Tuple[str]:
|
329 |
+
filename_prefix = filename_prefix + "-" if filename_prefix else ""
|
330 |
+
save_directory = Path(save_directory)
|
331 |
+
|
332 |
+
self._save_tokenizer_config(save_directory, filename_prefix)
|
333 |
+
tokenizer_file_path = self._save_tokenizer(save_directory, filename_prefix)
|
334 |
+
|
335 |
+
return (tokenizer_file_path,)
|
336 |
+
|
337 |
+
def _save_tokenizer_config(
|
338 |
+
self,
|
339 |
+
save_directory: Path,
|
340 |
+
filename_prefix: str,
|
341 |
+
) -> str:
|
342 |
+
self.save_tokenizer_config(save_directory)
|
343 |
+
old_tokenizer_config_path = save_directory / TOKENIZER_CONFIG_FILE
|
344 |
+
assert old_tokenizer_config_path.is_file(), "tokenizer config path changed"
|
345 |
+
new_tokenizer_config_path = save_directory / (filename_prefix + old_tokenizer_config_path.name)
|
346 |
+
old_tokenizer_config_path.replace(new_tokenizer_config_path)
|
347 |
+
return str(new_tokenizer_config_path)
|
348 |
+
|
349 |
+
def _find_tokenizer_files(self, save_directory: Path) -> List[Path]:
|
350 |
+
files = list(Path(save_directory).glob(self.model_file_glob))
|
351 |
+
return files
|
352 |
+
|
353 |
+
def _get_tokenizer_file(self, files: List[Path]):
|
354 |
+
assert files, "no saved tokenizer file found"
|
355 |
+
assert len(files) <= 1, "cannot handle multiple saved tokenizer files"
|
356 |
+
return files[0]
|
357 |
+
|
358 |
+
def _save_tokenizer(
|
359 |
+
self,
|
360 |
+
save_directory: Path,
|
361 |
+
filename_prefix: str,
|
362 |
+
) -> str:
|
363 |
+
self.save_tokenizer(str(save_directory))
|
364 |
+
tokenizer_files = self._find_tokenizer_files(save_directory)
|
365 |
+
old_tokenizer_file_path = self._get_tokenizer_file(tokenizer_files)
|
366 |
+
assert old_tokenizer_file_path.is_file(), "could not access saved tokenizer file"
|
367 |
+
new_tokenizer_file_path = save_directory / (filename_prefix + self.vocab_files_names["tokenizer_file"])
|
368 |
+
old_tokenizer_file_path.replace(new_tokenizer_file_path)
|
369 |
+
return str(new_tokenizer_file_path)
|
370 |
+
|
371 |
+
def save_tokenizer_config(self, save_dir: Path) -> None:
|
372 |
+
# convert Path to str
|
373 |
+
for k in self.tokenizer_config:
|
374 |
+
if isinstance(self.tokenizer_config[k], Path):
|
375 |
+
self.tokenizer_config[k] = str(self.tokenizer_config[k])
|
376 |
+
|
377 |
+
info_file = save_dir / "tokenizer_config.json"
|
378 |
+
with info_file.open("w") as f:
|
379 |
+
json.dump(self.tokenizer_config, f, indent=4)
|
380 |
+
|
381 |
+
def load_json(self, path: Path) -> dict:
|
382 |
+
with path.open("r") as f:
|
383 |
+
return json.load(f)
|
384 |
+
|
385 |
+
class SPTokenizer(HFGPTXTokenizer):
|
386 |
+
model_file_glob = "*tokenizer.model"
|
387 |
+
vocab_files_names = {"tokenizer_file": "tokenizer.model"}
|
388 |
+
decode_kwargs = ["num_threads"]
|
389 |
+
# `is_continuation` does not work without this, but it doesn't
|
390 |
+
# implement all APIs of `PreTrainedTokenizer`.
|
391 |
+
def encode(self, text: str, **kwargs) -> List[int]:
|
392 |
+
return_tokens = kwargs.pop('return_tokens', False)
|
393 |
+
is_continuation = kwargs.pop('is_continuation', False)
|
394 |
+
return self._encode(
|
395 |
+
text,
|
396 |
+
return_tokens=return_tokens,
|
397 |
+
is_continuation=is_continuation,
|
398 |
+
)
|
399 |
+
|
400 |
+
def __init__(self, *args, **kwargs):
|
401 |
+
super().__init__(*args, **kwargs)
|
402 |
+
|
403 |
+
self.eos_token = "</s>"
|
404 |
+
self.eos_token_id = 2
|
405 |
+
self.system_messages_by_lang = { # translations by deepl / google translate
|
406 |
+
"BG": "Чат между човек и асистент с изкуствен интелект. Асистентът дава полезни и учтиви отговори на въпросите на човека.", # noqa
|
407 |
+
"CS": "Chat mezi člověkem a asistentem s umělou inteligencí. Asistent poskytuje vstřícné a zdvořilé odpovědi na otázky člověka.", # noqa
|
408 |
+
"DA": "En chat mellem et menneske og en assistent med kunstig intelligens, som giver hjælpsomme og høflige svar på menneskets spørgsmål.", # noqa
|
409 |
+
"DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz. Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.", # noqa
|
410 |
+
"EL": "Μια συνομιλία μεταξύ ενός ανθρώπου και ενός βοηθού τεχνητής νοημοσύνης. Ο βοηθός δίνει χρήσιμες και ευγενικές απαντήσεις στις ερωτήσεις του ανθρώπου.", # noqa
|
411 |
+
"EN": "A chat between a human and an artificial intelligence assistant.The assistant gives helpful and polite answers to the human's questions.", # noqa
|
412 |
+
"ES": "Una conversación entre un humano y un asistente de inteligencia artificial. El asistente da respuestas útiles y amables a las preguntas del humano.", # noqa
|
413 |
+
"ET": "Inimese ja tehisintellekti assistendi vaheline vestlus. Assistent annab inimese küsimustele abivalmis ja viisakaid vastuseid.", # noqa
|
414 |
+
"FI": "Ihmisen ja tekoälyavustajan välinen keskustelu. Avustaja antaa avuliaita ja kohteliaita vastauksia ihmisen kysymyksiin.", # noqa
|
415 |
+
"FR": "Conversation entre un humain et un assistant doté d'une intelligence artificielle. L'assistant donne des réponses utiles et polies aux questions de l'homme.", # noqa
|
416 |
+
"GA": "Comhrá idir duine agus cúntóir hintleachta saorga. Tugann an cúntóir freagraí cabhracha dea-bhéasacha ar cheisteanna an duine.", # noqa
|
417 |
+
"HR": "Razgovor između čovjeka i pomoćnika umjetne inteligencije. Pomoćnik daje korisne i ljubazne odgovore na ljudska pitanja.", # noqa
|
418 |
+
"HU": "Egy ember és egy mesterséges intelligencia asszisztens közötti beszélgetés. Az asszisztens segítőkész és udvarias válaszokat ad az ember kérdéseire.", # noqa
|
419 |
+
"IT": "Una chat tra un umano e un assistente di intelligenza artificiale. L'assistente fornisce risposte utili ed educate alle domande dell'uomo.", # noqa
|
420 |
+
"LT": "Žmogaus ir dirbtinio intelekto asistento pokalbis. Asistentas naudingai ir mandagiai atsako į žmogaus klausimus.", # noqa
|
421 |
+
"LV": "Cilvēka un mākslīgā intelekta asistenta tērzēšana. Asistents sniedz noderīgas un pieklājīgas atbildes uz cilvēka jautājumiem.", # noqa
|
422 |
+
"MT": "Chat bejn bniedem u assistent ta' intelliġenza artifiċjali. L-assistent jagħti tweġibiet ta' għajnuna u edukat għall-mistoqsijiet tal-bniedem.", # noqa
|
423 |
+
"NL": "Een chat tussen een mens en een assistent met kunstmatige intelligentie. De assistent geeft behulpzame en beleefde antwoorden op de vragen van de mens.", # noqa
|
424 |
+
"PL": "Czat między człowiekiem a asystentem sztucznej inteligencji. Asystent udziela pomocnych i uprzejmych odpowiedzi na pytania człowieka.", # noqa
|
425 |
+
"PT": "Uma conversa entre um ser humano e um assistente de inteligência artificial. O assistente dá respostas úteis e educadas às perguntas do utilizador.", # noqa
|
426 |
+
"RO": "O conversație între un om și un asistent cu inteligență artificială. Asistentul oferă răspunsuri utile și politicoase la întrebările omului.", # noqa
|
427 |
+
"SK": "Rozhovor medzi človekom a asistentom s umelou inteligenciou. Asistent poskytuje užitočné a zdvorilé odpovede na otázky človeka.", # noqa
|
428 |
+
"SL": "Pogovor med človekom in pomočnikom z umetno inteligenco. Pomočnik človeku prijazno in vljudno odgovarja na njegova vprašanja.", # noqa
|
429 |
+
"SV": "En chatt mellan en människa och en assistent med artificiell intelligens. Assistenten ger hjälpsamma och artiga svar på människans frågor.", # noqa
|
430 |
+
}
|
431 |
+
chat_template = "{%- for message in messages %}\n{%- if (message['role']|lower == 'user') != (loop.index0 % 2 == 0) %}\n{{- raise_exception('Roles must alternate User/Assistant/User/Assistant/...') }}\n{%- endif %}\n{%-if message['role']|lower == 'user' %}\n{{- message['role']|capitalize + ': ' + message['content'] + '\\n' }}\n{%- elif message['role']|lower == 'assistant' %}\n{{- message['role']|capitalize + ': ' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- raise_exception('Only user and assistant roles are supported!') }}\n {%- endif %}\n{%- endfor %}{%-if add_generation_prompt %}\n{{- 'Assistant: '}}\n{%- endif %}\n"
|
432 |
+
self.chat_template = {
|
433 |
+
lang: f"System: {sys_msg}" + "{{- '\\n'}}\n" + chat_template
|
434 |
+
for lang, sys_msg in self.system_messages_by_lang.items()
|
435 |
+
}
|
model-00001-of-00004.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:a1bcc58dc5992ab284973433164ec006354ef9fd724a47129b15cccf570fa2eb
|
3 |
+
size 4936228560
|
model-00002-of-00004.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:a0691e24daa22758c7a33ec2e2f5bb711c8a62a04f81048f983a17ed50ae51a4
|
3 |
+
size 4929565048
|
model-00003-of-00004.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:75f4f294ce64e3ce09051f54f216ae29482c8f4c707c8c8707a1f9e58844faa1
|
3 |
+
size 4929565072
|
model-00004-of-00004.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:f4567e208eb5158297dc5809d5ac3305bd6fddeaceb8b11fedb37658204b32fc
|
3 |
+
size 110125512
|
model.safetensors.index.json
ADDED
@@ -0,0 +1,297 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"metadata": {
|
3 |
+
"total_size": 14905450496
|
4 |
+
},
|
5 |
+
"weight_map": {
|
6 |
+
"model.embed_tokens.weight": "model-00001-of-00004.safetensors",
|
7 |
+
"model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
|
8 |
+
"model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
|
9 |
+
"model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
|
10 |
+
"model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
|
11 |
+
"model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
|
12 |
+
"model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
|
13 |
+
"model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
|
14 |
+
"model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
|
15 |
+
"model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
|
16 |
+
"model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
|
17 |
+
"model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
|
18 |
+
"model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
|
19 |
+
"model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
|
20 |
+
"model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
|
21 |
+
"model.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
|
22 |
+
"model.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
|
23 |
+
"model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
|
24 |
+
"model.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
|
25 |
+
"model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
26 |
+
"model.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
27 |
+
"model.layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
28 |
+
"model.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
29 |
+
"model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
30 |
+
"model.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
31 |
+
"model.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
32 |
+
"model.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
33 |
+
"model.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
34 |
+
"model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
35 |
+
"model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
36 |
+
"model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
37 |
+
"model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
38 |
+
"model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
39 |
+
"model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
40 |
+
"model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
41 |
+
"model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
42 |
+
"model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
43 |
+
"model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
44 |
+
"model.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
45 |
+
"model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
46 |
+
"model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
47 |
+
"model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
48 |
+
"model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
49 |
+
"model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
50 |
+
"model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
51 |
+
"model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
52 |
+
"model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
53 |
+
"model.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
54 |
+
"model.layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
55 |
+
"model.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
56 |
+
"model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
57 |
+
"model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
58 |
+
"model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
59 |
+
"model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
60 |
+
"model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
61 |
+
"model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
62 |
+
"model.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
63 |
+
"model.layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
64 |
+
"model.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
65 |
+
"model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
66 |
+
"model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
67 |
+
"model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
68 |
+
"model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
69 |
+
"model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
70 |
+
"model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
71 |
+
"model.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
72 |
+
"model.layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
73 |
+
"model.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
74 |
+
"model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
75 |
+
"model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
76 |
+
"model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
77 |
+
"model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
78 |
+
"model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
79 |
+
"model.layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
80 |
+
"model.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
81 |
+
"model.layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
82 |
+
"model.layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
83 |
+
"model.layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
84 |
+
"model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
85 |
+
"model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
86 |
+
"model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
87 |
+
"model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
88 |
+
"model.layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
89 |
+
"model.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
90 |
+
"model.layers.17.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
91 |
+
"model.layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
92 |
+
"model.layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
93 |
+
"model.layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
94 |
+
"model.layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
95 |
+
"model.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
96 |
+
"model.layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
97 |
+
"model.layers.18.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
98 |
+
"model.layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
99 |
+
"model.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
100 |
+
"model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
101 |
+
"model.layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
102 |
+
"model.layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
103 |
+
"model.layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
104 |
+
"model.layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
105 |
+
"model.layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
106 |
+
"model.layers.19.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
107 |
+
"model.layers.19.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
108 |
+
"model.layers.19.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
109 |
+
"model.layers.19.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
110 |
+
"model.layers.19.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
111 |
+
"model.layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
112 |
+
"model.layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
113 |
+
"model.layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
114 |
+
"model.layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
115 |
+
"model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
|
116 |
+
"model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
|
117 |
+
"model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
|
118 |
+
"model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
|
119 |
+
"model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
|
120 |
+
"model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
|
121 |
+
"model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
|
122 |
+
"model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
|
123 |
+
"model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
|
124 |
+
"model.layers.20.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
125 |
+
"model.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
126 |
+
"model.layers.20.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
127 |
+
"model.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
128 |
+
"model.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
129 |
+
"model.layers.20.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
130 |
+
"model.layers.20.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
131 |
+
"model.layers.20.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
132 |
+
"model.layers.20.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
133 |
+
"model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
134 |
+
"model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
135 |
+
"model.layers.21.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
136 |
+
"model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
137 |
+
"model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
138 |
+
"model.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
139 |
+
"model.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
140 |
+
"model.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
141 |
+
"model.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
142 |
+
"model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
143 |
+
"model.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
144 |
+
"model.layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
145 |
+
"model.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
146 |
+
"model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
147 |
+
"model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
148 |
+
"model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
149 |
+
"model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
150 |
+
"model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
151 |
+
"model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
152 |
+
"model.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
153 |
+
"model.layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
154 |
+
"model.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
155 |
+
"model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
156 |
+
"model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
157 |
+
"model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
158 |
+
"model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
159 |
+
"model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
160 |
+
"model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
161 |
+
"model.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
162 |
+
"model.layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
163 |
+
"model.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
164 |
+
"model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
165 |
+
"model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
166 |
+
"model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
167 |
+
"model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
168 |
+
"model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
169 |
+
"model.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
170 |
+
"model.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
171 |
+
"model.layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
172 |
+
"model.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
173 |
+
"model.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
174 |
+
"model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
175 |
+
"model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
176 |
+
"model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
177 |
+
"model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
178 |
+
"model.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
179 |
+
"model.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
180 |
+
"model.layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
181 |
+
"model.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
182 |
+
"model.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
183 |
+
"model.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
184 |
+
"model.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
185 |
+
"model.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
186 |
+
"model.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
187 |
+
"model.layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
188 |
+
"model.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
189 |
+
"model.layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
190 |
+
"model.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
191 |
+
"model.layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
192 |
+
"model.layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
193 |
+
"model.layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
194 |
+
"model.layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
195 |
+
"model.layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
196 |
+
"model.layers.28.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
197 |
+
"model.layers.28.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
198 |
+
"model.layers.28.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
199 |
+
"model.layers.28.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
200 |
+
"model.layers.28.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
201 |
+
"model.layers.28.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
202 |
+
"model.layers.28.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
203 |
+
"model.layers.28.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
204 |
+
"model.layers.28.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
205 |
+
"model.layers.29.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
206 |
+
"model.layers.29.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
207 |
+
"model.layers.29.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
208 |
+
"model.layers.29.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
209 |
+
"model.layers.29.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
210 |
+
"model.layers.29.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
211 |
+
"model.layers.29.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
212 |
+
"model.layers.29.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
213 |
+
"model.layers.29.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
214 |
+
"model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
|
215 |
+
"model.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
|
216 |
+
"model.layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
|
217 |
+
"model.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
|
218 |
+
"model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
|
219 |
+
"model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
|
220 |
+
"model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
|
221 |
+
"model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
|
222 |
+
"model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
|
223 |
+
"model.layers.30.input_layernorm.weight": "model-00003-of-00004.safetensors",
|
224 |
+
"model.layers.30.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
|
225 |
+
"model.layers.30.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
226 |
+
"model.layers.30.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
227 |
+
"model.layers.30.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
|
228 |
+
"model.layers.30.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
229 |
+
"model.layers.30.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
230 |
+
"model.layers.30.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
231 |
+
"model.layers.30.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
232 |
+
"model.layers.31.input_layernorm.weight": "model-00004-of-00004.safetensors",
|
233 |
+
"model.layers.31.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
|
234 |
+
"model.layers.31.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
|
235 |
+
"model.layers.31.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
|
236 |
+
"model.layers.31.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
|
237 |
+
"model.layers.31.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
|
238 |
+
"model.layers.31.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
|
239 |
+
"model.layers.31.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
|
240 |
+
"model.layers.31.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
|
241 |
+
"model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
|
242 |
+
"model.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
|
243 |
+
"model.layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
|
244 |
+
"model.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
|
245 |
+
"model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
|
246 |
+
"model.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
|
247 |
+
"model.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
|
248 |
+
"model.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
|
249 |
+
"model.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
|
250 |
+
"model.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
|
251 |
+
"model.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
|
252 |
+
"model.layers.5.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
|
253 |
+
"model.layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
|
254 |
+
"model.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
|
255 |
+
"model.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
|
256 |
+
"model.layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
|
257 |
+
"model.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
|
258 |
+
"model.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
|
259 |
+
"model.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
|
260 |
+
"model.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
|
261 |
+
"model.layers.6.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
|
262 |
+
"model.layers.6.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
|
263 |
+
"model.layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
|
264 |
+
"model.layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
|
265 |
+
"model.layers.6.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
|
266 |
+
"model.layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
|
267 |
+
"model.layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
|
268 |
+
"model.layers.7.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
269 |
+
"model.layers.7.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
270 |
+
"model.layers.7.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
271 |
+
"model.layers.7.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
272 |
+
"model.layers.7.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
273 |
+
"model.layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
|
274 |
+
"model.layers.7.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
|
275 |
+
"model.layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
|
276 |
+
"model.layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
|
277 |
+
"model.layers.8.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
278 |
+
"model.layers.8.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
279 |
+
"model.layers.8.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
280 |
+
"model.layers.8.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
281 |
+
"model.layers.8.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
282 |
+
"model.layers.8.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
283 |
+
"model.layers.8.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
284 |
+
"model.layers.8.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
285 |
+
"model.layers.8.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
286 |
+
"model.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
|
287 |
+
"model.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
|
288 |
+
"model.layers.9.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
|
289 |
+
"model.layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
|
290 |
+
"model.layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
|
291 |
+
"model.layers.9.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
|
292 |
+
"model.layers.9.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
|
293 |
+
"model.layers.9.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
|
294 |
+
"model.layers.9.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
|
295 |
+
"model.norm.weight": "model-00004-of-00004.safetensors"
|
296 |
+
}
|
297 |
+
}
|
special_tokens_map.json
ADDED
@@ -0,0 +1,264 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"additional_special_tokens": [
|
3 |
+
"</s>",
|
4 |
+
"<placeholder_tok_0>",
|
5 |
+
"<placeholder_tok_1>",
|
6 |
+
"<placeholder_tok_2>",
|
7 |
+
"<placeholder_tok_3>",
|
8 |
+
"<placeholder_tok_4>",
|
9 |
+
"<placeholder_tok_5>",
|
10 |
+
"<placeholder_tok_6>",
|
11 |
+
"<placeholder_tok_7>",
|
12 |
+
"<placeholder_tok_8>",
|
13 |
+
"<placeholder_tok_9>",
|
14 |
+
"<placeholder_tok_10>",
|
15 |
+
"<placeholder_tok_11>",
|
16 |
+
"<placeholder_tok_12>",
|
17 |
+
"<placeholder_tok_13>",
|
18 |
+
"<placeholder_tok_14>",
|
19 |
+
"<placeholder_tok_15>",
|
20 |
+
"<placeholder_tok_16>",
|
21 |
+
"<placeholder_tok_17>",
|
22 |
+
"<placeholder_tok_18>",
|
23 |
+
"<placeholder_tok_19>",
|
24 |
+
"<placeholder_tok_20>",
|
25 |
+
"<placeholder_tok_21>",
|
26 |
+
"<placeholder_tok_22>",
|
27 |
+
"<placeholder_tok_23>",
|
28 |
+
"<placeholder_tok_24>",
|
29 |
+
"<placeholder_tok_25>",
|
30 |
+
"<placeholder_tok_26>",
|
31 |
+
"<placeholder_tok_27>",
|
32 |
+
"<placeholder_tok_28>",
|
33 |
+
"<placeholder_tok_29>",
|
34 |
+
"<placeholder_tok_30>",
|
35 |
+
"<placeholder_tok_31>",
|
36 |
+
"<placeholder_tok_32>",
|
37 |
+
"<placeholder_tok_33>",
|
38 |
+
"<placeholder_tok_34>",
|
39 |
+
"<placeholder_tok_35>",
|
40 |
+
"<placeholder_tok_36>",
|
41 |
+
"<placeholder_tok_37>",
|
42 |
+
"<placeholder_tok_38>",
|
43 |
+
"<placeholder_tok_39>",
|
44 |
+
"<placeholder_tok_40>",
|
45 |
+
"<placeholder_tok_41>",
|
46 |
+
"<placeholder_tok_42>",
|
47 |
+
"<placeholder_tok_43>",
|
48 |
+
"<placeholder_tok_44>",
|
49 |
+
"<placeholder_tok_45>",
|
50 |
+
"<placeholder_tok_46>",
|
51 |
+
"<placeholder_tok_47>",
|
52 |
+
"<placeholder_tok_48>",
|
53 |
+
"<placeholder_tok_49>",
|
54 |
+
"<placeholder_tok_50>",
|
55 |
+
"<placeholder_tok_51>",
|
56 |
+
"<placeholder_tok_52>",
|
57 |
+
"<placeholder_tok_53>",
|
58 |
+
"<placeholder_tok_54>",
|
59 |
+
"<placeholder_tok_55>",
|
60 |
+
"<placeholder_tok_56>",
|
61 |
+
"<placeholder_tok_57>",
|
62 |
+
"<placeholder_tok_58>",
|
63 |
+
"<placeholder_tok_59>",
|
64 |
+
"<placeholder_tok_60>",
|
65 |
+
"<placeholder_tok_61>",
|
66 |
+
"<placeholder_tok_62>",
|
67 |
+
"<placeholder_tok_63>",
|
68 |
+
"<placeholder_tok_64>",
|
69 |
+
"<placeholder_tok_65>",
|
70 |
+
"<placeholder_tok_66>",
|
71 |
+
"<placeholder_tok_67>",
|
72 |
+
"<placeholder_tok_68>",
|
73 |
+
"<placeholder_tok_69>",
|
74 |
+
"<placeholder_tok_70>",
|
75 |
+
"<placeholder_tok_71>",
|
76 |
+
"<placeholder_tok_72>",
|
77 |
+
"<placeholder_tok_73>",
|
78 |
+
"<placeholder_tok_74>",
|
79 |
+
"<placeholder_tok_75>",
|
80 |
+
"<placeholder_tok_76>",
|
81 |
+
"<placeholder_tok_77>",
|
82 |
+
"<placeholder_tok_78>",
|
83 |
+
"<placeholder_tok_79>",
|
84 |
+
"<placeholder_tok_80>",
|
85 |
+
"<placeholder_tok_81>",
|
86 |
+
"<placeholder_tok_82>",
|
87 |
+
"<placeholder_tok_83>",
|
88 |
+
"<placeholder_tok_84>",
|
89 |
+
"<placeholder_tok_85>",
|
90 |
+
"<placeholder_tok_86>",
|
91 |
+
"<placeholder_tok_87>",
|
92 |
+
"<placeholder_tok_88>",
|
93 |
+
"<placeholder_tok_89>",
|
94 |
+
"<placeholder_tok_90>",
|
95 |
+
"<placeholder_tok_91>",
|
96 |
+
"<placeholder_tok_92>",
|
97 |
+
"<placeholder_tok_93>",
|
98 |
+
"<placeholder_tok_94>",
|
99 |
+
"<placeholder_tok_95>",
|
100 |
+
"<placeholder_tok_96>",
|
101 |
+
"<placeholder_tok_97>",
|
102 |
+
"<placeholder_tok_98>",
|
103 |
+
"<placeholder_tok_99>",
|
104 |
+
"<placeholder_tok_100>",
|
105 |
+
"<placeholder_tok_101>",
|
106 |
+
"<placeholder_tok_102>",
|
107 |
+
"<placeholder_tok_103>",
|
108 |
+
"<placeholder_tok_104>",
|
109 |
+
"<placeholder_tok_105>",
|
110 |
+
"<placeholder_tok_106>",
|
111 |
+
"<placeholder_tok_107>",
|
112 |
+
"<placeholder_tok_108>",
|
113 |
+
"<placeholder_tok_109>",
|
114 |
+
"<placeholder_tok_110>",
|
115 |
+
"<placeholder_tok_111>",
|
116 |
+
"<placeholder_tok_112>",
|
117 |
+
"<placeholder_tok_113>",
|
118 |
+
"<placeholder_tok_114>",
|
119 |
+
"<placeholder_tok_115>",
|
120 |
+
"<placeholder_tok_116>",
|
121 |
+
"<placeholder_tok_117>",
|
122 |
+
"<placeholder_tok_118>",
|
123 |
+
"<placeholder_tok_119>",
|
124 |
+
"<placeholder_tok_120>",
|
125 |
+
"<placeholder_tok_121>",
|
126 |
+
"<placeholder_tok_122>",
|
127 |
+
"<placeholder_tok_123>",
|
128 |
+
"<placeholder_tok_124>",
|
129 |
+
"<placeholder_tok_125>",
|
130 |
+
"<placeholder_tok_126>",
|
131 |
+
"<placeholder_tok_127>",
|
132 |
+
"<placeholder_tok_128>",
|
133 |
+
"<placeholder_tok_129>",
|
134 |
+
"<placeholder_tok_130>",
|
135 |
+
"<placeholder_tok_131>",
|
136 |
+
"<placeholder_tok_132>",
|
137 |
+
"<placeholder_tok_133>",
|
138 |
+
"<placeholder_tok_134>",
|
139 |
+
"<placeholder_tok_135>",
|
140 |
+
"<placeholder_tok_136>",
|
141 |
+
"<placeholder_tok_137>",
|
142 |
+
"<placeholder_tok_138>",
|
143 |
+
"<placeholder_tok_139>",
|
144 |
+
"<placeholder_tok_140>",
|
145 |
+
"<placeholder_tok_141>",
|
146 |
+
"<placeholder_tok_142>",
|
147 |
+
"<placeholder_tok_143>",
|
148 |
+
"<placeholder_tok_144>",
|
149 |
+
"<placeholder_tok_145>",
|
150 |
+
"<placeholder_tok_146>",
|
151 |
+
"<placeholder_tok_147>",
|
152 |
+
"<placeholder_tok_148>",
|
153 |
+
"<placeholder_tok_149>",
|
154 |
+
"<placeholder_tok_150>",
|
155 |
+
"<placeholder_tok_151>",
|
156 |
+
"<placeholder_tok_152>",
|
157 |
+
"<placeholder_tok_153>",
|
158 |
+
"<placeholder_tok_154>",
|
159 |
+
"<placeholder_tok_155>",
|
160 |
+
"<placeholder_tok_156>",
|
161 |
+
"<placeholder_tok_157>",
|
162 |
+
"<placeholder_tok_158>",
|
163 |
+
"<placeholder_tok_159>",
|
164 |
+
"<placeholder_tok_160>",
|
165 |
+
"<placeholder_tok_161>",
|
166 |
+
"<placeholder_tok_162>",
|
167 |
+
"<placeholder_tok_163>",
|
168 |
+
"<placeholder_tok_164>",
|
169 |
+
"<placeholder_tok_165>",
|
170 |
+
"<placeholder_tok_166>",
|
171 |
+
"<placeholder_tok_167>",
|
172 |
+
"<placeholder_tok_168>",
|
173 |
+
"<placeholder_tok_169>",
|
174 |
+
"<placeholder_tok_170>",
|
175 |
+
"<placeholder_tok_171>",
|
176 |
+
"<placeholder_tok_172>",
|
177 |
+
"<placeholder_tok_173>",
|
178 |
+
"<placeholder_tok_174>",
|
179 |
+
"<placeholder_tok_175>",
|
180 |
+
"<placeholder_tok_176>",
|
181 |
+
"<placeholder_tok_177>",
|
182 |
+
"<placeholder_tok_178>",
|
183 |
+
"<placeholder_tok_179>",
|
184 |
+
"<placeholder_tok_180>",
|
185 |
+
"<placeholder_tok_181>",
|
186 |
+
"<placeholder_tok_182>",
|
187 |
+
"<placeholder_tok_183>",
|
188 |
+
"<placeholder_tok_184>",
|
189 |
+
"<placeholder_tok_185>",
|
190 |
+
"<placeholder_tok_186>",
|
191 |
+
"<placeholder_tok_187>",
|
192 |
+
"<placeholder_tok_188>",
|
193 |
+
"<placeholder_tok_189>",
|
194 |
+
"<placeholder_tok_190>",
|
195 |
+
"<placeholder_tok_191>",
|
196 |
+
"<placeholder_tok_192>",
|
197 |
+
"<placeholder_tok_193>",
|
198 |
+
"<placeholder_tok_194>",
|
199 |
+
"<placeholder_tok_195>",
|
200 |
+
"<placeholder_tok_196>",
|
201 |
+
"<placeholder_tok_197>",
|
202 |
+
"<placeholder_tok_198>",
|
203 |
+
"<placeholder_tok_199>",
|
204 |
+
"<placeholder_tok_200>",
|
205 |
+
"<placeholder_tok_201>",
|
206 |
+
"<placeholder_tok_202>",
|
207 |
+
"<placeholder_tok_203>",
|
208 |
+
"<placeholder_tok_204>",
|
209 |
+
"<placeholder_tok_205>",
|
210 |
+
"<placeholder_tok_206>",
|
211 |
+
"<placeholder_tok_207>",
|
212 |
+
"<placeholder_tok_208>",
|
213 |
+
"<placeholder_tok_209>",
|
214 |
+
"<placeholder_tok_210>",
|
215 |
+
"<placeholder_tok_211>",
|
216 |
+
"<placeholder_tok_212>",
|
217 |
+
"<placeholder_tok_213>",
|
218 |
+
"<placeholder_tok_214>",
|
219 |
+
"<placeholder_tok_215>",
|
220 |
+
"<placeholder_tok_216>",
|
221 |
+
"<placeholder_tok_217>",
|
222 |
+
"<placeholder_tok_218>",
|
223 |
+
"<placeholder_tok_219>",
|
224 |
+
"<placeholder_tok_220>",
|
225 |
+
"<placeholder_tok_221>",
|
226 |
+
"<placeholder_tok_222>",
|
227 |
+
"<placeholder_tok_223>",
|
228 |
+
"<placeholder_tok_224>",
|
229 |
+
"<placeholder_tok_225>",
|
230 |
+
"<placeholder_tok_226>",
|
231 |
+
"<placeholder_tok_227>",
|
232 |
+
"<placeholder_tok_228>",
|
233 |
+
"<placeholder_tok_229>",
|
234 |
+
"<placeholder_tok_230>",
|
235 |
+
"<placeholder_tok_231>",
|
236 |
+
"<placeholder_tok_232>",
|
237 |
+
"<placeholder_tok_233>",
|
238 |
+
"<placeholder_tok_234>",
|
239 |
+
"<placeholder_tok_235>",
|
240 |
+
"<placeholder_tok_236>",
|
241 |
+
"<placeholder_tok_237>",
|
242 |
+
"<placeholder_tok_238>",
|
243 |
+
"<placeholder_tok_239>",
|
244 |
+
"<placeholder_tok_240>",
|
245 |
+
"<placeholder_tok_241>",
|
246 |
+
"<placeholder_tok_242>",
|
247 |
+
"<placeholder_tok_243>",
|
248 |
+
"<placeholder_tok_244>",
|
249 |
+
"<placeholder_tok_245>",
|
250 |
+
"<placeholder_tok_246>",
|
251 |
+
"<placeholder_tok_247>",
|
252 |
+
"<placeholder_tok_248>",
|
253 |
+
"<placeholder_tok_249>",
|
254 |
+
"<placeholder_tok_250>",
|
255 |
+
"<placeholder_tok_251>",
|
256 |
+
"<placeholder_tok_252>",
|
257 |
+
"<placeholder_tok_253>",
|
258 |
+
"<placeholder_tok_254>",
|
259 |
+
"<placeholder_tok_255>"
|
260 |
+
],
|
261 |
+
"bos_token": "<s>",
|
262 |
+
"eos_token": "<eod>",
|
263 |
+
"pad_token": "<pad>"
|
264 |
+
}
|
tokenizer.model
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:08d0c8316539a853f2fe6e14f51f0df583011dfb078fa08c8b6dc5c15a19a7e6
|
3 |
+
size 4719922
|
tokenizer_config.json
ADDED
@@ -0,0 +1,292 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"num_threads": 224,
|
3 |
+
"split_by_whitespace": true,
|
4 |
+
"model_type": "unigram",
|
5 |
+
"vocab_size": 250680,
|
6 |
+
"character_coverage": 0.9999,
|
7 |
+
"byte_fallback": true,
|
8 |
+
"split_by_number": true,
|
9 |
+
"split_digits": true,
|
10 |
+
"normalization_rule_name": "nfkc",
|
11 |
+
"max_sentence_length": 4096,
|
12 |
+
"shuffle_input_sentence": true,
|
13 |
+
"input_sentence_size": 0,
|
14 |
+
"train_extremely_large_corpus": true,
|
15 |
+
"allow_whitespace_only_pieces": true,
|
16 |
+
"required_chars": "",
|
17 |
+
"remove_extra_whitespaces": false,
|
18 |
+
"user_defined_symbols": [
|
19 |
+
"<s>",
|
20 |
+
"</s>",
|
21 |
+
"<pad>",
|
22 |
+
"<eod>",
|
23 |
+
"<placeholder_tok_0>",
|
24 |
+
"<placeholder_tok_1>",
|
25 |
+
"<placeholder_tok_2>",
|
26 |
+
"<placeholder_tok_3>",
|
27 |
+
"<placeholder_tok_4>",
|
28 |
+
"<placeholder_tok_5>",
|
29 |
+
"<placeholder_tok_6>",
|
30 |
+
"<placeholder_tok_7>",
|
31 |
+
"<placeholder_tok_8>",
|
32 |
+
"<placeholder_tok_9>",
|
33 |
+
"<placeholder_tok_10>",
|
34 |
+
"<placeholder_tok_11>",
|
35 |
+
"<placeholder_tok_12>",
|
36 |
+
"<placeholder_tok_13>",
|
37 |
+
"<placeholder_tok_14>",
|
38 |
+
"<placeholder_tok_15>",
|
39 |
+
"<placeholder_tok_16>",
|
40 |
+
"<placeholder_tok_17>",
|
41 |
+
"<placeholder_tok_18>",
|
42 |
+
"<placeholder_tok_19>",
|
43 |
+
"<placeholder_tok_20>",
|
44 |
+
"<placeholder_tok_21>",
|
45 |
+
"<placeholder_tok_22>",
|
46 |
+
"<placeholder_tok_23>",
|
47 |
+
"<placeholder_tok_24>",
|
48 |
+
"<placeholder_tok_25>",
|
49 |
+
"<placeholder_tok_26>",
|
50 |
+
"<placeholder_tok_27>",
|
51 |
+
"<placeholder_tok_28>",
|
52 |
+
"<placeholder_tok_29>",
|
53 |
+
"<placeholder_tok_30>",
|
54 |
+
"<placeholder_tok_31>",
|
55 |
+
"<placeholder_tok_32>",
|
56 |
+
"<placeholder_tok_33>",
|
57 |
+
"<placeholder_tok_34>",
|
58 |
+
"<placeholder_tok_35>",
|
59 |
+
"<placeholder_tok_36>",
|
60 |
+
"<placeholder_tok_37>",
|
61 |
+
"<placeholder_tok_38>",
|
62 |
+
"<placeholder_tok_39>",
|
63 |
+
"<placeholder_tok_40>",
|
64 |
+
"<placeholder_tok_41>",
|
65 |
+
"<placeholder_tok_42>",
|
66 |
+
"<placeholder_tok_43>",
|
67 |
+
"<placeholder_tok_44>",
|
68 |
+
"<placeholder_tok_45>",
|
69 |
+
"<placeholder_tok_46>",
|
70 |
+
"<placeholder_tok_47>",
|
71 |
+
"<placeholder_tok_48>",
|
72 |
+
"<placeholder_tok_49>",
|
73 |
+
"<placeholder_tok_50>",
|
74 |
+
"<placeholder_tok_51>",
|
75 |
+
"<placeholder_tok_52>",
|
76 |
+
"<placeholder_tok_53>",
|
77 |
+
"<placeholder_tok_54>",
|
78 |
+
"<placeholder_tok_55>",
|
79 |
+
"<placeholder_tok_56>",
|
80 |
+
"<placeholder_tok_57>",
|
81 |
+
"<placeholder_tok_58>",
|
82 |
+
"<placeholder_tok_59>",
|
83 |
+
"<placeholder_tok_60>",
|
84 |
+
"<placeholder_tok_61>",
|
85 |
+
"<placeholder_tok_62>",
|
86 |
+
"<placeholder_tok_63>",
|
87 |
+
"<placeholder_tok_64>",
|
88 |
+
"<placeholder_tok_65>",
|
89 |
+
"<placeholder_tok_66>",
|
90 |
+
"<placeholder_tok_67>",
|
91 |
+
"<placeholder_tok_68>",
|
92 |
+
"<placeholder_tok_69>",
|
93 |
+
"<placeholder_tok_70>",
|
94 |
+
"<placeholder_tok_71>",
|
95 |
+
"<placeholder_tok_72>",
|
96 |
+
"<placeholder_tok_73>",
|
97 |
+
"<placeholder_tok_74>",
|
98 |
+
"<placeholder_tok_75>",
|
99 |
+
"<placeholder_tok_76>",
|
100 |
+
"<placeholder_tok_77>",
|
101 |
+
"<placeholder_tok_78>",
|
102 |
+
"<placeholder_tok_79>",
|
103 |
+
"<placeholder_tok_80>",
|
104 |
+
"<placeholder_tok_81>",
|
105 |
+
"<placeholder_tok_82>",
|
106 |
+
"<placeholder_tok_83>",
|
107 |
+
"<placeholder_tok_84>",
|
108 |
+
"<placeholder_tok_85>",
|
109 |
+
"<placeholder_tok_86>",
|
110 |
+
"<placeholder_tok_87>",
|
111 |
+
"<placeholder_tok_88>",
|
112 |
+
"<placeholder_tok_89>",
|
113 |
+
"<placeholder_tok_90>",
|
114 |
+
"<placeholder_tok_91>",
|
115 |
+
"<placeholder_tok_92>",
|
116 |
+
"<placeholder_tok_93>",
|
117 |
+
"<placeholder_tok_94>",
|
118 |
+
"<placeholder_tok_95>",
|
119 |
+
"<placeholder_tok_96>",
|
120 |
+
"<placeholder_tok_97>",
|
121 |
+
"<placeholder_tok_98>",
|
122 |
+
"<placeholder_tok_99>",
|
123 |
+
"<placeholder_tok_100>",
|
124 |
+
"<placeholder_tok_101>",
|
125 |
+
"<placeholder_tok_102>",
|
126 |
+
"<placeholder_tok_103>",
|
127 |
+
"<placeholder_tok_104>",
|
128 |
+
"<placeholder_tok_105>",
|
129 |
+
"<placeholder_tok_106>",
|
130 |
+
"<placeholder_tok_107>",
|
131 |
+
"<placeholder_tok_108>",
|
132 |
+
"<placeholder_tok_109>",
|
133 |
+
"<placeholder_tok_110>",
|
134 |
+
"<placeholder_tok_111>",
|
135 |
+
"<placeholder_tok_112>",
|
136 |
+
"<placeholder_tok_113>",
|
137 |
+
"<placeholder_tok_114>",
|
138 |
+
"<placeholder_tok_115>",
|
139 |
+
"<placeholder_tok_116>",
|
140 |
+
"<placeholder_tok_117>",
|
141 |
+
"<placeholder_tok_118>",
|
142 |
+
"<placeholder_tok_119>",
|
143 |
+
"<placeholder_tok_120>",
|
144 |
+
"<placeholder_tok_121>",
|
145 |
+
"<placeholder_tok_122>",
|
146 |
+
"<placeholder_tok_123>",
|
147 |
+
"<placeholder_tok_124>",
|
148 |
+
"<placeholder_tok_125>",
|
149 |
+
"<placeholder_tok_126>",
|
150 |
+
"<placeholder_tok_127>",
|
151 |
+
"<placeholder_tok_128>",
|
152 |
+
"<placeholder_tok_129>",
|
153 |
+
"<placeholder_tok_130>",
|
154 |
+
"<placeholder_tok_131>",
|
155 |
+
"<placeholder_tok_132>",
|
156 |
+
"<placeholder_tok_133>",
|
157 |
+
"<placeholder_tok_134>",
|
158 |
+
"<placeholder_tok_135>",
|
159 |
+
"<placeholder_tok_136>",
|
160 |
+
"<placeholder_tok_137>",
|
161 |
+
"<placeholder_tok_138>",
|
162 |
+
"<placeholder_tok_139>",
|
163 |
+
"<placeholder_tok_140>",
|
164 |
+
"<placeholder_tok_141>",
|
165 |
+
"<placeholder_tok_142>",
|
166 |
+
"<placeholder_tok_143>",
|
167 |
+
"<placeholder_tok_144>",
|
168 |
+
"<placeholder_tok_145>",
|
169 |
+
"<placeholder_tok_146>",
|
170 |
+
"<placeholder_tok_147>",
|
171 |
+
"<placeholder_tok_148>",
|
172 |
+
"<placeholder_tok_149>",
|
173 |
+
"<placeholder_tok_150>",
|
174 |
+
"<placeholder_tok_151>",
|
175 |
+
"<placeholder_tok_152>",
|
176 |
+
"<placeholder_tok_153>",
|
177 |
+
"<placeholder_tok_154>",
|
178 |
+
"<placeholder_tok_155>",
|
179 |
+
"<placeholder_tok_156>",
|
180 |
+
"<placeholder_tok_157>",
|
181 |
+
"<placeholder_tok_158>",
|
182 |
+
"<placeholder_tok_159>",
|
183 |
+
"<placeholder_tok_160>",
|
184 |
+
"<placeholder_tok_161>",
|
185 |
+
"<placeholder_tok_162>",
|
186 |
+
"<placeholder_tok_163>",
|
187 |
+
"<placeholder_tok_164>",
|
188 |
+
"<placeholder_tok_165>",
|
189 |
+
"<placeholder_tok_166>",
|
190 |
+
"<placeholder_tok_167>",
|
191 |
+
"<placeholder_tok_168>",
|
192 |
+
"<placeholder_tok_169>",
|
193 |
+
"<placeholder_tok_170>",
|
194 |
+
"<placeholder_tok_171>",
|
195 |
+
"<placeholder_tok_172>",
|
196 |
+
"<placeholder_tok_173>",
|
197 |
+
"<placeholder_tok_174>",
|
198 |
+
"<placeholder_tok_175>",
|
199 |
+
"<placeholder_tok_176>",
|
200 |
+
"<placeholder_tok_177>",
|
201 |
+
"<placeholder_tok_178>",
|
202 |
+
"<placeholder_tok_179>",
|
203 |
+
"<placeholder_tok_180>",
|
204 |
+
"<placeholder_tok_181>",
|
205 |
+
"<placeholder_tok_182>",
|
206 |
+
"<placeholder_tok_183>",
|
207 |
+
"<placeholder_tok_184>",
|
208 |
+
"<placeholder_tok_185>",
|
209 |
+
"<placeholder_tok_186>",
|
210 |
+
"<placeholder_tok_187>",
|
211 |
+
"<placeholder_tok_188>",
|
212 |
+
"<placeholder_tok_189>",
|
213 |
+
"<placeholder_tok_190>",
|
214 |
+
"<placeholder_tok_191>",
|
215 |
+
"<placeholder_tok_192>",
|
216 |
+
"<placeholder_tok_193>",
|
217 |
+
"<placeholder_tok_194>",
|
218 |
+
"<placeholder_tok_195>",
|
219 |
+
"<placeholder_tok_196>",
|
220 |
+
"<placeholder_tok_197>",
|
221 |
+
"<placeholder_tok_198>",
|
222 |
+
"<placeholder_tok_199>",
|
223 |
+
"<placeholder_tok_200>",
|
224 |
+
"<placeholder_tok_201>",
|
225 |
+
"<placeholder_tok_202>",
|
226 |
+
"<placeholder_tok_203>",
|
227 |
+
"<placeholder_tok_204>",
|
228 |
+
"<placeholder_tok_205>",
|
229 |
+
"<placeholder_tok_206>",
|
230 |
+
"<placeholder_tok_207>",
|
231 |
+
"<placeholder_tok_208>",
|
232 |
+
"<placeholder_tok_209>",
|
233 |
+
"<placeholder_tok_210>",
|
234 |
+
"<placeholder_tok_211>",
|
235 |
+
"<placeholder_tok_212>",
|
236 |
+
"<placeholder_tok_213>",
|
237 |
+
"<placeholder_tok_214>",
|
238 |
+
"<placeholder_tok_215>",
|
239 |
+
"<placeholder_tok_216>",
|
240 |
+
"<placeholder_tok_217>",
|
241 |
+
"<placeholder_tok_218>",
|
242 |
+
"<placeholder_tok_219>",
|
243 |
+
"<placeholder_tok_220>",
|
244 |
+
"<placeholder_tok_221>",
|
245 |
+
"<placeholder_tok_222>",
|
246 |
+
"<placeholder_tok_223>",
|
247 |
+
"<placeholder_tok_224>",
|
248 |
+
"<placeholder_tok_225>",
|
249 |
+
"<placeholder_tok_226>",
|
250 |
+
"<placeholder_tok_227>",
|
251 |
+
"<placeholder_tok_228>",
|
252 |
+
"<placeholder_tok_229>",
|
253 |
+
"<placeholder_tok_230>",
|
254 |
+
"<placeholder_tok_231>",
|
255 |
+
"<placeholder_tok_232>",
|
256 |
+
"<placeholder_tok_233>",
|
257 |
+
"<placeholder_tok_234>",
|
258 |
+
"<placeholder_tok_235>",
|
259 |
+
"<placeholder_tok_236>",
|
260 |
+
"<placeholder_tok_237>",
|
261 |
+
"<placeholder_tok_238>",
|
262 |
+
"<placeholder_tok_239>",
|
263 |
+
"<placeholder_tok_240>",
|
264 |
+
"<placeholder_tok_241>",
|
265 |
+
"<placeholder_tok_242>",
|
266 |
+
"<placeholder_tok_243>",
|
267 |
+
"<placeholder_tok_244>",
|
268 |
+
"<placeholder_tok_245>",
|
269 |
+
"<placeholder_tok_246>",
|
270 |
+
"<placeholder_tok_247>",
|
271 |
+
"<placeholder_tok_248>",
|
272 |
+
"<placeholder_tok_249>",
|
273 |
+
"<placeholder_tok_250>",
|
274 |
+
"<placeholder_tok_251>",
|
275 |
+
"<placeholder_tok_252>",
|
276 |
+
"<placeholder_tok_253>",
|
277 |
+
"<placeholder_tok_254>",
|
278 |
+
"<placeholder_tok_255>"
|
279 |
+
],
|
280 |
+
"datasets_dir": "/home/fhgiais/gptx_ablations/bias_analysis/data/tokenizer/temp/",
|
281 |
+
"save_dir": "/home/fhgiais/gptx_ablations/bias_analysis/tokenizer/24",
|
282 |
+
"text_key": "text",
|
283 |
+
"cache_dir": "/home/fhgiais/gptx_ablations/bias_analysis/tokenizer/24/cache",
|
284 |
+
"library": "sentencepiece",
|
285 |
+
"auto_map": {
|
286 |
+
"AutoTokenizer": [
|
287 |
+
"gptx_tokenizer.SPTokenizer",
|
288 |
+
null
|
289 |
+
]
|
290 |
+
},
|
291 |
+
"tokenizer_class": "SPTokenizer"
|
292 |
+
}
|