Add SetFit model
- .gitattributes +1 -0
- 1_Pooling/config.json +10 -0
- README.md +212 -0
- config.json +29 -0
- config_sentence_transformers.json +10 -0
- config_setfit.json +4 -0
- model.safetensors +3 -0
- model_head.pkl +3 -0
- modules.json +14 -0
- sentence_bert_config.json +4 -0
- sentencepiece.bpe.model +3 -0
- special_tokens_map.json +51 -0
- tokenizer.json +3 -0
- tokenizer_config.json +61 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
36 |
+
tokenizer.json filter=lfs diff=lfs merge=lfs -text
|
1_Pooling/config.json
ADDED
@@ -0,0 +1,10 @@
|
1 |
+
{
|
2 |
+
"word_embedding_dimension": 768,
|
3 |
+
"pooling_mode_cls_token": false,
|
4 |
+
"pooling_mode_mean_tokens": true,
|
5 |
+
"pooling_mode_max_tokens": false,
|
6 |
+
"pooling_mode_mean_sqrt_len_tokens": false,
|
7 |
+
"pooling_mode_weightedmean_tokens": false,
|
8 |
+
"pooling_mode_lasttoken": false,
|
9 |
+
"include_prompt": true
|
10 |
+
}
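With `pooling_mode_mean_tokens` set to true and every other mode false, the sentence embedding is the average of the token embeddings, skipping padding positions. A minimal pure-Python sketch of that idea follows; the function name `mean_pool` and the toy 3-dimensional vectors are illustrative assumptions (the real model works on 768-dimensional tensors inside sentence-transformers).

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the embeddings of non-padding tokens (mask value 1)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy example: the third token is padding and is ignored.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0, 4.0]
```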
|
README.md
ADDED
@@ -0,0 +1,212 @@
|
1 |
+
---
|
2 |
+
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
|
3 |
+
library_name: setfit
|
4 |
+
metrics:
|
5 |
+
- f1
|
6 |
+
pipeline_tag: text-classification
|
7 |
+
tags:
|
8 |
+
- setfit
|
9 |
+
- sentence-transformers
|
10 |
+
- text-classification
|
11 |
+
- generated_from_setfit_trainer
|
12 |
+
widget:
|
13 |
+
- text: Politically Motivated Murders Increased by 80% in October On November 7, news
|
14 |
+
outlets reported that murders due to political violence in Colombia increased
|
15 |
+
in October by 80%, according to the Resource Center for the Analysis of Conflicts.
|
16 |
+
- text: Ils auraient menacé la femme d’un VDP et fouiller leur avant de repartir avec
|
17 |
+
une arme.
|
18 |
+
- text: En rappel, cette décision de la réouverture des points de vente de céréales
|
19 |
+
au profit des personnes vulnerables pendant le premier trimestre de 2021, a été
|
20 |
+
prise par le Conseil des ministres du 24 février 2021.
|
21 |
+
- text: IRC clinics have seen double the number of patients this month due to increasing
|
22 |
+
pressure on other facilities where there are PPE shortages or a reduction in health
|
23 |
+
staff who have had to self-isolate as a precaution.
|
24 |
+
- text: Según los hallazgos de las instituciones que participaron en la misión recientemente,
|
25 |
+
se conoció que las comunidades que continúan en el resguardo están en riesgo de
|
26 |
+
desplazamiento hacia Montería, debido a la continuidad de combates, operaciones
|
27 |
+
militares y presencia activa del GDO.
|
28 |
+
inference: true
|
29 |
+
model-index:
|
30 |
+
- name: SetFit with sentence-transformers/paraphrase-multilingual-mpnet-base-v2
|
31 |
+
results:
|
32 |
+
- task:
|
33 |
+
type: text-classification
|
34 |
+
name: Text Classification
|
35 |
+
dataset:
|
36 |
+
name: Unknown
|
37 |
+
type: unknown
|
38 |
+
split: test
|
39 |
+
metrics:
|
40 |
+
- type: f1
|
41 |
+
value: 0.7804878048780488
|
42 |
+
name: F1
|
43 |
+
---
|
44 |
+
|
45 |
+
# SetFit with sentence-transformers/paraphrase-multilingual-mpnet-base-v2
|
46 |
+
|
47 |
+
This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
|
48 |
+
|
49 |
+
The model has been trained using an efficient few-shot learning technique that involves:
|
50 |
+
|
51 |
+
1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
|
52 |
+
2. Training a classification head with features from the fine-tuned Sentence Transformer.
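
The first step above depends on turning a small labeled set into sentence pairs. The sketch below shows one simplified way to build such pairs: for each sentence and each iteration, one same-label pair (target similarity 1.0) and one different-label pair (target 0.0). The helper `make_pairs` and the toy sentences are hypothetical; this is not SetFit's actual sampler, only an illustration of the idea.

```python
import random
from typing import List, Tuple


def make_pairs(
    texts: List[str], labels: List[int], num_iterations: int, seed: int = 42
) -> List[Tuple[str, str, float]]:
    """Build (sentence_a, sentence_b, target_similarity) training pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for text, label in zip(texts, labels):
            # Candidates sharing the label (excluding the sentence itself).
            positives = [t for t, l in zip(texts, labels) if l == label and t != text]
            # Candidates from any other label.
            negatives = [t for t, l in zip(texts, labels) if l != label]
            pairs.append((text, rng.choice(positives), 1.0))
            pairs.append((text, rng.choice(negatives), 0.0))
    return pairs


# Toy data: two sentences per class (labels 0 and 1).
texts = ["a0", "b0", "a1", "b1"]
labels = [0, 0, 1, 1]
pairs = make_pairs(texts, labels, num_iterations=3)
print(len(pairs))  # 2 pairs per sentence per iteration
```

In the second step, the fine-tuned body embeds each training sentence and those embeddings are used as features for the classification head.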
|
53 |
+
|
54 |
+
## Model Details
|
55 |
+
|
56 |
+
### Model Description
|
57 |
+
- **Model Type:** SetFit
|
58 |
+
- **Sentence Transformer body:** [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)
|
59 |
+
- **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
|
60 |
+
- **Maximum Sequence Length:** 128 tokens
|
61 |
+
- **Number of Classes:** 2 classes
|
62 |
+
<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
|
63 |
+
<!-- - **Language:** Unknown -->
|
64 |
+
<!-- - **License:** Unknown -->
|
65 |
+
|
66 |
+
### Model Sources
|
67 |
+
|
68 |
+
- **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
|
69 |
+
- **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
|
70 |
+
- **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
|
71 |
+
|
72 |
+
### Model Labels
|
73 |
+
| Label | Examples |
|
74 |
+
|:------|:---------|
|
75 |
+
| 1 | <ul><li>'Meanwhile in Mentao camp, access had been cut off for more than a year, following a series of attacks.'</li><li>'La Fundación Desarrollo y Paz (Fundepaz) hizo pública este domingo, la amenaza que recibió el municipio de Cumbales, departamento de Nariño, por un supuesto panfleto atribuido al ELN en el que lanzan varias amenazas en contra de la población civil y ordenan restricciones a la movilidad.'</li><li>'Most focal points (82%) continue to report that living conditions have worsened for their communities since the beginning of the pandemic, more so in low-density areas (96%).All focal points in SDF and 94% in GoS areas say this, but in NSAG/TBAF areas, just over half report little change in people’s ability to meet their needs, and only 42% note a deterioration.'</li></ul> |
|
76 |
+
| 0 | <ul><li>'The murders increased from 10 in August to 18 in October.'</li><li>'Cette tendance s’explique par le fait que ces trois (3) régions ont connu des attaques terroristes et en subissent les conséquences de l’insécurité.'</li><li>'In addition to Protection, Child Protection, and GBV focal points, the Protection Sector has also activated its Protection Emergency Response Units to disseminate messages and ensure that the most vulnerable are reached.'</li></ul> |
|
77 |
+
|
78 |
+
## Evaluation
|
79 |
+
|
80 |
+
### Metrics
|
81 |
+
| Label | F1 |
|
82 |
+
|:--------|:-------|
|
83 |
+
| **all** | 0.7805 |
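
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The counts are invented for illustration only: the card does not report the confusion matrix, nor whether the score is binary or averaged over classes.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from this model's evaluation.
example_f1 = f1_score(tp=8, fp=2, fn=2)
print(round(example_f1, 4))
```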
|
84 |
+
|
85 |
+
## Uses
|
86 |
+
|
87 |
+
### Direct Use for Inference
|
88 |
+
|
89 |
+
First install the SetFit library:
|
90 |
+
|
91 |
+
```bash
|
92 |
+
pip install setfit
|
93 |
+
```
|
94 |
+
|
95 |
+
Then you can load this model and run inference.
|
96 |
+
|
97 |
+
```python
|
98 |
+
from setfit import SetFitModel
|
99 |
+
|
100 |
+
# Download from the 🤗 Hub
|
101 |
+
model = SetFitModel.from_pretrained("Sfekih/sentence_independancy_model")
|
102 |
+
# Run inference
|
103 |
+
preds = model("Ils auraient menacé la femme d’un VDP et fouiller leur avant de repartir avec une arme.")
|
104 |
+
```
|
105 |
+
|
106 |
+
<!--
|
107 |
+
### Downstream Use
|
108 |
+
|
109 |
+
*List how someone could finetune this model on their own dataset.*
|
110 |
+
-->
|
111 |
+
|
112 |
+
<!--
|
113 |
+
### Out-of-Scope Use
|
114 |
+
|
115 |
+
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
|
116 |
+
-->
|
117 |
+
|
118 |
+
<!--
|
119 |
+
## Bias, Risks and Limitations
|
120 |
+
|
121 |
+
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
|
122 |
+
-->
|
123 |
+
|
124 |
+
<!--
|
125 |
+
### Recommendations
|
126 |
+
|
127 |
+
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
|
128 |
+
-->
|
129 |
+
|
130 |
+
## Training Details
|
131 |
+
|
132 |
+
### Training Set Metrics
|
133 |
+
| Training set | Min | Mean | Max |
|
134 |
+
|:-------------|:----|:--------|:----|
|
135 |
+
| Word count | 3 | 24.4407 | 78 |
|
136 |
+
|
137 |
+
| Label | Training Sample Count |
|
138 |
+
|:------|:----------------------|
|
139 |
+
| 0 | 59 |
|
140 |
+
| 1 | 59 |
|
141 |
+
|
142 |
+
### Training Hyperparameters
|
143 |
+
- batch_size: (32, 32)
|
144 |
+
- num_epochs: (1, 1)
|
145 |
+
- max_steps: -1
|
146 |
+
- sampling_strategy: oversampling
|
147 |
+
- num_iterations: 35
|
148 |
+
- body_learning_rate: (2e-05, 2e-05)
|
149 |
+
- head_learning_rate: 2e-05
|
150 |
+
- loss: CosineSimilarityLoss
|
151 |
+
- distance_metric: cosine_distance
|
152 |
+
- margin: 0.25
|
153 |
+
- end_to_end: False
|
154 |
+
- use_amp: False
|
155 |
+
- warmup_proportion: 0.1
|
156 |
+
- l2_weight: 0.01
|
157 |
+
- seed: 42
|
158 |
+
- eval_max_steps: -1
|
159 |
+
- load_best_model_at_end: False
|
160 |
+
|
161 |
+
### Training Results
|
162 |
+
| Epoch | Step | Training Loss | Validation Loss |
|
163 |
+
|:------:|:----:|:-------------:|:---------------:|
|
164 |
+
| 0.0039 | 1 | 0.2854 | - |
|
165 |
+
| 0.1931 | 50 | 0.2645 | - |
|
166 |
+
| 0.3861 | 100 | 0.0945 | - |
|
167 |
+
| 0.5792 | 150 | 0.0022 | - |
|
168 |
+
| 0.7722 | 200 | 0.0008 | - |
|
169 |
+
| 0.9653 | 250 | 0.0006 | - |
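
The epoch fractions in this table can be reproduced from the hyperparameters and sample counts listed above. The sketch assumes that each of the 118 training samples contributes one positive and one negative pair per iteration; that assumption is not stated in the card, but it is consistent with the logged values.

```python
import math

n_samples = 59 + 59      # training sample counts for labels 0 and 1
num_iterations = 35      # from the training hyperparameters
batch_size = 32          # embedding-phase batch size

# Assumption: 2 pairs (one positive, one negative) per sample per iteration.
total_pairs = 2 * num_iterations * n_samples
steps_per_epoch = math.ceil(total_pairs / batch_size)


def epoch_at(step: int) -> float:
    """Fraction of the single training epoch completed at a given step."""
    return round(step / steps_per_epoch, 4)


print(total_pairs, steps_per_epoch)
for step in (1, 50, 100, 150, 200, 250):
    print(step, epoch_at(step))
```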
|
170 |
+
|
171 |
+
### Framework Versions
|
172 |
+
- Python: 3.10.12
|
173 |
+
- SetFit: 1.1.0
|
174 |
+
- Sentence Transformers: 3.1.1
|
175 |
+
- Transformers: 4.44.2
|
176 |
+
- PyTorch: 2.4.1+cu121
|
177 |
+
- Datasets: 3.0.1
|
178 |
+
- Tokenizers: 0.19.1
|
179 |
+
|
180 |
+
## Citation
|
181 |
+
|
182 |
+
### BibTeX
|
183 |
+
```bibtex
|
184 |
+
@article{https://doi.org/10.48550/arxiv.2209.11055,
|
185 |
+
doi = {10.48550/ARXIV.2209.11055},
|
186 |
+
url = {https://arxiv.org/abs/2209.11055},
|
187 |
+
author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
|
188 |
+
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
|
189 |
+
title = {Efficient Few-Shot Learning Without Prompts},
|
190 |
+
publisher = {arXiv},
|
191 |
+
year = {2022},
|
192 |
+
copyright = {Creative Commons Attribution 4.0 International}
|
193 |
+
}
|
194 |
+
```
|
195 |
+
|
196 |
+
<!--
|
197 |
+
## Glossary
|
198 |
+
|
199 |
+
*Clearly define terms in order to be accessible across audiences.*
|
200 |
+
-->
|
201 |
+
|
202 |
+
<!--
|
203 |
+
## Model Card Authors
|
204 |
+
|
205 |
+
*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
|
206 |
+
-->
|
207 |
+
|
208 |
+
<!--
|
209 |
+
## Model Card Contact
|
210 |
+
|
211 |
+
*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
|
212 |
+
-->
|
config.json
ADDED
@@ -0,0 +1,29 @@
|
1 |
+
{
|
2 |
+
"_name_or_path": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
|
3 |
+
"architectures": [
|
4 |
+
"XLMRobertaModel"
|
5 |
+
],
|
6 |
+
"attention_probs_dropout_prob": 0.1,
|
7 |
+
"bos_token_id": 0,
|
8 |
+
"classifier_dropout": null,
|
9 |
+
"eos_token_id": 2,
|
10 |
+
"gradient_checkpointing": false,
|
11 |
+
"hidden_act": "gelu",
|
12 |
+
"hidden_dropout_prob": 0.1,
|
13 |
+
"hidden_size": 768,
|
14 |
+
"initializer_range": 0.02,
|
15 |
+
"intermediate_size": 3072,
|
16 |
+
"layer_norm_eps": 1e-05,
|
17 |
+
"max_position_embeddings": 514,
|
18 |
+
"model_type": "xlm-roberta",
|
19 |
+
"num_attention_heads": 12,
|
20 |
+
"num_hidden_layers": 12,
|
21 |
+
"output_past": true,
|
22 |
+
"pad_token_id": 1,
|
23 |
+
"position_embedding_type": "absolute",
|
24 |
+
"torch_dtype": "float32",
|
25 |
+
"transformers_version": "4.44.2",
|
26 |
+
"type_vocab_size": 1,
|
27 |
+
"use_cache": true,
|
28 |
+
"vocab_size": 250002
|
29 |
+
}
|
config_sentence_transformers.json
ADDED
@@ -0,0 +1,10 @@
|
1 |
+
{
|
2 |
+
"__version__": {
|
3 |
+
"sentence_transformers": "3.1.1",
|
4 |
+
"transformers": "4.44.2",
|
5 |
+
"pytorch": "2.4.1+cu121"
|
6 |
+
},
|
7 |
+
"prompts": {},
|
8 |
+
"default_prompt_name": null,
|
9 |
+
"similarity_fn_name": null
|
10 |
+
}
|
config_setfit.json
ADDED
@@ -0,0 +1,4 @@
|
1 |
+
{
|
2 |
+
"labels": null,
|
3 |
+
"normalize_embeddings": false
|
4 |
+
}
|
model.safetensors
ADDED
@@ -0,0 +1,3 @@
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:947addd35bd056ba5d341ece63ea1fbd86968e47c186d47750f829202716a775
|
3 |
+
size 1112197096
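
The three lines above are a Git LFS pointer: the repository stores this small text stub, while the actual weights live in LFS storage. A minimal sketch of reading such a pointer follows; the helper name `parse_lfs_pointer` is hypothetical, and the parameter estimate assumes float32 weights (as `torch_dtype` in config.json suggests) and ignores the safetensors header.

```python
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Split a Git LFS pointer into version, hash algorithm, digest and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:947addd35bd056ba5d341ece63ea1fbd86968e47c186d47750f829202716a775
size 1112197096
"""
pointer = parse_lfs_pointer(pointer_text)
# Rough parameter count, assuming 4 bytes per float32 weight.
approx_params = pointer["size_bytes"] // 4
print(pointer["hash_algorithm"], pointer["size_bytes"], approx_params)
```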
|
model_head.pkl
ADDED
@@ -0,0 +1,3 @@
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:64a1325c4ba131c26e47376887312c005bcb3e827e60eadef8ec4ff00e3828bb
|
3 |
+
size 7007
|
modules.json
ADDED
@@ -0,0 +1,14 @@
|
1 |
+
[
|
2 |
+
{
|
3 |
+
"idx": 0,
|
4 |
+
"name": "0",
|
5 |
+
"path": "",
|
6 |
+
"type": "sentence_transformers.models.Transformer"
|
7 |
+
},
|
8 |
+
{
|
9 |
+
"idx": 1,
|
10 |
+
"name": "1",
|
11 |
+
"path": "1_Pooling",
|
12 |
+
"type": "sentence_transformers.models.Pooling"
|
13 |
+
}
|
14 |
+
]
|
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
|
1 |
+
{
|
2 |
+
"max_seq_length": 128,
|
3 |
+
"do_lower_case": false
|
4 |
+
}
|
sentencepiece.bpe.model
ADDED
@@ -0,0 +1,3 @@
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
|
3 |
+
size 5069051
|
special_tokens_map.json
ADDED
@@ -0,0 +1,51 @@
|
1 |
+
{
|
2 |
+
"bos_token": {
|
3 |
+
"content": "<s>",
|
4 |
+
"lstrip": false,
|
5 |
+
"normalized": false,
|
6 |
+
"rstrip": false,
|
7 |
+
"single_word": false
|
8 |
+
},
|
9 |
+
"cls_token": {
|
10 |
+
"content": "<s>",
|
11 |
+
"lstrip": false,
|
12 |
+
"normalized": false,
|
13 |
+
"rstrip": false,
|
14 |
+
"single_word": false
|
15 |
+
},
|
16 |
+
"eos_token": {
|
17 |
+
"content": "</s>",
|
18 |
+
"lstrip": false,
|
19 |
+
"normalized": false,
|
20 |
+
"rstrip": false,
|
21 |
+
"single_word": false
|
22 |
+
},
|
23 |
+
"mask_token": {
|
24 |
+
"content": "<mask>",
|
25 |
+
"lstrip": true,
|
26 |
+
"normalized": false,
|
27 |
+
"rstrip": false,
|
28 |
+
"single_word": false
|
29 |
+
},
|
30 |
+
"pad_token": {
|
31 |
+
"content": "<pad>",
|
32 |
+
"lstrip": false,
|
33 |
+
"normalized": false,
|
34 |
+
"rstrip": false,
|
35 |
+
"single_word": false
|
36 |
+
},
|
37 |
+
"sep_token": {
|
38 |
+
"content": "</s>",
|
39 |
+
"lstrip": false,
|
40 |
+
"normalized": false,
|
41 |
+
"rstrip": false,
|
42 |
+
"single_word": false
|
43 |
+
},
|
44 |
+
"unk_token": {
|
45 |
+
"content": "<unk>",
|
46 |
+
"lstrip": false,
|
47 |
+
"normalized": false,
|
48 |
+
"rstrip": false,
|
49 |
+
"single_word": false
|
50 |
+
}
|
51 |
+
}
|
tokenizer.json
ADDED
@@ -0,0 +1,3 @@
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:cad551d5600a84242d0973327029452a1e3672ba6313c2a3c3d69c4310e12719
|
3 |
+
size 17082987
|
tokenizer_config.json
ADDED
@@ -0,0 +1,61 @@
|
1 |
+
{
|
2 |
+
"added_tokens_decoder": {
|
3 |
+
"0": {
|
4 |
+
"content": "<s>",
|
5 |
+
"lstrip": false,
|
6 |
+
"normalized": false,
|
7 |
+
"rstrip": false,
|
8 |
+
"single_word": false,
|
9 |
+
"special": true
|
10 |
+
},
|
11 |
+
"1": {
|
12 |
+
"content": "<pad>",
|
13 |
+
"lstrip": false,
|
14 |
+
"normalized": false,
|
15 |
+
"rstrip": false,
|
16 |
+
"single_word": false,
|
17 |
+
"special": true
|
18 |
+
},
|
19 |
+
"2": {
|
20 |
+
"content": "</s>",
|
21 |
+
"lstrip": false,
|
22 |
+
"normalized": false,
|
23 |
+
"rstrip": false,
|
24 |
+
"single_word": false,
|
25 |
+
"special": true
|
26 |
+
},
|
27 |
+
"3": {
|
28 |
+
"content": "<unk>",
|
29 |
+
"lstrip": false,
|
30 |
+
"normalized": false,
|
31 |
+
"rstrip": false,
|
32 |
+
"single_word": false,
|
33 |
+
"special": true
|
34 |
+
},
|
35 |
+
"250001": {
|
36 |
+
"content": "<mask>",
|
37 |
+
"lstrip": true,
|
38 |
+
"normalized": false,
|
39 |
+
"rstrip": false,
|
40 |
+
"single_word": false,
|
41 |
+
"special": true
|
42 |
+
}
|
43 |
+
},
|
44 |
+
"bos_token": "<s>",
|
45 |
+
"clean_up_tokenization_spaces": true,
|
46 |
+
"cls_token": "<s>",
|
47 |
+
"eos_token": "</s>",
|
48 |
+
"mask_token": "<mask>",
|
49 |
+
"max_length": 128,
|
50 |
+
"model_max_length": 128,
|
51 |
+
"pad_to_multiple_of": null,
|
52 |
+
"pad_token": "<pad>",
|
53 |
+
"pad_token_type_id": 0,
|
54 |
+
"padding_side": "right",
|
55 |
+
"sep_token": "</s>",
|
56 |
+
"stride": 0,
|
57 |
+
"tokenizer_class": "XLMRobertaTokenizer",
|
58 |
+
"truncation_side": "right",
|
59 |
+
"truncation_strategy": "longest_first",
|
60 |
+
"unk_token": "<unk>"
|
61 |
+
}
|