davidadamczyk commited on
Commit
c6fa063
1 Parent(s): 5b671b0

Add SetFit model

Browse files
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "word_embedding_dimension": 768,
+ "pooling_mode_cls_token": false,
+ "pooling_mode_mean_tokens": true,
+ "pooling_mode_max_tokens": false,
+ "pooling_mode_mean_sqrt_len_tokens": false,
+ "pooling_mode_weightedmean_tokens": false,
+ "pooling_mode_lasttoken": false,
+ "include_prompt": true
+ }
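The config above selects mean pooling (`pooling_mode_mean_tokens: true`): token embeddings are averaged, ignoring padding positions, to produce one sentence vector. A minimal NumPy sketch of masked mean pooling; the array values are made up for illustration:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over non-padding positions.

    token_embeddings: (seq_len, dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = np.clip(mask.sum(), a_min=1e-9, a_max=None)  # avoid divide-by-zero
    return summed / count

# Toy example: 3 tokens, 2-dim embeddings, last token is padding.
emb = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])
mask = np.array([1, 1, 0])
print(mean_pool(emb, mask))  # averages only the first two rows -> [2. 3.]
```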
README.md ADDED
@@ -0,0 +1,254 @@
+ ---
+ base_model: sentence-transformers/all-mpnet-base-v2
+ library_name: setfit
+ metrics:
+ - accuracy
+ pipeline_tag: text-classification
+ tags:
+ - setfit
+ - sentence-transformers
+ - text-classification
+ - generated_from_setfit_trainer
+ widget:
+ - text: 'Alexis it Doesn’t Have To End Georgiawas invaded by Russia and lost its territoryof
+ Ossetia and Abkhazia. What did USAdo? It condemned the invasion by issuinga statement.
+ George Bush and Putin, bothguests at Beijing Olympic opening ceremony,argued.
+ Georgia appreciates.
+
+ '
+ - text: 'DLI believe she also married Aristotle Onassis, who owned the world''s largest
+ private shipping fleet -- that may have helped finance her other life choices...
+
+ '
+ - text: 'Remember watching this movie with my wife as newly weds in 1995. Wonderful
+ evergreen film. Shahrukh was the son every father wants. And every girl wants
+ as a boyfriend or husband. True love. The relationship between Anupam Kher and
+ his son Shahrukh is pleasant and different than usual Punjabi father-son distant
+ relationships. Music is beautiful! My children love this movie as well. I could
+ watch it anytime-does not seem old or dated. Thank you Yash Chopra, Aditya Chopra,
+ Shahrukh, Kajol and all of the team who brought us this beautiful human drama!
+
+ '
+ - text: 'In the photo of the D''Alesandro family with Pres. Kennedy, I think it is
+ telling that Mrs. D''Alesandro is doing the "adoring" look at Mr. D''Alesandro.
+ Par for the course for a 1961 pol''s wife.Meanwhile their 21-year-old daughter
+ Nancy already has her piercing eyes unabashedly fixed right on Kennedy. You can
+ almost see her thinking, "This powerful man can do great things for the country.
+ How do I get there?"And she did get there -- to within a couple heartbeats of
+ the Presidency, and arguably a position far more powerful and effective over her
+ career than if she''d taken a term in the White House.
+
+ '
+ - text: 'Why is it that grown men feel free to do these sorts of things to young girls
+ and that societies tolerate it? Why is the girl the one who is put on trial instead
+ of the man/men who are responsible for what they did to her? Why is her life
+ ruined? Why are women forced to prove their virtue over and over after they''ve
+ been sexually assaulted by a husband, a relative, a male friend, or a stranger? The
+ worst of all is that the girls, who are too young to marry, can still become pregnant
+ and be forced to carry the pregnancy to term. What does it do to both the children
+ when one is the result of rape? How does one deal with a child who exists through
+ no fault of its own? We know this happens all over the world. It happens here
+ too. Even if we''re a rich country and have "enlightened" attitudes, when we
+ deny women of any age the right to control their reproductive lives, we are showing
+ exactly how little we think of women. On a personal note, my parents didn''t
+ want to have me when they did. When I was 16 my mother told me, in a fit of anger,
+ that if it weren''t for the abortion laws (in the 1950s) I wouldn''t be here. But
+ I was not a child of rape. I can''t imagine how that feels for the victim or
+ the child (who is also a victim). Is the answer education for both boys and girls? Or
+ is it forcing a real change in the attitudes societies have towards half of their
+ population, the half that does much of the caring, loving, and raising of children?
+
+ '
+ inference: true
+ model-index:
+ - name: SetFit with sentence-transformers/all-mpnet-base-v2
+ results:
+ - task:
+ type: text-classification
+ name: Text Classification
+ dataset:
+ name: Unknown
+ type: unknown
+ split: test
+ metrics:
+ - type: accuracy
+ value: 0.9
+ name: Accuracy
+ ---
+
+ # SetFit with sentence-transformers/all-mpnet-base-v2
+
+ This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
+
+ The model has been trained using an efficient few-shot learning technique that involves:
+
+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.
+
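Step 1 works by turning the small labeled set into sentence pairs for contrastive fine-tuning: same-label pairs are pushed together, cross-label pairs apart. A rough, self-contained sketch of that pair construction; the `contrastive_pairs` helper and the toy examples are illustrative, not part of this model:

```python
from itertools import combinations

def contrastive_pairs(examples):
    """Build (text_a, text_b, similarity) pairs from (text, label) examples.

    Same-label pairs get target similarity 1.0, cross-label pairs 0.0 --
    the targets a cosine-similarity loss then fits during fine-tuning.
    """
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs

examples = [("great film", "yes"), ("loved it", "yes"), ("boring plot", "no")]
pairs = contrastive_pairs(examples)
# 3 texts -> 3 pairs: one positive ("yes"/"yes") and two negatives.
```

The real trainer additionally controls how many pairs are drawn (see `num_iterations` and `sampling_strategy` under Training Hyperparameters below).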
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** SetFit
+ - **Sentence Transformer body:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
+ - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
+ - **Maximum Sequence Length:** 384 tokens
+ - **Number of Classes:** 2 classes
+ <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
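At inference, the LogisticRegression head listed above maps the pooled sentence embedding to a probability via a sigmoid over a linear score, then thresholds at 0.5. A toy sketch with made-up 4-dimensional weights (the real head operates on this model's 768-dimensional embeddings):

```python
import numpy as np

labels = ["no", "yes"]                 # matches config_setfit.json's label order
w = np.array([0.8, -0.5, 1.2, 0.3])    # head coefficients (made up)
b = -0.1                               # head intercept (made up)

def predict(embedding: np.ndarray) -> str:
    """Score = sigmoid(w . x + b); threshold at 0.5, as sklearn's predict() does."""
    score = 1.0 / (1.0 + np.exp(-(w @ embedding + b)))
    return labels[int(score >= 0.5)]

print(predict(np.array([1.0, 0.0, 1.0, 0.0])))  # positive logit 1.9 -> "yes"
```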
+ ### Model Sources
+
+ - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
+ - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
+ - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
+
+ ### Model Labels
+ | Label | Examples |
+ |:------|:---------|
+ | yes | <ul><li>'There is an epic, romantic story between Daniel Barenboim and Jacqueline du Pré (one of the greatest cellists of all time) that goes back to the late 1960’s. She was a disciple of the great Russian cellist Mstislav Rostropovich, who was so impressed with her immense talent that he viewed the much younger Ms. du Pré as his equal and successor.On Christmas Eve of 1966 Jacqueline du Pré met Daniel Barenboim in London, promptly converted to Judaism and married him in Israel in 1967. They went on to record exquisite music together and thus became “the golden couple” of classical music at that time.For all the romantics out there, they left a trail of recordings which includes what I consider the best-ever performance of Robert Schumann’s Cello Concerto. The combination of the young Barenboim and du Pré, both not yet 30 years old, and Schumann, the great romantic, was stunning. The cello (a 1712 Stradivarius) seemed to come alive, speaking directly to the heart, Baremboim was equally impeccable, and we all cried from beauty so sublime. I am now 84, and still get misty when I play it.Tragically, du Pré died at the young age of 42, making this chapter of Mr. Baremboim’s life incredibly poignant. The recording lives on and is still available.\n'</li><li>'Santos was once married to a woman, despite being gay. Did he do that to obtain American citizenship?He received campaign money from a businessman, Andrew Intrater, who cultivated close links with a onetime Trump confidant and who is the cousin of a sanctioned Russian oligarch, Russian billionaire Viktor Vekselberg, who has been sanctioned by the U.S. government for his role in the Russian energy industry. according to video footage and court documents.Harbor City, the company Santos worked for and is under investigation for a money scheme, was able to land a $625,000 deposit from a company registered in Mississippi that identifies Intrater as its lone officer, according to an exhibit included in the SEC’s complaint against Harbor City.After Harbor City’s assets were frozen, and with assistance from a fellow former Harbor City employee, Santos in 2021 formed a company, the Devolder Organization, that paid him at least $3.5 million over the next two years, according to Florida business records and financial disclosure forms he filed as a candidate. Santos loaned his campaign more than $700,000 but did not report any income from Harbor City despite having been paid by the company as recently as April 2021.Did that money come from Harbor City’s ponzu scheme or did it come from Russia through Intrater and is Santos in the pocket of Russia?Lots we don’t know, lots to investigate.\n'</li><li>"Yes, indeed, making close friends at work is a wonderful idea. I met a woman at work 48 years ago and we became great friends. She and her husband invited me to dinner one evening to meet an engineer who worked with her husband. They both thought we might like each other. They were certainly right about that. We were engaged 3 months later and married three months after that. We'll be celebrating our 47th wedding anniversary the end of this month. Yup, close friends at work can be wonderful!\n"</li></ul> |
+ | no | <ul><li>'Not surprisingly, this is one of the most astute columns I\'ve read recently about the ubiquity of guns in America and lack of common sense gun control laws. I\'ve experienced a situation where I saw a guy with a holstered gun on his hip walking toward the entry of a grocery where I was intending to go. (There was no indication at all that he was a member of law enforcement.) His whole posture was one of intimidation and when I perceived that I turned right around and left for a different store. Was my reaction fear? Instinctively it certainly was, so I took precaution. And as Bouie points out, I was deprived of my freedom: my choice and ability to shop at that store without fear, and so a forced resignation and imposed requirement that I change my shopping plans. (I think it\'s noteworthy too that the only people I\'ve seen open carry have all been white men. I\'ve never seen a black man open carry or a hispanic man, nor a woman. I think we probably know why: racism. If a black man walked into a store with a gun on his hip, in this country, he would immediately cause panic.)There is no reason why anyone needs to open carry in a public space unless they are law enforcement.Jokes have been made about the hubris of "duck & cover" drills from the 1950s-60s because of threat of nuclear war. Gun proliferation in America causes more death & greater threat to society than the possibility of nuclear war. The 2nd amendment needs to be amended to reflect common sense gun laws.\n'</li><li>'"At the same time, 45 percent said the pornography provided helpful information about sex. L.G.B.T.Q. teenagers, in particular, said it helped them discover more about their sexuality.“\'We have to be careful about saying all porn is good or bad,\' said Emily Rothman, a professor of community health sciences at Boston University. \'There is nuance here.\'”Gross. Somehow, since the beginning of time, young people, especially LGBTQ teens, have managed to discover more about their sexuality without themselves or all of us being inundated with pornography--and what we see today is not just porn but ubiquitous violence. Attitudes like Rothman\'s are why parents are fighting against school libraries offering sexuality explicit books about LGBTQ teens. You won\'t find sexually explicit books about straight sex in those libraries. There\'s no library market for those books. In the name of helping LGBTQ kids "discover" their sexuality, librarians and teachers justify exposing all teens to porn. Too much porn is too much porn. Because of all the porn, girls think it\'s normal for their boyfriends to choke them. Boys masterbate so often that they damage their brains\' abilities to regulate pleasure and wind up impotent. The normalization of porn has negatively impacted how younger people see relationships and marriage. Too much porn has also damaged how girls see themselves as embodied females.Enough. Justifying porn for teens as a tool for discovering sexuality hurts all teens.\n'</li><li>'CT1001 I hope that\'s not a rhetorical question, expecting "you don\'t" for an answer. Because people are doing it. Existing written records can reveal more than they ever intended about the lives of the oppressed... oral material can be looked at seriously... and "archeology" can merge smoothly into history if it involves, for instance, paying as much attention to the remnants of slave quarters, as to the slave-owners quarters... it\'s very appropriate to accuse the people who disappeared the slave quarters, while prettying up the owners residence as an attractive venue for weddings etc, during the hundred years of historical erasure that went on in this country.\n'</li></ul> |
+
+ ## Evaluation
+
+ ### Metrics
+ | Label | Accuracy |
+ |:--------|:---------|
+ | **all** | 0.9 |
+
+ ## Uses
+
+ ### Direct Use for Inference
+
+ First install the SetFit library:
+
+ ```bash
+ pip install setfit
+ ```
+
+ Then you can load this model and run inference.
+
+ ```python
+ from setfit import SetFitModel
+
+ # Download from the 🤗 Hub
+ model = SetFitModel.from_pretrained("davidadamczyk/setfit-model-9")
+ # Run inference
+ preds = model("DLI believe she also married Aristotle Onassis, who owned the world's largest private shipping fleet -- that may have helped finance her other life choices...")
+ ```
+
+ <!--
+ ### Downstream Use
+
+ *List how someone could finetune this model on their own dataset.*
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Set Metrics
+ | Training set | Min | Median | Max |
+ |:-------------|:----|:-------|:----|
+ | Word count | 37 | 170.9 | 276 |
+
+ | Label | Training Sample Count |
+ |:------|:----------------------|
+ | no | 18 |
+ | yes | 22 |
+
+ ### Training Hyperparameters
+ - batch_size: (16, 16)
+ - num_epochs: (1, 1)
+ - max_steps: -1
+ - sampling_strategy: oversampling
+ - num_iterations: 120
+ - body_learning_rate: (2e-05, 2e-05)
+ - head_learning_rate: 2e-05
+ - loss: CosineSimilarityLoss
+ - distance_metric: cosine_distance
+ - margin: 0.25
+ - end_to_end: False
+ - use_amp: False
+ - warmup_proportion: 0.1
+ - l2_weight: 0.01
+ - seed: 42
+ - eval_max_steps: -1
+ - load_best_model_at_end: False
+
+ ### Training Results
+ | Epoch | Step | Training Loss | Validation Loss |
+ |:------:|:----:|:-------------:|:---------------:|
+ | 0.0017 | 1 | 0.5127 | - |
+ | 0.0833 | 50 | 0.2133 | - |
+ | 0.1667 | 100 | 0.0057 | - |
+ | 0.25 | 150 | 0.0002 | - |
+ | 0.3333 | 200 | 0.0001 | - |
+ | 0.4167 | 250 | 0.0001 | - |
+ | 0.5 | 300 | 0.0001 | - |
+ | 0.5833 | 350 | 0.0001 | - |
+ | 0.6667 | 400 | 0.0001 | - |
+ | 0.75 | 450 | 0.0001 | - |
+ | 0.8333 | 500 | 0.0001 | - |
+ | 0.9167 | 550 | 0.0 | - |
+ | 1.0 | 600 | 0.0 | - |
+
+ ### Framework Versions
+ - Python: 3.10.13
+ - SetFit: 1.1.0
+ - Sentence Transformers: 3.0.1
+ - Transformers: 4.45.2
+ - PyTorch: 2.4.0+cu124
+ - Datasets: 2.21.0
+ - Tokenizers: 0.20.0
+
+ ## Citation
+
+ ### BibTeX
+ ```bibtex
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
+ doi = {10.48550/ARXIV.2209.11055},
+ url = {https://arxiv.org/abs/2209.11055},
+ author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+ keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+ title = {Efficient Few-Shot Learning Without Prompts},
+ publisher = {arXiv},
+ year = {2022},
+ copyright = {Creative Commons Attribution 4.0 International}
+ }
+ ```
+
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+ "_name_or_path": "sentence-transformers/all-mpnet-base-v2",
+ "architectures": [
+ "MPNetModel"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "bos_token_id": 0,
+ "eos_token_id": 2,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "layer_norm_eps": 1e-05,
+ "max_position_embeddings": 514,
+ "model_type": "mpnet",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "pad_token_id": 1,
+ "relative_attention_num_buckets": 32,
+ "torch_dtype": "float32",
+ "transformers_version": "4.45.2",
+ "vocab_size": 30527
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "__version__": {
+ "sentence_transformers": "3.0.1",
+ "transformers": "4.45.2",
+ "pytorch": "2.4.0+cu124"
+ },
+ "prompts": {},
+ "default_prompt_name": null,
+ "similarity_fn_name": null
+ }
config_setfit.json ADDED
@@ -0,0 +1,7 @@
+ {
+ "normalize_embeddings": false,
+ "labels": [
+ "no",
+ "yes"
+ ]
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74d26fd50cbee2bee839ecdbadbbb6d26ba029db581f026d424815af7cad1d29
+ size 437967672
model_head.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dbe979e5a702ad716a456552e03da527b0a8b19510baed312ff485681f79f7f6
+ size 7023
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+ {
+ "idx": 0,
+ "name": "0",
+ "path": "",
+ "type": "sentence_transformers.models.Transformer"
+ },
+ {
+ "idx": 1,
+ "name": "1",
+ "path": "1_Pooling",
+ "type": "sentence_transformers.models.Pooling"
+ },
+ {
+ "idx": 2,
+ "name": "2",
+ "path": "2_Normalize",
+ "type": "sentence_transformers.models.Normalize"
+ }
+ ]
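The module list above chains Transformer → Pooling → Normalize. The final `Normalize` step L2-normalizes the pooled sentence vector, so a dot product between two embeddings equals their cosine similarity. A minimal sketch of that normalization:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length, as the 2_Normalize module does."""
    return v / np.linalg.norm(v)

v = l2_normalize(np.array([3.0, 4.0]))
print(v)                  # [0.6 0.8]
print(np.linalg.norm(v))  # 1.0
```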
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "max_seq_length": 384,
+ "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+ "bos_token": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "cls_token": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "mask_token": {
+ "content": "<mask>",
+ "lstrip": true,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "sep_token": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "[UNK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,72 @@
+ {
+ "added_tokens_decoder": {
+ "0": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "1": {
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "2": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "3": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "104": {
+ "content": "[UNK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "30526": {
+ "content": "<mask>",
+ "lstrip": true,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "bos_token": "<s>",
+ "clean_up_tokenization_spaces": false,
+ "cls_token": "<s>",
+ "do_lower_case": true,
+ "eos_token": "</s>",
+ "mask_token": "<mask>",
+ "max_length": 128,
+ "model_max_length": 384,
+ "pad_to_multiple_of": null,
+ "pad_token": "<pad>",
+ "pad_token_type_id": 0,
+ "padding_side": "right",
+ "sep_token": "</s>",
+ "stride": 0,
+ "strip_accents": null,
+ "tokenize_chinese_chars": true,
+ "tokenizer_class": "MPNetTokenizer",
+ "truncation_side": "right",
+ "truncation_strategy": "longest_first",
+ "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff