davidadamczyk committed
Commit 27ed5ec
Parent: 20e5251

Add SetFit model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
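
Only `pooling_mode_mean_tokens` is enabled here, so sentence embeddings are the mask-aware mean of the token embeddings. A minimal sketch of that computation (not the library's own code, which lives in `sentence_transformers.models.Pooling`):

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # (batch, 768)
    counts = mask.sum(dim=1).clamp(min=1e-9)       # guard against empty masks
    return summed / counts
```
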
README.md ADDED
@@ -0,0 +1,263 @@
---
base_model: sentence-transformers/all-mpnet-base-v2
library_name: setfit
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- setfit
- sentence-transformers
- text-classification
- generated_from_setfit_trainer
widget:
- text: 'It might have been more fun for everyone if the Thruway Authority had given
    individual contracts for each rest stop, with the stipulation that each reflect
    some local regional character. This could interest travelers to maybe get off
    at the next exit and explore some local places. With every stop the same, the
    traveler might as well be in Kansas.

    '
- text: 'I was scammed by a fake retailer appearing on a Google search for a popular
    product, a Patagonia backpack, offered at a significant discount. The website
    seemed legitimate; I was given a choice of colors and sizes. The scammer provided
    a tracking number from China. I have bought discounted items before from China
    that are sold on eBay and are sent by Chinese parcel post, for which tracking
    information is scant. When whatever item that was mailed finally arrived at a
    completely different address in another state several weeks later, I alerted my
    credit card company of the fraud and was refunded the amount, despite the time
    frame it took to determine the scam.

    '
- text: 'From Matt Stoller''s newsletter (edited for flow):LastPass was purchased
    by two private equity firms, Francisco Partners and Evergreen Coast Capital Corp.
    Typically, PE firms raise prices, lower quality, harm workers, and reduce customer
    service. They then decided to charge customers $36 to access the cumbersome passwords.
    This particular pricing move sparked a backlash from customers, and the two PE
    firms pledged to spin off the company and make it independent. But that hasn’t
    happened.Poor quality is common within private equity owned software firms, which
    means cybersecurity vulnerabilities quickly follow. We’ve seen this with PE-owned
    software firms facilitating the hacking of the NYC subway, nuclear weapons facilities,
    and criminal ransomware. And now it’s happened with LastPass. Lovely.

    '
- text: 'Maybe for the ''come latelies'' this is a big storm, but for folks who have
    lived there, this is not something new.When El Nino dumps in the Sierras...THAT,
    is a snow Storm! In ''82-83 the area near Squaw Valley got 800 inches! ''Dumps''
    of 4-6+ feet happened about about 2x a month...we were living like snow moles,
    mimicing the great snow storms of the early 20th century - you may have seen these
    in historical photos.Homeowners were shoveling 3-5 feet of snow off their roofs,
    to prevent total collapse!We always had a good hearty Laugh at those CA flatlanders,
    driving to Tahoe on I 80 in the ''rain'' ties, with flakes like silver dollars,
    blotting out visibility.Remember, was it last winter when I 80 was closed and
    all the hip techies turned to their google maps and ended up on closed roads,
    in the boondocks? Like I 80 is closed and some 1 1/2 rural lane road, was going
    to be OPEN??? Hellarious!Of course, down in the flatlands, we''ve seen how folks
    THOUGHT they had ''amphibious'' cars...Any idea how folks became so....lame? (BTW:
    Mt Baker in WA has the record of 1100 inches of snow....keeping the smaller Mt
    St Helens-like volcano, sleeping!) Winter is great, if you respect Mother Nature;
    soooo many havent a clue, putting 1st Responders, at great risk! And 4 wheel drive,
    CAN keep you going straight, at a CAUTIOUS speed...not good, on icy curves!!

    '
- text: 'Ethan. The results of that great agricultural revolution are in and not much
    of it is admirable. More Food = More People = More Fossil Fuels = More Toxic Pollution
    = More Disease = More Greenhouse Gases = More Climate Change = end-of-the-line.
    Human population was able to grow as rapidly as fossil fuel inputs were increased.
    But now, we must reduce usage of fossil fuels and the resulting population logically
    goes in the same direction. All the green technologies are for naught. It comes
    down to fossil fuels.

    '
inference: true
model-index:
- name: SetFit with sentence-transformers/all-mpnet-base-v2
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Unknown
      type: unknown
      split: test
    metrics:
    - type: accuracy
      value: 0.8
      name: Accuracy
---

# SetFit with sentence-transformers/all-mpnet-base-v2

This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.

The model has been trained using an efficient few-shot learning technique that involves:

1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
2. Training a classification head with features from the fine-tuned Sentence Transformer (sketched below).
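
The second step can be made concrete with a minimal sketch: embed labeled texts with the body, then fit an ordinary scikit-learn `LogisticRegression` on those embeddings. The texts and labels here are hypothetical stand-ins, not this model's actual training data.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Step 2 in miniature: embed texts with the (ideally fine-tuned) body...
body = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
texts = ["hypothetical example text A", "hypothetical example text B"]
labels = ["yes", "no"]
features = body.encode(texts)  # shape (n_samples, 768)

# ...then fit the classification head on the resulting embeddings.
head = LogisticRegression()
head.fit(features, labels)
print(head.predict(body.encode(["another hypothetical text"])))
```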

## Model Details

### Model Description
- **Model Type:** SetFit
- **Sentence Transformer body:** [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
- **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
- **Maximum Sequence Length:** 384 tokens
- **Number of Classes:** 2 classes
<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
- **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
- **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)

### Model Labels
| Label | Examples |
|:------|:---------|
| yes   | <ul><li>"Amazon you so carefully forget that the AWS cloud system is Amazon's big revenue stream and it is part of the retail side. Not clever at all to call it a 'book store' but then you are not a techie of any kind!\n"</li><li>"Happened to me with On Cloud, but I didn't see an ad but rather I typed in the shoe name in Google and a site came up that looked right. Thankfully I used PayPal and when the vendor came up as a person's name, I immediately knew it was wrong. PayPal was great about refunding me the funds after an investigation.\n"</li><li>"ZR It won't be long before people lose ownership entirely of all of their digital content, including their precious family photos, as the digital storage devices and formats will be gone. The plan is to have exclusive Cloud storage and streaming, which everyone will pay for and not own. I regret digitizing the majority of my old photographs, although I did it to preserve those very old ones (family history photos over 100 years old) because they were beginning to fade. As for slides, where can one even get them hand-developed anymore? The machines used for slides are terrible.\n"</li></ul> |
| no    | <ul><li>"Have you considered how data science can increase agricultural yields? How computer science makes surgeries safer? How a bunch of programmers created the platform that you're commenting on right now? someone creates and updates the software architects used to do design. Someone, somewhere uses a computer program to design more stable supply chains. Do I think hands-on work is important? Absolutely! But don't act like all those programmers are out there just working for big tech and banks. They work in manufacturing, in agriculture, in engineering, etc. doing real work that benefits real people in quantifiable ways.\n"</li><li>'Huh. A white guy "hailed as his era\'s most brilliant and influential chef(s)" uses unpaid labor whom he regularly physically and verbally abuses. This reeks of modern-day indentured servitude. I don\'t hesitate to use this term because too many of the staffers believe they must compete this way to obtain future employment.The deck is stacked in several ways. About being deified as a luminary these days, and I\'ve seen it so many times, especially from a gushing, slavering media: heck yeah! If you don\'t pay your staff any wage, or minimal wage (restaurants! universities! really awesome internships!), force them to work twice as many hours as they should (Twitter under Musk, perhaps), THEN OF COURSE YOU WILL AMAZE EVERYONE WITH YOUR OUTPUT. You will probably outperform other people in your field, who wouldn\'t dream of behaving in such a sordid/criminal manner. Your staff will be frightened of you. You will have a much better operating budget than, say, a boss/owner who believes in ACTUALLY PAYING PEOPLE. Stop drooling over these people with monstrous egos and no concern for the workers who make them successful.\n'</li><li>'That\'s the name of the book? "Where is My Flying Car?"I know it\'s a metaphor, but oh, brother.Some years back before I got too old to seriously contemplate taking flying lessons, there was a promising \'flying car\' in development. It was well on its way to market.I imagined flying from my local airport to my brother\'s local airport making what is always a four hour driving trip about an hour and a half. Yes, that would be lovely.Now let\'s think about logistics, something these \'visionaries\' never think about. Why do they never think about these things? Because it\'s hard and all they want is their plane-car. Everyone needs to get out of the way of their plane-car. They never want to acknowledge the mayhem resulting from parts and pieces of plane-cars dropping out of the sky, people\'s peace and quiet destroyed by plane-cars flying overhead all the time, etc., etc.And, they\'d be the first to complain that "the government needs to do something about this immediately!"I worked in commercial nuclear power as a youth. We could have made it \'safe enough,\' but that would have taken a massive amount of international cooperation. We also needed oil companies to accept the need for change. We also needed people to accept \'safe enough.\'I am a scientist and I lean progressive (a la Bernie.) Please do not label me "ergophobic." It is not people like me who brought us here.I don\'t know who you are trying to convince or even what your point is.\n'</li></ul> |

## Evaluation

### Metrics
| Label   | Accuracy |
|:--------|:---------|
| **all** | 0.8      |

## Uses

### Direct Use for Inference

First install the SetFit library:

```bash
pip install setfit
```

Then you can load this model and run inference.

```python
from setfit import SetFitModel

# Download from the 🤗 Hub
model = SetFitModel.from_pretrained("davidadamczyk/setfit-model-3")
# Run inference
preds = model("It might have been more fun for everyone if the Thruway Authority had given individual contracts for each rest stop, with the stipulation that each reflect some local regional character. This could interest travelers to maybe get off at the next exit and explore some local places. With every stop the same, the traveler might as well be in Kansas.")
```
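
`predict` also accepts a batch, and the LogisticRegression head can expose class probabilities; a short follow-on sketch reusing `model` from the block above, with hypothetical inputs (`predict_proba` is assumed available, as in recent setfit releases):

```python
# Batch inference: one predicted label per input text.
texts = [
    "hypothetical comment about an online scam",
    "hypothetical comment about the weather",
]
print(model.predict(texts))        # e.g. ['yes', 'no']

# Probability estimates from the LogisticRegression head (rows sum to 1).
print(model.predict_proba(texts))
```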

<!--
### Downstream Use

*List how someone could finetune this model on their own dataset.*
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Set Metrics
| Training set | Min | Median | Max |
|:-------------|:----|:-------|:----|
| Word count   | 43  | 140.9  | 262 |

| Label | Training Sample Count |
|:------|:----------------------|
| no    | 18                    |
| yes   | 22                    |

### Training Hyperparameters
- batch_size: (16, 16)
- num_epochs: (1, 1)
- max_steps: -1
- sampling_strategy: oversampling
- num_iterations: 120
- body_learning_rate: (2e-05, 2e-05)
- head_learning_rate: 2e-05
- loss: CosineSimilarityLoss
- distance_metric: cosine_distance
- margin: 0.25
- end_to_end: False
- use_amp: False
- warmup_proportion: 0.1
- l2_weight: 0.01
- seed: 42
- eval_max_steps: -1
- load_best_model_at_end: False
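
These values map directly onto setfit's `TrainingArguments`; a hedged sketch of wiring them up follows. The two-row `train_dataset` is a hypothetical stand-in for the real 40-example training set, which is not published here.

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# Hypothetical stand-in for the actual few-shot training data.
train_dataset = Dataset.from_dict({
    "text": ["hypothetical example text A", "hypothetical example text B"],
    "label": ["yes", "no"],
})

# The hyperparameters listed above, passed through as keyword arguments.
args = TrainingArguments(
    batch_size=(16, 16),
    num_epochs=(1, 1),
    sampling_strategy="oversampling",
    num_iterations=120,
    body_learning_rate=(2e-05, 2e-05),
    head_learning_rate=2e-05,
    warmup_proportion=0.1,
    l2_weight=0.01,
    seed=42,
)

model = SetFitModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```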

### Training Results
| Epoch  | Step | Training Loss | Validation Loss |
|:------:|:----:|:-------------:|:---------------:|
| 0.0017 | 1    | 0.4637        | -               |
| 0.0833 | 50   | 0.2019        | -               |
| 0.1667 | 100  | 0.0063        | -               |
| 0.25   | 150  | 0.0003        | -               |
| 0.3333 | 200  | 0.0002        | -               |
| 0.4167 | 250  | 0.0001        | -               |
| 0.5    | 300  | 0.0001        | -               |
| 0.5833 | 350  | 0.0001        | -               |
| 0.6667 | 400  | 0.0001        | -               |
| 0.75   | 450  | 0.0001        | -               |
| 0.8333 | 500  | 0.0001        | -               |
| 0.9167 | 550  | 0.0001        | -               |
| 1.0    | 600  | 0.0001        | -               |

### Framework Versions
- Python: 3.10.13
- SetFit: 1.1.0
- Sentence Transformers: 3.0.1
- Transformers: 4.45.2
- PyTorch: 2.4.0+cu124
- Datasets: 2.21.0
- Tokenizers: 0.20.0

## Citation

### BibTeX
```bibtex
@article{https://doi.org/10.48550/arxiv.2209.11055,
    doi = {10.48550/ARXIV.2209.11055},
    url = {https://arxiv.org/abs/2209.11055},
    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
    title = {Efficient Few-Shot Learning Without Prompts},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution 4.0 International}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->
config.json ADDED
@@ -0,0 +1,24 @@
{
  "_name_or_path": "sentence-transformers/all-mpnet-base-v2",
  "architectures": [
    "MPNetModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "mpnet",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "relative_attention_num_buckets": 32,
  "torch_dtype": "float32",
  "transformers_version": "4.45.2",
  "vocab_size": 30527
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
{
  "__version__": {
    "sentence_transformers": "3.0.1",
    "transformers": "4.45.2",
    "pytorch": "2.4.0+cu124"
  },
  "prompts": {},
  "default_prompt_name": null,
  "similarity_fn_name": null
}
config_setfit.json ADDED
@@ -0,0 +1,7 @@
{
  "normalize_embeddings": false,
  "labels": [
    "no",
    "yes"
  ]
}
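
The `labels` list is what lets the model return these strings at inference time instead of raw class indices; a quick check (the `labels` attribute is assumed from recent setfit releases):

```python
from setfit import SetFitModel

model = SetFitModel.from_pretrained("davidadamczyk/setfit-model-3")
print(model.labels)  # ['no', 'yes']
# predict() maps the head's integer outputs through this list:
print(model.predict(["hypothetical example comment"]))  # -> ['no'] or ['yes']
```
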
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2ea6e97f65b9168ec1a1391f8c57b340d88a63c4484ae99937ee014e5337786a
size 437967672
model_head.pkl ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cc0abe982a6de9bc744bd3cc2821a13f235af57d4c39c61f9ef07dd043c388a6
size 7023
modules.json ADDED
@@ -0,0 +1,20 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  },
  {
    "idx": 2,
    "name": "2",
    "path": "2_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
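
This file declares the Sentence Transformer pipeline as three modules: Transformer → Pooling → Normalize. The final Normalize step makes embeddings unit-length, which a small sketch can verify (loading the embedding body directly with sentence-transformers should work, since these module files sit at the repo root):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the three-module pipeline declared in modules.json.
encoder = SentenceTransformer("davidadamczyk/setfit-model-3")
emb = encoder.encode(["check the Normalize module"])
print(emb.shape)                    # (1, 768)
print(np.linalg.norm(emb, axis=1))  # ~1.0, thanks to 2_Normalize
```
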
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 384,
  "do_lower_case": false
}
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,72 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "104": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "30526": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
  "do_lower_case": true,
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "max_length": 128,
  "model_max_length": 384,
  "pad_to_multiple_of": null,
  "pad_token": "<pad>",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "</s>",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "MPNetTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff