Sfekih committed
Commit f74e26f
1 parent: dbd6d67

Add SetFit model

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
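
Note on the pooling config above: only `pooling_mode_mean_tokens` is enabled, so the sentence embedding is presumably the average of the token vectors that are not padding. The following is a minimal pure-Python sketch of that idea, not the Sentence Transformers implementation; the function name `mean_pool` and the toy 2-dimensional vectors are assumptions made for illustration (the real vectors have 768 dimensions).

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in summed]


# Toy input: two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The padding vector is ignored because its mask entry is 0, which is why the result is the mean of the first two rows only.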
README.md ADDED
@@ -0,0 +1,212 @@
+ ---
+ base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
+ library_name: setfit
+ metrics:
+ - f1
+ pipeline_tag: text-classification
+ tags:
+ - setfit
+ - sentence-transformers
+ - text-classification
+ - generated_from_setfit_trainer
+ widget:
+ - text: Politically Motivated Murders Increased by 80% in October On November 7, news
+     outlets reported that murders due to political violence in Colombia increased
+     in October by 80%, according to the Resource Center for the Analysis of Conflicts.
+ - text: Ils auraient menacé la femme d’un VDP et fouiller leur avant de repartir avec
+     une arme.
+ - text: En rappel, cette décision de la réouverture des points de vente de céréales
+     au profit des personnes vulnerables pendant le premier trimestre de 2021, a été
+     prise par le Conseil des ministres du 24 février 2021.
+ - text: IRC clinics have seen double the number of patients this month due to increasing
+     pressure on other facilities where there are PPE shortages or a reduction in health
+     staff who have had to self-isolate as a precaution.
+ - text: Según los hallazgos de las instituciones que participaron en la misión recientemente,
+     se conoció que las comunidades que continúan en el resguardo están en riesgo de
+     desplazamiento hacia Montería, debido a la continuidad de combates, operaciones
+     militares y presencia activa del GDO.
+ inference: true
+ model-index:
+ - name: SetFit with sentence-transformers/paraphrase-multilingual-mpnet-base-v2
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: Unknown
+       type: unknown
+       split: test
+     metrics:
+     - type: f1
+       value: 0.7804878048780488
+       name: F1
+ ---
+
+ # SetFit with sentence-transformers/paraphrase-multilingual-mpnet-base-v2
+
+ This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
+
+ The model has been trained using an efficient few-shot learning technique that involves:
+
+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.
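
Note on step 1 above: contrastive fine-tuning needs sentence pairs with a similarity target, built from the labelled examples. The sketch below shows one simplified way such pairs could be sampled (same label gives target 1.0, different label gives target 0.0). It is an illustration under stated assumptions, not SetFit's own sampler; `make_contrastive_pairs` and the four toy sentences are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str, float]


def make_contrastive_pairs(
    texts: Sequence[str], labels: Sequence[int], num_iterations: int, seed: int = 42
) -> List[Pair]:
    """For every sentence, draw one same-label and one other-label partner per iteration."""
    rng = random.Random(seed)
    pairs: List[Pair] = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            same = [t for j, (t, l) in enumerate(zip(texts, labels)) if l == label and j != i]
            other = [t for t, l in zip(texts, labels) if l != label]
            if same:
                pairs.append((text, rng.choice(same), 1.0))  # positive pair
            if other:
                pairs.append((text, rng.choice(other), 0.0))  # negative pair
    return pairs


# Hypothetical toy data, two sentences per class.
texts = ["attack reported", "road blocked", "clinic reopened", "aid delivered"]
labels = [1, 1, 0, 0]
pairs = make_contrastive_pairs(texts, labels, num_iterations=2)
print(len(pairs))  # 16: 4 sentences x 2 iterations x (1 positive + 1 negative)
```

A loss such as the `CosineSimilarityLoss` listed in the hyperparameters would then push the cosine similarity of each pair's embeddings toward its target.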
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** SetFit
+ - **Sentence Transformer body:** [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)
+ - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
+ - **Maximum Sequence Length:** 128 tokens
+ - **Number of Classes:** 2 classes
+ <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
+ - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
+ - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
+
+ ### Model Labels
+ | Label | Examples |
+ |:------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | 1 | <ul><li>'Meanwhile in Mentao camp, access had been cut off for more than a year, following a series of attacks.'</li><li>'La Fundación Desarrollo y Paz (Fundepaz) hizo pública este domingo, la amenaza que recibió el municipio de Cumbales, departamento de Nariño, por un supuesto panfleto atribuido al ELN en el que lanzan varias amenazas en contra de la población civil y ordenan restricciones a la movilidad.'</li><li>'Most focal points (82%) continue to report that living conditions have worsened for their communities since the beginning of the pandemic, more so in low-density areas (96%).All focal points in SDF and 94% in GoS areas say this, but in NSAG/TBAF areas, just over half report little change in people’s ability to meet their needs, and only 42% note a deterioration.'</li></ul> |
+ | 0 | <ul><li>'The murders increased from 10 in August to 18 in October.'</li><li>'Cette tendance s’explique par le fait que ces trois (3) régions ont connu des attaques terroristes et en subissent les conséquences de l’insécurité.'</li><li>'In addition to Protection, Child Protection, and GBV focal points, the Protection Sector has also activated its Protection Emergency Response Units to disseminate messages and ensure that the most vulnerable are reached.'</li></ul> |
+
+ ## Evaluation
+
+ ### Metrics
+ | Label | F1 |
+ |:--------|:-------|
+ | **all** | 0.7805 |
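
Note on the metric above: the card reports a single F1 value and does not say which averaging was used. As a reminder of what the number measures, here is a minimal sketch of binary F1 (harmonic mean of precision and recall for the positive class). The function name `f1_binary` and the five toy labels are assumptions for illustration and are unrelated to the model's actual test set.

```python
from typing import Sequence


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for one positive class: 2 * precision * recall / (precision + recall)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(round(f1_binary(y_true, y_pred), 4))  # 0.6667
```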
+
+ ## Uses
+
+ ### Direct Use for Inference
+
+ First install the SetFit library:
+
+ ```bash
+ pip install setfit
+ ```
+
+ Then you can load this model and run inference.
+
+ ```python
+ from setfit import SetFitModel
+
+ # Download from the 🤗 Hub
+ model = SetFitModel.from_pretrained("Sfekih/sentence_independancy_model")
+ # Run inference
+ preds = model("Ils auraient menacé la femme d’un VDP et fouiller leur avant de repartir avec une arme.")
+ ```
+
+ <!--
+ ### Downstream Use
+
+ *List how someone could finetune this model on their own dataset.*
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Set Metrics
+ | Training set | Min | Median | Max |
+ |:-------------|:----|:--------|:----|
+ | Word count | 3 | 24.4407 | 78 |
+
+ | Label | Training Sample Count |
+ |:------|:----------------------|
+ | 0 | 59 |
+ | 1 | 59 |
+
+ ### Training Hyperparameters
+ - batch_size: (32, 32)
+ - num_epochs: (1, 1)
+ - max_steps: -1
+ - sampling_strategy: oversampling
+ - num_iterations: 35
+ - body_learning_rate: (2e-05, 2e-05)
+ - head_learning_rate: 2e-05
+ - loss: CosineSimilarityLoss
+ - distance_metric: cosine_distance
+ - margin: 0.25
+ - end_to_end: False
+ - use_amp: False
+ - warmup_proportion: 0.1
+ - l2_weight: 0.01
+ - seed: 42
+ - eval_max_steps: -1
+ - load_best_model_at_end: False
+
+ ### Training Results
+ | Epoch | Step | Training Loss | Validation Loss |
+ |:------:|:----:|:-------------:|:---------------:|
+ | 0.0039 | 1 | 0.2854 | - |
+ | 0.1931 | 50 | 0.2645 | - |
+ | 0.3861 | 100 | 0.0945 | - |
+ | 0.5792 | 150 | 0.0022 | - |
+ | 0.7722 | 200 | 0.0008 | - |
+ | 0.9653 | 250 | 0.0006 | - |
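
Note on the Epoch column above: the fractions can be reproduced from the listed hyperparameters. The sketch assumes that `num_iterations` yields one positive and one negative pair per training sample per iteration (so `2 * num_iterations * n_samples` pairs in total); that pairing rule is an assumption here, but it matches every row of the table.

```python
import math

# Values taken from the model card.
n_samples = 59 + 59   # training samples for labels 0 and 1
num_iterations = 35
batch_size = 32

# Assumption: one positive and one negative pair per sample per iteration.
total_pairs = 2 * num_iterations * n_samples
steps_per_epoch = math.ceil(total_pairs / batch_size)

print(total_pairs, steps_per_epoch)  # 8260 259
for step in (1, 50, 100, 150, 200, 250):
    print(step, round(step / steps_per_epoch, 4))
```

With 259 steps per epoch, step 250 corresponds to epoch 0.9653, so the single epoch configured by `num_epochs: (1, 1)` ends shortly after the last logged row.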
+
+ ### Framework Versions
+ - Python: 3.10.12
+ - SetFit: 1.1.0
+ - Sentence Transformers: 3.1.1
+ - Transformers: 4.44.2
+ - PyTorch: 2.4.1+cu121
+ - Datasets: 3.0.1
+ - Tokenizers: 0.19.1
+
+ ## Citation
+
+ ### BibTeX
+ ```bibtex
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
+     doi = {10.48550/ARXIV.2209.11055},
+     url = {https://arxiv.org/abs/2209.11055},
+     author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+     keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+     title = {Efficient Few-Shot Learning Without Prompts},
+     publisher = {arXiv},
+     year = {2022},
+     copyright = {Creative Commons Attribution 4.0 International}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "_name_or_path": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
+   "architectures": [
+     "XLMRobertaModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.44.2",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
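
Note on the config above: the dimensions are enough to estimate the parameter count and, with `torch_dtype` float32 (4 bytes per weight), the expected weight size on disk. The sketch assumes a standard BERT-style encoder layout (embeddings with LayerNorm, 12 layers of self-attention plus feed-forward, and a pooler layer); `count_parameters` is a hypothetical helper, not a Transformers API.

```python
def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def count_parameters(
    vocab_size: int = 250002,
    hidden: int = 768,
    intermediate: int = 3072,
    layers: int = 12,
    max_positions: int = 514,
    type_vocab: int = 1,
) -> int:
    layer_norm = 2 * hidden  # scale and shift vectors
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
    pooler = linear(hidden, hidden)  # assumption: the pooler layer is saved too
    return embeddings + layers * (attention + feed_forward) + pooler


params = count_parameters()
print(params)      # 278043648
print(params * 4)  # 1112174592 bytes in float32
```

The float32 estimate is 22,504 bytes below the 1,112,197,096 bytes recorded for `model.safetensors` in this commit; the remainder is plausibly the file's header and metadata, though the diff itself does not confirm that.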
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.1.1",
+     "transformers": "4.44.2",
+     "pytorch": "2.4.1+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
config_setfit.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "labels": null,
+   "normalize_embeddings": false
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:947addd35bd056ba5d341ece63ea1fbd86968e47c186d47750f829202716a775
+ size 1112197096
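
Note on the three lines above: they are a Git LFS pointer, not the weights themselves. Each line is a `key value` pair, and the real file is fetched by its hash. A minimal parsing sketch follows; `parse_lfs_pointer` is a hypothetical helper written for this note, not part of any Git LFS tooling.

```python
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Split a Git LFS pointer file into its version, hash and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:947addd35bd056ba5d341ece63ea1fbd86968e47c186d47750f829202716a775
size 1112197096
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"])  # sha256 1112197096
```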
model_head.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:64a1325c4ba131c26e47376887312c005bcb3e827e60eadef8ec4ff00e3828bb
+ size 7007
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 128,
+   "do_lower_case": false
+ }
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cad551d5600a84242d0973327029452a1e3672ba6313c2a3c3d69c4310e12719
+ size 17082987
tokenizer_config.json ADDED
@@ -0,0 +1,61 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "max_length": 128,
+   "model_max_length": 128,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }
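
Note on the tokenizer config above: the special-token ids (`<s>` = 0, `<pad>` = 1, `</s>` = 2), `model_max_length: 128`, and the right-sided truncation and padding together describe how one sequence is laid out. The sketch below illustrates that layout on already-tokenized ids. It is a simplification: `build_input` is a hypothetical helper, the token ids in the example are made up, and a real tokenizer only pads or truncates when asked to.

```python
from typing import List, Tuple

BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # ids from added_tokens_decoder
MAX_LEN = 128                     # model_max_length


def build_input(token_ids: List[int], max_len: int = MAX_LEN) -> Tuple[List[int], List[int]]:
    """Wrap ids as <s> ... </s>, truncate on the right, then pad on the right."""
    body = token_ids[: max_len - 2]  # leave room for <s> and </s>
    ids = [BOS_ID] + body + [EOS_ID]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding


ids, mask = build_input([10, 11, 12])  # hypothetical token ids
print(ids[:6], sum(mask))              # [0, 10, 11, 12, 2, 1] 5
```

The attention mask produced here is what lets mean pooling skip the padded positions.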