eliasalbouzidi committed on
Commit 095f7f1 · verified · 1 Parent(s): 58c86de

Update README.md

Files changed (1)
  1. README.md +68 -14
README.md CHANGED
@@ -12,11 +12,15 @@ widget:
12
  example_title: Nsfw
13
  - text: A mass shooting
14
  example_title: Nsfw
 
15
  license: apache-2.0
16
  language:
17
  - en
18
  metrics:
19
  - f1
20
  pipeline_tag: text-classification
21
  tags:
22
  - Transformers
@@ -26,7 +30,7 @@ tags:
26
  - distilbert
27
  ---
28
 
29
- # Model Card for Model ID
30
 
31
  <!-- Provide a quick summary of what the model is/does. -->
32
 
@@ -36,7 +40,7 @@ The model was trained using a dataset containing 190,000 labeled text samples, d
36
 
37
The model is based on the DistilBERT-base model.
38
 
39
- In terms of performance, the model has achieved a score of 0.97 for F1.
40
 
41
  To improve the performance of the model, it is necessary to preprocess the input text. You can refer to the preprocess function in the app.py file in the following space: <https://huggingface.co/spaces/eliasalbouzidi/distilbert-nsfw-text-classifier>.
42
  ### Model Description
@@ -44,26 +48,77 @@ To improve the performance of the model, it is necessary to preprocess the input
44
  The model can be used directly to classify text into one of the two classes. It takes in a string of text as input and outputs a probability distribution over the two classes. The class with the highest probability is selected as the predicted class.
45
 
46
 
47
-
48
  - **Developed by:** Centrale Supélec Students
49
- **Model type:** DistilBERT-based text classifier (60M parameters)
50
  - **Language(s) (NLP):** English
51
  - **License:** apache-2.0
52
 
53
-
54
- ### Training Procedure
55
-
56
- The model was trained utilizing the Hugging Face Transformers library. The training approach involved fine-tuning the DistilBERT-base model. To optimize memory usage and accelerate training, mixed precision FP16 was used.
57
-
58
  ### Training Data
59
-
60
The training data for fine-tuning the text classification model consists of a large corpus of text labeled with one of the two classes: "safe" and "nsfw". The dataset contains a total of 190,000 examples, which are distributed as follows:
61
 
62
- 100,000 examples labeled as "safe"
63
-
64
- 90,000 examples labeled as "nsfw"
65
 
66
- The data was preprocessed, for example, to remove numbers, punctuation, URLs, etc.
67
 
68
  ## Uses
69
 
@@ -80,7 +135,6 @@ The model may exhibit biases based on the training data used. It may not perform
80
 
81
  ### Recommendations
82
 
83
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
84
 
85
  Users should be aware of the limitations and biases of the model and use it accordingly. They should also be prepared to handle false positives and false negatives. It is recommended to fine-tune the model for specific downstream tasks and to evaluate its performance on relevant datasets.
86
 
 
12
  example_title: Nsfw
13
  - text: A mass shooting
14
  example_title: Nsfw
15
+ base_model: distilbert-base-uncased
16
  license: apache-2.0
17
  language:
18
  - en
19
  metrics:
20
  - f1
21
+ - accuracy
22
+ - precision
23
+ - recall
24
  pipeline_tag: text-classification
25
  tags:
26
  - Transformers
 
30
  - distilbert
31
  ---
32
 
33
+ # Model Card
34
 
35
  <!-- Provide a quick summary of what the model is/does. -->
36
 
 
40
 
41
The model is based on the DistilBERT-base model.
42
 
43
+ In terms of performance, the model achieves an F1 score of 0.974 (evaluated on 40K examples).
44
 
45
  To improve the performance of the model, it is necessary to preprocess the input text. You can refer to the preprocess function in the app.py file in the following space: <https://huggingface.co/spaces/eliasalbouzidi/distilbert-nsfw-text-classifier>.
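For illustration only, here is a minimal sketch of such a cleanup step, assuming the kind of preprocessing described elsewhere in this card (removing numbers, punctuation, and URLs); the authoritative implementation is the preprocess function in the space's app.py:

```python
import re

def preprocess(text: str) -> str:
    """Illustrative cleanup only; see app.py in the linked space for the real function."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"\d+", " ", text)               # drop numbers
    text = re.sub(r"[^\w\s]", " ", text)           # drop punctuation
    return re.sub(r"\s+", " ", text).strip()       # collapse extra whitespace
```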
46
  ### Model Description
 
48
  The model can be used directly to classify text into one of the two classes. It takes in a string of text as input and outputs a probability distribution over the two classes. The class with the highest probability is selected as the predicted class.
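As a quick usage sketch (assuming the model is published under the same id as the space linked above), the standard Transformers pipeline returns the per-class scores described here:

```python
from transformers import pipeline

# Model id assumed to match the space name above.
classifier = pipeline(
    "text-classification",
    model="eliasalbouzidi/distilbert-nsfw-text-classifier",
)

print(classifier("A family hiking trip in the mountains"))
# Expected output shape: [{'label': 'safe', 'score': ...}]
```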
49
 
50
 
 
51
  - **Developed by:** Centrale Supélec Students
52
- **Model type:** DistilBERT-based text classifier (60M parameters)
53
  - **Language(s) (NLP):** English
54
  - **License:** apache-2.0
55
 
 
56
  ### Training Data
 
57
The training data for fine-tuning the text classification model consists of a large corpus of text labeled with one of the two classes: "safe" and "nsfw". The dataset contains a total of 190,000 examples, which are distributed as follows:
58
 
59
+ 117,000 examples labeled as "safe"
60
+
61
+ 63,000 examples labeled as "nsfw"
62
+
63
+ It was assembled by scraping data from the web and utilizing existing open-source datasets. A significant portion of the dataset consists of descriptions for images and scenes. The primary objective was to prevent diffusion models from generating NSFW content, but the model can also be used for other moderation purposes.
64
+
65
+ ### Training hyperparameters
66
+
67
+ The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
68
+ - learning_rate: 1e-05
69
+ - train_batch_size: 32
70
+ - eval_batch_size: 32
71
+ - seed: 42
72
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
73
+ - lr_scheduler_type: linear
74
+ - lr_scheduler_warmup_steps: 600
75
+ - num_epochs: 3
76
+ - mixed_precision_training: Native AMP
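As a rough sketch, these settings map onto `transformers.TrainingArguments` as follows; `output_dir` is a placeholder, and Adam's betas and epsilon match the library defaults:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-nsfw",   # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=600,
    num_train_epochs=3,
    fp16=True,                      # mixed precision (Native AMP)
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 are the defaults.
)
```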
77
+
78
+ ### Training results
79
+
80
+ | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | F-beta (β=1.6) | False positive rate | False negative rate | Precision | Recall |
81
+ |:-------------:|:------:|:-----:|:---------------:|:--------:|:------:|:---------:|:-------------------:|:-------------------:|:---------:|:------:|
82
+ | 0.3367 | 0.0998 | 586 | 0.1227 | 0.9586 | 0.9448 | 0.9447 | 0.0331 | 0.0554 | 0.9450 | 0.9446 |
83
+ | 0.0998 | 0.1997 | 1172 | 0.0919 | 0.9705 | 0.9606 | 0.9595 | 0.0221 | 0.0419 | 0.9631 | 0.9581 |
84
+ | 0.0896 | 0.2995 | 1758 | 0.0900 | 0.9730 | 0.9638 | 0.9600 | 0.0163 | 0.0448 | 0.9724 | 0.9552 |
85
+ | 0.087 | 0.3994 | 2344 | 0.0820 | 0.9743 | 0.9657 | 0.9646 | 0.0191 | 0.0367 | 0.9681 | 0.9633 |
86
+ | 0.0806 | 0.4992 | 2930 | 0.0717 | 0.9752 | 0.9672 | 0.9713 | 0.0256 | 0.0235 | 0.9582 | 0.9765 |
87
+ | 0.0741 | 0.5991 | 3516 | 0.0741 | 0.9753 | 0.9674 | 0.9712 | 0.0251 | 0.0240 | 0.9589 | 0.9760 |
88
+ | 0.0747 | 0.6989 | 4102 | 0.0689 | 0.9773 | 0.9697 | 0.9696 | 0.0181 | 0.0305 | 0.9699 | 0.9695 |
89
+ | 0.0707 | 0.7988 | 4688 | 0.0738 | 0.9781 | 0.9706 | 0.9678 | 0.0137 | 0.0356 | 0.9769 | 0.9644 |
90
+ | 0.0644 | 0.8986 | 5274 | 0.0682 | 0.9796 | 0.9728 | 0.9708 | 0.0135 | 0.0317 | 0.9773 | 0.9683 |
91
+ | 0.0688 | 0.9985 | 5860 | 0.0658 | 0.9798 | 0.9730 | 0.9718 | 0.0144 | 0.0298 | 0.9758 | 0.9702 |
92
+ | 0.0462 | 1.0983 | 6446 | 0.0682 | 0.9800 | 0.9733 | 0.9723 | 0.0146 | 0.0290 | 0.9756 | 0.9710 |
93
+ | 0.0498 | 1.1982 | 7032 | 0.0706 | 0.9800 | 0.9733 | 0.9717 | 0.0138 | 0.0303 | 0.9768 | 0.9697 |
94
+ | 0.0484 | 1.2980 | 7618 | 0.0773 | 0.9797 | 0.9728 | 0.9696 | 0.0117 | 0.0345 | 0.9802 | 0.9655 |
95
+ | 0.0483 | 1.3979 | 8204 | 0.0676 | 0.9800 | 0.9734 | 0.9742 | 0.0172 | 0.0248 | 0.9715 | 0.9752 |
96
+ | 0.0481 | 1.4977 | 8790 | 0.0678 | 0.9798 | 0.9731 | 0.9737 | 0.0170 | 0.0255 | 0.9717 | 0.9745 |
97
+ | 0.0474 | 1.5975 | 9376 | 0.0665 | 0.9782 | 0.9713 | 0.9755 | 0.0234 | 0.0191 | 0.9618 | 0.9809 |
98
+ | 0.0432 | 1.6974 | 9962 | 0.0691 | 0.9787 | 0.9718 | 0.9748 | 0.0213 | 0.0213 | 0.9651 | 0.9787 |
99
+ | 0.0439 | 1.7972 | 10548 | 0.0683 | 0.9811 | 0.9748 | 0.9747 | 0.0150 | 0.0254 | 0.9750 | 0.9746 |
100
+ | 0.0442 | 1.8971 | 11134 | 0.0710 | 0.9809 | 0.9744 | 0.9719 | 0.0118 | 0.0313 | 0.9802 | 0.9687 |
101
+ | 0.0425 | 1.9969 | 11720 | 0.0671 | 0.9810 | 0.9747 | 0.9756 | 0.0165 | 0.0232 | 0.9726 | 0.9768 |
102
+ | 0.0299 | 2.0968 | 12306 | 0.0723 | 0.9802 | 0.9738 | 0.9758 | 0.0187 | 0.0217 | 0.9692 | 0.9783 |
103
+ | 0.0312 | 2.1966 | 12892 | 0.0790 | 0.9804 | 0.9738 | 0.9731 | 0.0146 | 0.0279 | 0.9755 | 0.9721 |
104
+ | 0.0266 | 2.2965 | 13478 | 0.0840 | 0.9815 | 0.9752 | 0.9728 | 0.0115 | 0.0302 | 0.9806 | 0.9698 |
105
+ | 0.0277 | 2.3963 | 14064 | 0.0742 | 0.9808 | 0.9746 | 0.9770 | 0.0188 | 0.0199 | 0.9690 | 0.9801 |
106
+ | 0.0294 | 2.4962 | 14650 | 0.0764 | 0.9809 | 0.9747 | 0.9765 | 0.0179 | 0.0211 | 0.9705 | 0.9789 |
107
+ | 0.0304 | 2.5960 | 15236 | 0.0795 | 0.9811 | 0.9748 | 0.9742 | 0.0142 | 0.0266 | 0.9763 | 0.9734 |
108
+ | 0.0287 | 2.6959 | 15822 | 0.0783 | 0.9814 | 0.9751 | 0.9741 | 0.0134 | 0.0272 | 0.9775 | 0.9728 |
109
+ | 0.0267 | 2.7957 | 16408 | 0.0805 | 0.9814 | 0.9751 | 0.9740 | 0.0133 | 0.0274 | 0.9777 | 0.9726 |
110
+ | 0.0318 | 2.8956 | 16994 | 0.0767 | 0.9814 | 0.9752 | 0.9756 | 0.0154 | 0.0240 | 0.9744 | 0.9760 |
111
+ | 0.0305 | 2.9954 | 17580 | 0.0779 | 0.9815 | 0.9753 | 0.9751 | 0.0146 | 0.0251 | 0.9757 | 0.9749 |
112
+
113
+ We selected the checkpoint with the highest F-beta (β=1.6) score.
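For reference, F-beta generalizes F1 by weighting recall more heavily when β > 1; a minimal sketch of the computation used to compare checkpoints:

```python
def f_beta(precision: float, recall: float, beta: float = 1.6) -> float:
    # F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta > 1 favors recall.
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Final checkpoint in the table above (precision 0.9757, recall 0.9749):
print(round(f_beta(0.9757, 0.9749), 4))  # 0.9751, matching the F-beta column
```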
114
+
115
+ ### Framework versions
116
+
117
+ - Transformers 4.40.1
118
+ - Pytorch 2.3.0+cu121
119
+ - Datasets 2.19.0
120
+ - Tokenizers 0.19.1
121
 
 
122
 
123
  ## Uses
124
 
 
135
 
136
  ### Recommendations
137
 
 
138
 
139
  Users should be aware of the limitations and biases of the model and use it accordingly. They should also be prepared to handle false positives and false negatives. It is recommended to fine-tune the model for specific downstream tasks and to evaluate its performance on relevant datasets.
140