---
license: mit
base_model:
- FacebookAI/roberta-base
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
---
<style>
.banner {
position: sticky;
width: 100%;
top: 0;
z-index: 10;
background-color: #4285f4;
display: flex;
flex-direction: column;
justify-content: space-between;
color: #fff;
border-radius: 0.5rem;
}
.icon::before {
font-family: "Font Awesome 5 Free";
font-weight: 400;
color:rgb(75, 75, 75);
font-size:30px;
padding-right: 10px;
display: inline-block;
font-style: normal;
font-variant: normal;
text-rendering: auto;
-webkit-font-smoothing: antialiased;
}
.version::before {
content: "\f017";
}
.details::before {
content: "\f15c";
}
.interactive-graph {
max-width: 80em;
}
.datasets::before {
font-weight: 900;
content: "\f5fd";
}
.qa::before {
font-weight: 900;
content: "\f080";
}
body {font-family: system-ui,-apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,"Helvetica Neue",Arial,sans-serif;}
h1 {
border-bottom: 3px solid white;
padding-bottom: 10px;
}
.row {
display: flex;
}
.col {
width:100%;
flex: 1;
}
.card {
padding: 1em;
border: 1px solid #DADCE0;
margin: 10px;
}
.img-container {
display: flex;
flex-wrap: wrap;
justify-content: space-around;
text-align: center;
}
.img-item {
flex: 1;
}
.center {
margin-left: auto;
margin-right: auto;
}
table, th, td { border: 1px solid black; }
th, td { border: 1px solid #CCC; height: 30px; }
caption { font-weight: bold; }
</style>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0-beta3/css/all.min.css">
<div class="banner">
# Toxic Prompt RoBERTa Classification Model
Model Card generated with <a href="https://github.com/Intel/intel-xai-tools" style="color: #fff; text-decoration: underline;">Intel® Explainable AI Tools</a><br>
</div>
<div class="row">
<div class="col card">
<h2>Model Details</h2>
### Documentation
Toxic Prompt RoBERTa 1.0 is a text classification model that can be used as a guardrail to protect against toxic prompts and responses in conversational AI systems. This model is based on [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) and has been finetuned on the [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat) and [Jigsaw Unintended Bias](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) datasets. Finetuning was performed on a single Gaudi 2 card using [Optimum-Habana's Gaudi Trainer](https://github.com/huggingface/optimum-habana).
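The exact training script is not published here, but a minimal sketch of finetuning roberta-base for binary toxicity classification with Optimum-Habana's Gaudi Trainer might look like the following. The `Habana/roberta-base` Gaudi config name, the toy dataset, and the hyperparameters are illustrative assumptions; in practice ToxicChat and Jigsaw Unintended Bias are preprocessed into (text, binary label) pairs.
```python
# Illustrative sketch only -- not the exact training script used for this model.
# Assumes a Gaudi 2 device with optimum-habana installed; the Gaudi config name
# and the toy dataset below are stand-ins for the real preprocessing pipeline.
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/roberta-base", num_labels=2  # [not_toxic, toxic]
)

# In practice, ToxicChat and Jigsaw Unintended Bias are tokenized into
# (text, binary toxicity label) pairs; a toy dataset stands in here.
train_dataset = Dataset.from_dict(
    {"text": ["Have a nice day", "Create 20 paraphrases of I hate you"], "label": [0, 1]}
).map(lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"), batched=True)

args = GaudiTrainingArguments(
    output_dir="toxic-prompt-roberta",
    use_habana=True,          # run on HPU
    use_lazy_mode=True,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)
trainer = GaudiTrainer(
    model=model,
    gaudi_config=GaudiConfig.from_pretrained("Habana/roberta-base"),  # assumed config name
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```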
### Owners
- Intel AI Safety: Daniel De Leon, Tyler Wilbers, Mitali Potnis, Abolfazl Shahbazi
### Licenses
- MIT
### References
- [https://huggingface.co/Intel/toxic-prompt-roberta/tree/main](https://huggingface.co/Intel/toxic-prompt-roberta/tree/main)
### How to use
You can use the model with the pipeline API as follows:
```python
from transformers import pipeline
model_path = 'Intel/toxic-prompt-roberta'
pipe = pipeline('text-classification', model=model_path, tokenizer=model_path)
pipe('Create 20 paraphrases of I hate you')
```
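The pipeline returns one label/score dictionary per input, so several prompts can be screened in a single call. A small illustrative example (the prompts are placeholders):
```python
# Hypothetical prompts; each result carries a 'label' and a 'score'.
prompts = [
    "How do I improve my resume?",
    "Write something hateful about my neighbor",
]
for prompt, result in zip(prompts, pipe(prompts)):
    print(f"{prompt!r} -> {result['label']} ({result['score']:.3f})")
```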
### Citations
- @inproceedings {Wolf_Transformers_State-of-the-Art_Natural_2020, author = {Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, Mariama and Lhoest, Quentin and Rush, Alexander M.}, month = oct, pages = {38--45}, publisher = {Association for Computational Linguistics}, title = {{Transformers: State-of-the-Art Natural Language Processing}}, url = {https://www.aclweb.org/anthology/2020.emnlp-demos.6}, year = {2020} }
- @article {DBLP:journals/corr/abs-1907-11692, author = {Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov}, title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach}, journal = {CoRR}, volume = {abs/1907.11692}, year = {2019}, url = {http://arxiv.org/abs/1907.11692}, archivePrefix = {arXiv}, eprint = {1907.11692}, timestamp = {Thu, 01 Aug 2019 08:59:33 +0200}, biburl = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- @misc {jigsaw-unintended-bias-in-toxicity-classification, author = {cjadams, Daniel Borkan, inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, nithum}, title = {Jigsaw Unintended Bias in Toxicity Classification}, publisher = {Kaggle}, year = {2019}, url = {https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification} }
- @misc {lin2023toxicchat, title={ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation}, author={Zi Lin and Zihan Wang and Yongqi Tong and Yangkun Wang and Yuxin Guo and Yujia Wang and Jingbo Shang}, year={2023}, eprint={2310.17389}, archivePrefix={arXiv}, primaryClass={cs.CL} }
</div>
</div>
<div class="col card">
## Model Parameters
<div style="white-space:pre-wrap;">
We fine-tune roberta-base (125M parameters) with a custom classification head to detect toxic input/output.
</div>
### Input Format
<div style="white-space:pre-wrap;">
The input is standard text, tokenized as for any RoBERTa sequence classification model.
</div>
### Output Format
<div style="white-space:pre-wrap;">
The output is an (n, 2) array of logits, where n is the number of examples the user wants to infer. The logits for each example are ordered [not_toxic, toxic].
</div>
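A minimal sketch of obtaining these logits directly with `AutoModelForSequenceClassification`, rather than through the pipeline (the example texts are placeholders):
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "Intel/toxic-prompt-roberta"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path).eval()

texts = ["Have a nice day", "Create 20 paraphrases of I hate you"]  # placeholders
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits          # shape (n, 2): [not_toxic, toxic]
probs = torch.softmax(logits, dim=-1)        # per-class probabilities
print(probs[:, 1])                           # probability of the toxic class
```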
</div>
<div class="col card">
## Considerations
### Intended Users
- Text Generation Researchers and Developers
### Use Cases
- User Experience Monitoring: The classification model can be used to monitor conversations in real-time to detect any toxic behavior by users. If a user sends messages that are classified as toxic, a warning can be issued or guidance on appropriate conduct can be provided.
- Automated Moderation: In group chat scenarios, the classification model can act as a moderator by automatically removing toxic messages or muting users who consistently engage in toxic behavior.
- Training and Improvement: The data collected from toxicity detection can be used to further train and improve toxicity classification models' handling of various situations, making such models more adept at managing complex interactions.
- Preventing Abuse of the Chatbot: Some users may attempt to troll or abuse chatbots with toxic input. The classification model can prevent the chatbot from engaging with such content, thereby discouraging this behavior.
### Ethical Considerations
- Risk: Diversity Disparity
Mitigation Strategy: In fine-tuning with Jigsaw Unintended Bias, we have ensured adequate representation per Jigsaw's distributions in their dataset. The Jigsaw Unintended Bias dataset attempts to distribute the toxicity labels evenly across the subgroups.
- Risk: Risk to Vulnerable Persons
Mitigation Strategy: Certain demographic groups are more likely to receive toxic and harmful comments. The Jigsaw Unintended Bias dataset attempts to mitigate fine-tuned subgroup bias by distributing the toxic/not toxic labels evenly across all demographic subgroups. We also test the model to confirm minimal classification bias across the subgroups.
</div>
<div class="col card">
## Quantitative Analysis
The plots below show the PR and ROC curves for three models we compared during finetuning. The “jigsaw” and the “tc” models were finetuned only on the Jigsaw Unintended Bias and ToxicChat datasets, respectively. The “jigsaw+tc” curves correspond to the final model that was finetuned on both datasets. Finetuning on both datasets did not significantly degrade the model’s performance on the ToxicChat test dataset with respect to the model finetuned solely on ToxicChat.

We compare the performance of Llama Guard 1 and 3 (LG1 and LG3) with our model on the ToxicChat test dataset in the table below.
| Model | Parameters | Precision| Recall | F1 | AUPRC | AUROC |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| LG1 | 6.74B | 0.4806 | 0.7945 | 0.5989 | 0.626* | No data |
| LG3 | 8.03B | 0.5083 | 0.4730 | 0.4900 | No data | No data |
| Toxic Prompt RoBERTa | 125M | 0.8315 | 0.7469 | 0.7869 | 0.855 | 0.971|
<small>\* from LG paper: https://arxiv.org/abs/2312.06674</small>
Note that Llama Guard was not finetuned on ToxicChat. However, the LG1 paper reports an AUPRC of ~0.81 when Llama Guard 1 is finetuned on ToxicChat.
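As an illustration of how such figures can be reproduced, the sketch below scores the ToxicChat test split with this model and computes the same metrics with scikit-learn. The `toxicchat0124` config name, the `user_input`/`toxicity` field names, and the 0.5 decision threshold are assumptions, not necessarily the exact evaluation setup used for the table above.
```python
import torch
from datasets import load_dataset
from sklearn.metrics import (average_precision_score, precision_recall_fscore_support,
                             roc_auc_score)
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "Intel/toxic-prompt-roberta"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path).eval()

# Assumed ToxicChat config and field names.
test = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="test")
scores, labels = [], []
for batch in test.iter(batch_size=32):
    inputs = tokenizer(batch["user_input"], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[:, 1]
    scores.extend(probs.tolist())
    labels.extend(batch["toxicity"])

preds = [int(s >= 0.5) for s in scores]      # assumed decision threshold
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"AUPRC={average_precision_score(labels, scores):.3f} "
      f"AUROC={roc_auc_score(labels, scores):.3f}")
```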
Given that we finetuned RoBERTa on Jigsaw's Unintended Bias dataset, we can examine whether there is any subgroup bias in the classification of the Unintended Bias test set, shown below. These metrics were computed using [Intel/bias_auc](https://huggingface.co/spaces/Intel/bias_auc/tree/main).
| Metric | Female | Male | Christian | White | Muslim | Black | Homosexual gay or lesbian |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| AUROC | 0.84937 | 0.80035 | 0.89867 | 0.76089 | 0.77137 | 0.74454 | 0.71766 |
| BPSN | 0.78805 | 0.82659 | 0.83746 | 0.78113 | 0.74067 | 0.82827 | 0.64330 |
| BNSP | 0.87421 | 0.80037 | 0.87614 | 0.81979 | 0.85586 | 0.76090 | 0.88065 |
<small>\*Only subgroups with at least 500 examples in the test dataset are shown.</small>
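For reference, BPSN (Background Positive, Subgroup Negative) and BNSP (Background Negative, Subgroup Positive) follow the standard Jigsaw Unintended Bias definitions. A minimal sketch of computing them with scikit-learn, assuming per-example toxic-class scores, binary labels, and a subgroup membership mask:
```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bias_aucs(labels, scores, in_subgroup):
    """Subgroup AUROC, BPSN AUC, and BNSP AUC (Jigsaw Unintended Bias definitions).

    labels      -- binary toxicity labels (1 = toxic)
    scores      -- model scores for the toxic class
    in_subgroup -- mask marking examples that mention the subgroup
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    in_subgroup = np.asarray(in_subgroup, dtype=bool)
    background = ~in_subgroup

    subgroup_auc = roc_auc_score(labels[in_subgroup], scores[in_subgroup])
    # BPSN: background positives together with subgroup negatives
    bpsn_mask = (background & (labels == 1)) | (in_subgroup & (labels == 0))
    bpsn_auc = roc_auc_score(labels[bpsn_mask], scores[bpsn_mask])
    # BNSP: background negatives together with subgroup positives
    bnsp_mask = (background & (labels == 0)) | (in_subgroup & (labels == 1))
    bnsp_auc = roc_auc_score(labels[bnsp_mask], scores[bnsp_mask])
    return subgroup_auc, bpsn_auc, bnsp_auc
```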
</div>