Update README.md

---
license: apache-2.0
base_model: distilroberta-base
tags:
- generated_from_trainer
- rejection
- no_answer
- chatgpt
metrics:
- accuracy
- recall
- precision
- f1
model-index:
- name: distilroberta-base-rejection-v1
  results: []
language:
- en
pipeline_tag: text-classification
co2_eq_emissions:
  emissions: 0.07987621556153969
  source: code carbon
  training_type: fine-tuning
datasets:
- argilla/notus-uf-dpo-closest-rejected
---

# Model Card for distilroberta-base-rejection-v1

This model is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base) on multiple combined datasets of rejections from different LLMs and normal responses from RLHF datasets.

It aims to identify rejections in LLMs when the prompt doesn't pass content moderation, classifying inputs into two categories: `0` for normal outputs and `1` for rejection detected.

It achieves the following results on the evaluation set:
- Loss: 0.0544
- Accuracy: 0.9887
- Precision: 0.9279
- F1: 0.9537

## Model details

- **Fine-tuned by:** Laiyer.ai
- **Model type:** distilroberta-base
- **Language(s) (NLP):** English
- **License:** Apache License 2.0
- **Finetuned from model:** [distilroberta-base](https://huggingface.co/distilroberta-base)

## Intended Uses & Limitations

It aims to identify rejections, classifying inputs into two categories: `0` for normal output and `1` for rejection detected.

The model's performance is dependent on the nature and quality of the training data. It might not perform well on text styles or topics not represented in the training set.

Additionally, `distilroberta-base` is a case-sensitive model.
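
For example, the cased tokenizer assigns different token ids to `Sorry` and `sorry`, so differently cased inputs can score differently; a minimal check:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("laiyer/distilroberta-base-rejection-v1")

# The cased vocabulary yields different ids for the two spellings.
print(tokenizer("Sorry, but I can't assist with that.")["input_ids"])
print(tokenizer("sorry, but i can't assist with that.")["input_ids"])
```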

## How to Get Started with the Model

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

tokenizer = AutoTokenizer.from_pretrained("laiyer/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("laiyer/distilroberta-base-rejection-v1")

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

print(classifier("Sorry, but I can't assist with that."))
```
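
If you prefer the raw class index over the pipeline's label dictionary, you can call the model directly; a minimal sketch (class `0` = normal output, `1` = rejection, as described above):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("laiyer/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("laiyer/distilroberta-base-rejection-v1")

inputs = tokenizer("Sorry, but I can't assist with that.", return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = int(logits.argmax(dim=-1))  # 0 = normal, 1 = rejection
print(predicted_class, model.config.id2label[predicted_class])  # label string comes from the model config
```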

### Optimum with ONNX

Loading the model requires the [🤗 Optimum](https://huggingface.co/docs/optimum/index) library to be installed.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("laiyer/distilroberta-base-rejection-v1", subfolder="onnx")
model = ORTModelForSequenceClassification.from_pretrained("laiyer/distilroberta-base-rejection-v1", export=False, subfolder="onnx")

classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
)

print(classifier("Sorry, but I can't assist with that."))
```
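
Here `export=False` assumes the repository already ships ONNX weights in the `onnx` subfolder. For a checkpoint without pre-exported weights, Optimum can convert at load time; a one-line sketch:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification

# export=True converts the PyTorch weights to ONNX at load time (no pre-exported subfolder needed).
model = ORTModelForSequenceClassification.from_pretrained("laiyer/distilroberta-base-rejection-v1", export=True)
```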

### Use in LLM Guard

The [NoRefusal Scanner](https://llm-guard.com/output_scanners/no_refusal/) uses this model to detect whether an output was rejected, which can signal that something is going wrong with the prompt.
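
A hedged sketch of wiring the scanner into LLM Guard, following the scanner interface in the linked docs (the prompt and output strings are illustrative; check your installed version for the exact signature):

```python
from llm_guard.output_scanners import NoRefusal

prompt = "Write instructions for picking a lock."  # illustrative prompt
model_output = "Sorry, but I can't assist with that."  # illustrative LLM output

scanner = NoRefusal()  # uses this model under the hood to classify refusals
sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)

print(is_valid, risk_score)  # a refusal yields is_valid=False and a high risk score
```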

## Training and evaluation data

The model was trained on a custom dataset combined from multiple open-source ones, with ~10% rejections and ~90% normal outputs.

We used the following papers when preparing the datasets:

- [Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs](https://arxiv.org/abs/2308.13387)
- [I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models](https://arxiv.org/abs/2306.03423)

## Training procedure

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy | Recall | Precision | F1     |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
| 0.0219        | 2.0   | 7072  | 0.0312          | 0.9919   | 0.9917 | 0.9434    | 0.9669 |
| 0.0121        | 3.0   | 10608 | 0.0350          | 0.9939   | 0.9905 | 0.9596    | 0.9748 |
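
As a quick consistency check, F1 is the harmonic mean of precision and recall; the epoch-3 row satisfies this:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \cdot 0.9596 \cdot 0.9905}{0.9596 + 0.9905} \approx 0.9748
$$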

### Framework versions

- Transformers 4.36.2
- Pytorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0

## Community

Join our Slack to give us feedback, connect with the maintainers and fellow users, ask questions, get help with package usage or contributions, or engage in discussions about LLM security!

<a href="https://join.slack.com/t/laiyerai/shared_invite/zt-28jv3ci39-sVxXrLs3rQdaN3mIl9IT~w"><img src="https://github.com/laiyer-ai/llm-guard/blob/main/docs/assets/join-our-slack-community.png?raw=true" width="200"></a>

## Citation

```
@misc{distilroberta-base-rejection-v1,
  author = {Laiyer.ai},
  title = {Fine-Tuned DistilRoberta-Base for Rejection Detection in LLM Outputs},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/laiyer/distilroberta-base-rejection-v1},
}
```