---
license: cc-by-4.0
language:
- am
- ru
- en
- uk
- de
- ar
- zh
- es
- hi
datasets:
- s-nlp/ru_paradetox
- s-nlp/paradetox
- textdetox/multilingual_paradetox
library_name: transformers
pipeline_tag: text2text-generation
---

# mT0-XL-detox-orpo

**Resources**:

* [Paper]()
* [GitHub with training scripts and data](https://github.com/s-nlp/multilingual-transformer-detoxification)

## Model Information

This is a multilingual 3.7B-parameter text detoxification model developed for the [TextDetox 2024 shared task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html) and based on [mT0-xl](https://huggingface.co/bigscience/mt0-xl). The model was trained in a two-step setup: first, full fine-tuning on several parallel text detoxification datasets; second, ORPO alignment on a self-annotated preference dataset collected using toxicity and similarity classifiers. See the paper for more details.

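The sketch below illustrates how such a preference pair might be assembled. It is an assumption-laden illustration, not the authors' exact pipeline: the toxicity checkpoint and the LaBSE similarity model are stand-ins, and the scoring rule is made up for clarity; the actual collection scripts are in the GitHub repository linked above.

```python
# Hypothetical sketch: score each candidate rewrite for toxicity and for
# similarity to the source, then keep the best and worst candidates as a
# (chosen, rejected) pair for ORPO alignment.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Both checkpoints are stand-ins; substitute the classifiers used in the paper.
toxicity_clf = pipeline('text-classification', model='textdetox/xlmr-large-toxicity-classifier')
embedder = SentenceTransformer('sentence-transformers/LaBSE')

def make_preference_pair(source, candidates):
    src_emb = embedder.encode(source, convert_to_tensor=True)
    scored = []
    for cand in candidates:
        pred = toxicity_clf(cand)[0]
        # Label names vary by checkpoint; treat 'toxic' as the positive class here.
        p_toxic = pred['score'] if pred['label'].lower() == 'toxic' else 1.0 - pred['score']
        sim = util.cos_sim(src_emb, embedder.encode(cand, convert_to_tensor=True)).item()
        # Reward rewrites that are non-toxic and preserve the original content.
        scored.append(((1.0 - p_toxic) * sim, cand))
    scored.sort(key=lambda pair: pair[0])
    return {'prompt': source, 'chosen': scored[-1][1], 'rejected': scored[0][1]}
```
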
## Example usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('s-nlp/mt0-xl-detox-orpo', device_map="auto")
tokenizer = AutoTokenizer.from_pretrained('s-nlp/mt0-xl-detox-orpo')

# Language-specific detoxification prompts prepended to the input text.
LANG_PROMPTS = {
    'zh': '排毒:',
    'es': 'Desintoxicar: ',
    'ru': 'Детоксифицируй: ',
    'ar': 'إزالة السموم: ',
    'hi': 'विषहरण: ',
    'uk': 'Детоксифікуй: ',
    'de': 'Entgiften: ',
    'am': 'መርዝ መርዝ: ',
    'en': 'Detoxify: ',
}

def detoxify(text, lang, model, tokenizer):
    # Prepend the language prompt and move the inputs to the model's device.
    encodings = tokenizer(LANG_PROMPTS[lang] + text, return_tensors='pt').to(model.device)

    # Diverse beam search: 10 beams in 5 groups, returning 5 candidate rewrites.
    outputs = model.generate(**encodings,
                             max_length=128,
                             num_beams=10,
                             no_repeat_ngram_size=3,
                             repetition_penalty=1.2,
                             num_beam_groups=5,
                             diversity_penalty=2.5,
                             num_return_sequences=5,
                             early_stopping=True,
                             )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

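For example, with the model, tokenizer, and `detoxify` helper defined above (the input sentence is purely illustrative):

```python
candidates = detoxify("You write like a complete idiot.", 'en', model, tokenizer)
print(candidates[0])  # the highest-scoring of the five returned rewrites
```
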
## Human evaluation

## Automatic evaluation