lmeribal committed on
Commit
3dde14c
1 Parent(s): e273819

Update README.md

  1. README.md +71 -3
README.md CHANGED
@@ -1,3 +1,71 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ license: cc-by-4.0
+ language:
+ - am
+ - ru
+ - en
+ - uk
+ - de
+ - ar
+ - zh
+ - es
+ - hi
+ datasets:
+ - s-nlp/ru_paradetox
+ - s-nlp/paradetox
+ - textdetox/multilingual_paradetox
+ library_name: transformers
+ pipeline_tag: text2text-generation
+ ---
+
+ # mT0-XL-detox-orpo
+
+ **Resources**:
+
+ * [Paper]()
+ * [GitHub with training scripts and data](https://github.com/s-nlp/multilingual-transformer-detoxification)
+
+ ## Model Information
+ This is a multilingual 3.7B-parameter text detoxification model developed for the [TextDetox 2024 shared task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html) and based on [mT0-xl](https://huggingface.co/bigscience/mt0-xl). The model was trained in two steps: first, full fine-tuning on several parallel text detoxification datasets; second, ORPO alignment on a self-annotated preference dataset collected using toxicity and similarity classifiers. See the paper for more details.
+
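For orientation, the ORPO objective as formulated in the original ORPO paper can be sketched as follows. The README does not state the exact loss variant or the weight $\lambda$ used for this model, so treat the formula as an assumption about the second training step, not a description of it. Here $x$ is the toxic input, $y_w$ the preferred (detoxified) output, and $y_l$ the rejected one.

```latex
\mathcal{L}_{\mathrm{ORPO}}
  = \mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \mathcal{L}_{\mathrm{SFT}} + \lambda \, \mathcal{L}_{\mathrm{OR}} \right],
\qquad
\mathcal{L}_{\mathrm{OR}}
  = -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
```

The first term is the usual negative log-likelihood on the preferred output; the second pushes the odds of the preferred output above those of the rejected one, so no separate reference model is needed.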
31
+ ## Example usage
32
+
33
+ ```python
34
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
35
+
36
+ model = AutoModelForSeq2SeqLM.from_pretrained('s-nlp/mt0-xl-detox-orpo', device_map="auto")
37
+ tokenizer = AutoTokenizer.from_pretrained('s-nlp/mt0-xl-detox-orpo')
38
+
39
+ LANG_PROMPTS = {
40
+ 'zh': '排毒:',
41
+ 'es': 'Desintoxicar: ',
42
+ 'ru': 'Детоксифицируй: ',
43
+ 'ar': 'إزالة السموم: ',
44
+ 'hi': 'विषहरण: ',
45
+ 'uk': 'Детоксифікуй: ',
46
+ 'de': 'Entgiften: ',
47
+ 'am': 'መርዝ መርዝ: ',
48
+ 'en': 'Detoxify: ',
49
+ }
50
+
51
+ def detoxify(text, lang, model, tokenizer):
52
+ encodings = tokenizer(LANG_PROMPTS[lang] + input_text, return_tensors='pt').to(model.device)
53
+
54
+ outputs = model.generate(**encodings.to(model.device),
55
+ max_length=128,
56
+ num_beams=10,
57
+ no_repeat_ngram_size=3,
58
+ repetition_penalty=1.2,
59
+ num_beam_groups=5,
60
+ diversity_penalty=2.5,
61
+ num_return_sequences=5,
62
+ early_stopping=True,
63
+ )
64
+
65
+ return tokenizer.batch_decode(outputs, skip_special_tokens=True)
66
+ ```
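The snippet above assumes a supported language code and a consistent set of beam-search arguments. A minimal sketch of two guard helpers follows; `build_prompt` and `check_generation_args` are hypothetical names that are not part of the model card or of `transformers`, and the checks reflect the constraints `generate` is expected to enforce for group beam search (the beam count divisible by the group count, and no more returned sequences than beams).

```python
# Prompt prefixes copied from the LANG_PROMPTS table in the README.
LANG_PROMPTS = {
    'zh': '排毒:',
    'es': 'Desintoxicar: ',
    'ru': 'Детоксифицируй: ',
    'ar': 'إزالة السموم: ',
    'hi': 'विषहरण: ',
    'uk': 'Детоксифікуй: ',
    'de': 'Entgiften: ',
    'am': 'መርዝ መርዝ: ',
    'en': 'Detoxify: ',
}


def build_prompt(text: str, lang: str) -> str:
    """Hypothetical helper: prepend the language prefix, failing early on unknown codes."""
    if lang not in LANG_PROMPTS:
        supported = ', '.join(sorted(LANG_PROMPTS))
        raise ValueError(f"Unsupported language {lang!r}; expected one of: {supported}")
    return LANG_PROMPTS[lang] + text.strip()


def check_generation_args(num_beams: int, num_beam_groups: int, num_return_sequences: int) -> None:
    """Hypothetical helper: validate the beam-search settings before calling generate."""
    if num_beams % num_beam_groups != 0:
        raise ValueError("num_beams must be divisible by num_beam_groups")
    if num_return_sequences > num_beams:
        raise ValueError("num_return_sequences must not exceed num_beams")


# The settings used in the README (10 beams, 5 groups, 5 returned candidates) pass the check.
check_generation_args(num_beams=10, num_beam_groups=5, num_return_sequences=5)
prompt = build_prompt("  some input text ", "en")
```

With the model and tokenizer loaded as in the README, `prompt` can be tokenized and passed to `model.generate` in place of `LANG_PROMPTS[lang] + text`.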
+
+ ## Human evaluation
+
+
+ ## Automatic evaluation