---
license: cc-by-4.0
language:
- am
- ru
- en
- uk
- de
- ar
- zh
- es
- hi
datasets:
- s-nlp/ru_paradetox
- s-nlp/paradetox
- textdetox/multilingual_paradetox
library_name: transformers
pipeline_tag: text2text-generation
---

# mT0-XL-detox-orpo

**Resources**:
* [Paper]()
* [GitHub with training scripts and data](https://github.com/s-nlp/multilingual-transformer-detoxification)

## Model Information
This is a multilingual 3.7B-parameter text detoxification model developed for the [TextDetox 2024 shared task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html) and based on [mT0-xl](https://huggingface.co/bigscience/mt0-xl). The model was trained in two steps: first, full fine-tuning on several parallel text detoxification datasets; second, ORPO alignment on a self-annotated preference dataset collected using toxicity and similarity classifiers. See the paper for more details.
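
The preference-collection step mentioned above can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' pipeline: `build_preference_pair`, `toy_toxicity`, and `toy_similarity` are hypothetical names, the two toy scorers stand in for real toxicity and similarity classifiers, and ranking by `(1 - toxicity) * similarity` is an assumed scoring rule. The `prompt`/`chosen`/`rejected` keys are likewise an assumption about the record layout. The paper describes the actual procedure.

```python
from typing import Callable, Dict, List, Optional


def build_preference_pair(
    source: str,
    candidates: List[str],
    toxicity_fn: Callable[[str], float],
    similarity_fn: Callable[[str, str], float],
) -> Optional[Dict[str, str]]:
    """Pair the best- and worst-scoring candidate for one toxic input.

    Assumed scoring rule (not taken from the paper):
    score = (1 - toxicity) * similarity to the source.
    """
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    if len(unique) < 2:
        return None  # a preference pair needs two distinct texts
    ranked = sorted(
        unique,
        key=lambda c: (1.0 - toxicity_fn(c)) * similarity_fn(source, c),
        reverse=True,
    )
    return {"prompt": source, "chosen": ranked[0], "rejected": ranked[-1]}


# Toy stand-ins so the sketch runs without downloading any classifier.
BAD_WORDS = {"stupid", "idiot"}


def toy_toxicity(text: str) -> float:
    """Return 1.0 if any listed word occurs in the text, else 0.0."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return 1.0 if words & BAD_WORDS else 0.0


def toy_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)


pair = build_preference_pair(
    "this is a stupid idea",
    ["this is a stupid idea", "this is a bad idea", "this is not a good idea at all"],
    toy_toxicity,
    toy_similarity,
)
print(pair)
```

In a real setup the two scoring functions would wrap trained classifiers, and the candidates would come from sampling the fine-tuned model several times per input.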

## Example usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# device_map="auto" requires the `accelerate` package to be installed.
model = AutoModelForSeq2SeqLM.from_pretrained('s-nlp/mt0-xl-detox-orpo', device_map="auto")
tokenizer = AutoTokenizer.from_pretrained('s-nlp/mt0-xl-detox-orpo')

# Language-specific prompt prefixes prepended to the input text.
LANG_PROMPTS = {
    'zh': '排毒:',
    'es': 'Desintoxicar: ',
    'ru': 'Детоксифицируй: ',
    'ar': 'إزالة السموم: ',
    'hi': 'विषहरण: ',
    'uk': 'Детоксифікуй: ',
    'de': 'Entgiften: ',
    'am': 'መርዝ መርዝ: ',
    'en': 'Detoxify: ',
}

def detoxify(text, lang, model, tokenizer):
    """Return five detoxified candidates for `text` written in language `lang`."""
    # Prepend the language prompt and move the encoded inputs to the model's device.
    encodings = tokenizer(LANG_PROMPTS[lang] + text, return_tensors='pt').to(model.device)

    # Diverse (group) beam search: 10 beams in 5 groups, returning 5 candidates.
    outputs = model.generate(**encodings,
                             max_length=128,
                             num_beams=10,
                             no_repeat_ngram_size=3,
                             repetition_penalty=1.2,
                             num_beam_groups=5,
                             diversity_penalty=2.5,
                             num_return_sequences=5,
                             early_stopping=True,
                             )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
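
Because `detoxify` returns five candidates (`num_return_sequences=5`), a caller typically has to choose one of them. The sketch below shows one simple way to do so, assuming the heuristic "take the first non-empty candidate that is not just a copy of the input". `pick_candidate` is a hypothetical helper and not part of the model repository. A stronger selection could rescore candidates with toxicity and similarity classifiers instead.

```python
from typing import List


def pick_candidate(source: str, candidates: List[str]) -> str:
    """Return the first non-empty candidate that differs from the source text."""
    normalized_source = source.strip().lower()
    for cand in candidates:
        cleaned = cand.strip()
        # Skip empty outputs and outputs that merely echo the input.
        if cleaned and cleaned.lower() != normalized_source:
            return cleaned
    # Fall back to the input if nothing usable was generated.
    return source.strip()


# Intended use together with the function above (requires the loaded model):
# candidates = detoxify(text, 'en', model, tokenizer)
# best = pick_candidate(text, candidates)
```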

## Human evaluation


## Automatic evaluation