---
datasets:
- s-nlp/paradetox
- s-nlp/ru_paradetox
language:
- ru
- en
library_name: transformers
pipeline_tag: text2text-generation
license: openrail++
---

## Model Description

This is the model presented in the paper "Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification".

The model is based on [mBART-large-50](https://huggingface.co/facebook/mbart-large-50) and fine-tuned on two parallel detoxification corpora: [ParaDetox](https://huggingface.co/datasets/s-nlp/paradetox) and [RuDetox](https://github.com/s-nlp/russe_detox_2022/tree/main/data). More details about the model can be found in the paper.

## Usage

1. Model loading.

```python
from transformers import MBartForConditionalGeneration, AutoTokenizer

# Load the detoxification checkpoint; .cuda() assumes a GPU is available
# (drop it to run on CPU). The tokenizer is the base mBART-50 one.
model = MBartForConditionalGeneration.from_pretrained("s-nlp/mbart-detox-en-ru").cuda()
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")
```
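
The base tokenizer does not set a target language out of the box, while the utility below reads `tokenizer.tgt_lang` to force the language of the output. Detoxification keeps the language of the input, so one way to configure it (a sketch assuming English input; use `"ru_RU"` for Russian) is:

```python
# Assumed configuration, not part of the original snippet: the detoxified
# output stays in the language of the input, here English.
tokenizer.src_lang = "en_XX"  # language code of the toxic input
tokenizer.tgt_lang = "en_XX"  # language the decoder is forced to generate
```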

2. Detoxification utility.

```python
def paraphrase(text, model, tokenizer, n=None, max_length="auto", beams=3):
    """Detoxify a string or a list of strings and return the same shape back."""
    texts = [text] if isinstance(text, str) else text
    # Keep the attention mask so padded positions are ignored in batched input.
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
    if max_length == "auto":
        max_length = inputs["input_ids"].shape[1] + 10

    result = model.generate(
        **inputs,
        num_return_sequences=n or 1,  # n must not exceed `beams`
        do_sample=True,
        temperature=1.0,
        repetition_penalty=10.0,
        max_length=max_length,
        min_length=int(0.5 * max_length),
        num_beams=beams,
        # Force the first generated token to be the target-language code
        # (requires tokenizer.tgt_lang to be set, see above).
        forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],
    )
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]

    if not n and isinstance(text, str):
        return texts[0]
    return texts
```
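
For example (the inputs below are made-up placeholders; any English sentence works the same way, or Russian with the `"ru_RU"` codes):

```python
# Hypothetical example calls, reusing the model and tokenizer created above.
print(paraphrase("you are a complete idiot", model, tokenizer))

# A list input returns a list of detoxified strings.
print(paraphrase(["first toxic sentence", "second toxic sentence"], model, tokenizer))
```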

## Citation

TBD