s-nlp/mbart-detox-en-ru

	---
	datasets:
	- s-nlp/paradetox
	- s-nlp/ru_paradetox
	language:
	- ru
	- en
	library_name: transformers
	pipeline_tag: text2text-generation
	license: openrail++
	---

	## Model Description

	This model performs text detoxification for English and Russian. It was presented in the paper "Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification".

	The model is based on [mBART-large-50](https://huggingface.co/facebook/mbart-large-50) and was trained on two parallel detoxification corpora: [ParaDetox](https://huggingface.co/datasets/s-nlp/paradetox) (English) and [RuDetox](https://github.com/s-nlp/russe_detox_2022/tree/main/data) (Russian). See the paper for further details.


	## Usage

	1. Model loading.
	```python
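	import torch
	from transformers import MBartForConditionalGeneration, AutoTokenizer

	# Use a GPU when one is available; otherwise fall back to CPU.
	device = "cuda" if torch.cuda.is_available() else "cpu"
	model = MBartForConditionalGeneration.from_pretrained("s-nlp/mbart-detox-en-ru").to(device)
	tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")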
	```
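The utility in step 2 reads `tokenizer.tgt_lang` to choose the forced first token, so the language codes likely need to be set on the tokenizer before generating. Below is a minimal sketch, assuming monolingual detoxification (source and target language are the same) and the standard mBART-50 codes `en_XX` and `ru_RU`. The helper name `set_language` and the `LANG_CODES` mapping are illustrative and not part of the original card.

```python
# Hypothetical mapping from short language names to mBART-50 language codes.
LANG_CODES = {"en": "en_XX", "ru": "ru_RU"}


def set_language(tokenizer, lang):
    """Set source and target language on the tokenizer (assumes monolingual use)."""
    code = LANG_CODES[lang]  # raises KeyError for languages other than "en" / "ru"
    tokenizer.src_lang = code
    tokenizer.tgt_lang = code
    return tokenizer
```

For example, calling `set_language(tokenizer, "ru")` before detoxifying Russian text.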

	2. Detoxification utility.
	```python
	def paraphrase(text, model, tokenizer, n=None, max_length="auto", beams=3):
	    """Detoxify a string or a list of strings with beam-sampled generation."""
	    texts = [text] if isinstance(text, str) else text
	    inputs = tokenizer(texts, return_tensors="pt", padding=True)["input_ids"].to(
	        model.device
	    )
	    if max_length == "auto":
	        # Allow the output to be slightly longer than the padded input.
	        max_length = inputs.shape[1] + 10

	    result = model.generate(
	        inputs,
	        num_return_sequences=n or 1,
	        do_sample=True,
	        temperature=1.0,
	        repetition_penalty=10.0,
	        max_length=max_length,
	        min_length=int(0.5 * max_length),
	        num_beams=beams,
	        # tokenizer.tgt_lang must be set beforehand, e.g. "en_XX" or "ru_RU".
	        forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],
	    )
	    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]

	    # Return a bare string only for a single input with no explicit n.
	    if not n and isinstance(text, str):
	        return texts[0]
	    return texts
	```
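Because `paraphrase` pads every input in a call to the length of the longest text, long lists are usually easier to process in small batches. The sketch below shows one way to do that. The name `detoxify_in_batches` and the `batch_size` default are assumptions for illustration. The paraphrasing callable is passed in as an argument, so the helper itself does not depend on a loaded model.

```python
def detoxify_in_batches(texts, paraphrase_fn, batch_size=8):
    """Apply paraphrase_fn to texts in chunks of batch_size and collect the outputs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # paraphrase_fn is expected to return one output string per input text.
        results.extend(paraphrase_fn(batch))
    return results
```

With the model and tokenizer loaded as in step 1, it could be called as `detoxify_in_batches(texts, lambda batch: paraphrase(batch, model, tokenizer))`.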


	## Citation


	TBD