metadata

license: openrail++
language:
  - ru
tags:
  - text-generation-inference
datasets:
  - s-nlp/ru_paradetox
base_model:
  - ai-forever/ruT5-base

This is the baseline detoxification model, trained on the training split of the dataset from the "RUSSE 2022: Russian Text Detoxification Based on Parallel Corpora" competition. The source sentences are toxic Russian messages from the Odnoklassniki, Pikabu, and Twitter platforms. The base model is ruT5 (ai-forever/ruT5-base).

How to use

from transformers import T5ForConditionalGeneration, AutoTokenizer

# The tokenizer is loaded from the base model; the weights come from the detox checkpoint.
base_model_name = 'ai-forever/ruT5-base'
model_name = 's-nlp/ruT5-base-detox'

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Toxic input, roughly "This is complete bullshit!"
input_ids = tokenizer.encode('Это полная хуйня!', return_tensors='pt')
output_ids = model.generate(input_ids, max_length=50, num_return_sequences=1)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
# Это полный бред!  (roughly "This is complete nonsense!")
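
The snippet above handles one sentence at a time. To detoxify several messages, the same calls can be wrapped in a small batching helper. This is a minimal sketch, not part of the original model card: the names `load_detox_model`, `detoxify_batch`, and `batch_size` are hypothetical, and it assumes the tokenizer is loaded from the base model exactly as in the example above.

```python
def load_detox_model(base_model_name='ai-forever/ruT5-base',
                     model_name='s-nlp/ruT5-base-detox'):
    """Load the tokenizer (from the base model) and the detox checkpoint."""
    # Imported here so the helper below can be used without loading transformers.
    from transformers import T5ForConditionalGeneration, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    return tokenizer, model


def detoxify_batch(texts, tokenizer, model, batch_size=8, max_length=50):
    """Detoxify a list of sentences in chunks of `batch_size` (hypothetical helper)."""
    results = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # padding=True pads every sentence in the chunk to the longest one
        # and returns an attention mask alongside the input ids.
        encoded = tokenizer(chunk, return_tensors='pt', padding=True)
        output_ids = model.generate(**encoded, max_length=max_length,
                                    num_return_sequences=1)
        results.extend(tokenizer.batch_decode(output_ids,
                                              skip_special_tokens=True))
    return results


# Example usage (downloads the model weights on first call):
# tokenizer, model = load_detox_model()
# for line in detoxify_batch(['Это полная хуйня!'], tokenizer, model):
#     print(line)
```

Passing the tokenizer and model in as arguments keeps the helper independent of how they were loaded, so the same function works with a model moved to a GPU or a locally cached checkpoint.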

Citation

@article{dementievarusse,
  title={RUSSE-2022: Findings of the First Russian Detoxification Shared Task Based on Parallel Corpora},
  author={Dementieva, Daryna and Logacheva, Varvara and Nikishina, Irina and Fenogenova, Alena and Dale, David and Krotova, Irina and Semenov, Nikita and Shavrina, Tatiana and Panchenko, Alexander}
}

License

This model is licensed under the OpenRAIL++ License, which supports the development of technologies, both industrial and academic, that serve the public good.