|
--- |
|
library_name: transformers |
|
language: |
|
- en |
|
- fr |
|
- it |
|
- es |
|
- ru |
|
- uk |
|
- tt |
|
- ar |
|
- hi |
|
- ja |
|
- zh |
|
- he |
|
- am |
|
- de |
|
license: openrail++ |
|
datasets: |
|
- textdetox/multilingual_toxicity_dataset |
|
metrics: |
|
- f1 |
|
base_model: |
|
- FacebookAI/xlm-roberta-large |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
## Multilingual Toxicity Classifier for 15 Languages (2025) |
|
|
|
This is an instance of [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) fine-tuned on a binary toxicity classification task using our updated (2025) dataset [textdetox/multilingual_toxicity_dataset](https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset).
|
|
|
The model now covers 15 languages from various language families:
|
|
|
| Language | Code | F1 Score | |
|
|-----------|------|---------| |
|
| English | en | 0.9225 | |
|
| Russian | ru | 0.9525 | |
|
| Ukrainian | uk | 0.9600 |
|
| German | de | 0.7325 | |
|
| Spanish | es | 0.7125 | |
|
| Arabic | ar | 0.6625 | |
|
| Amharic | am | 0.5575 | |
|
| Hindi | hi | 0.9725 | |
|
| Chinese | zh | 0.9175 | |
|
| Italian | it | 0.5864 | |
|
| French | fr | 0.9235 | |
|
| Hinglish | hin | 0.6100 |
|
| Hebrew | he | 0.8775 | |
|
| Japanese | ja | 0.8773 | |
|
| Tatar | tt | 0.5744 | |
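
The scores above are F1 on the toxic class. As a reference for how such a score is computed, here is a minimal sketch of binary F1 with toy labels (illustrative values, not the actual evaluation data):

```python
def binary_f1(y_true, y_pred):
    # counts over the positive (toxic = 1) class
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# toy example: 2 of 3 toxic labels recovered, 1 false positive
print(binary_f1([1, 1, 0, 1, 0], [1, 1, 1, 0, 0]))  # → 0.666...
```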
|
|
|
## How to use |
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('textdetox/xlmr-large-toxicity-classifier-v2')
model = AutoModelForSequenceClassification.from_pretrained('textdetox/xlmr-large-toxicity-classifier-v2')

# tokenize the input, returning input ids and attention mask
batch = tokenizer("You are amazing!", return_tensors="pt")

with torch.no_grad():
    output = model(**batch)

# idx 0 for neutral, idx 1 for toxic
predicted_class = output.logits.argmax(dim=-1).item()
```
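
If you need a confidence score rather than just the argmax, the logits can be mapped to class probabilities with a softmax. A minimal sketch, using illustrative logit values in place of actual model output:

```python
import torch

# illustrative logits for one input (assumed values, not real model output):
# index 0 = neutral, index 1 = toxic
logits = torch.tensor([[2.1, -1.3]])

# softmax over the class dimension yields probabilities that sum to 1
probs = torch.softmax(logits, dim=-1)
label = "toxic" if probs[0, 1] > 0.5 else "neutral"
print(label, probs[0].tolist())
```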
|
|
|
## Citation |
|
The model was prepared for the [TextDetox 2025 Shared Task](https://pan.webis.de/clef25/pan25-web/text-detoxification.html) evaluation.
|
|
|
Citation to be added soon.