metadata
library_name: transformers
language:
- en
- fr
- it
- es
- ru
- uk
- tt
- ar
- hi
- ja
- zh
- he
- am
- de
license: openrail++
datasets:
- textdetox/multilingual_toxicity_dataset
metrics:
- f1
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: text-classification
Multilingual Toxicity Classifier for 15 Languages (2025)
This is an instance of xlm-roberta-large fine-tuned for binary toxicity classification on our updated (2025) dataset, textdetox/multilingual_toxicity_dataset.
The model now covers 15 languages from various language families:
Language | Code | F1 Score |
---|---|---|
English | en | 0.9225 |
Russian | ru | 0.9525 |
Ukrainian | uk | 0.96 |
German | de | 0.7325 |
Spanish | es | 0.7125 |
Arabic | ar | 0.6625 |
Amharic | am | 0.5575 |
Hindi | hi | 0.9725 |
Chinese | zh | 0.9175 |
Italian | it | 0.5864 |
French | fr | 0.9235 |
Hinglish | hin | 0.61 |
Hebrew | he | 0.8775 |
Japanese | ja | 0.8773 |
Tatar | tt | 0.5744 |
How to use
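import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = 'textdetox/xlmr-large-toxicity-classifier-v2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference
# Calling the tokenizer returns both input_ids and attention_mask
batch = tokenizer("You are amazing!", return_tensors="pt")
with torch.no_grad():  # no gradients needed at inference time
    logits = model(**batch).logits
# idx 0 for neutral, idx 1 for toxic
predicted_idx = logits.argmax(dim=-1).item()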
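The snippet above stops at the raw model output. As a minimal sketch of one way to turn logits into readable labels, the helper below applies a softmax to each row and thresholds the toxic probability. It is written in plain Python so it can run without the model, and it assumes the logits have first been converted to nested lists (for example with the tensor's `.tolist()` method). The names `LABELS`, `decode`, and the `threshold` parameter are illustrative and not part of the model or the library. The label order follows the comment in the usage code.

```python
import math

# Label order taken from the usage snippet: idx 0 is neutral, idx 1 is toxic.
LABELS = ("neutral", "toxic")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits_batch, threshold=0.5):
    """Map each [neutral_logit, toxic_logit] row to (label, toxic_probability).

    `threshold` is a hypothetical knob; the model card does not specify one.
    """
    results = []
    for row in logits_batch:
        p_toxic = softmax(row)[1]
        label = LABELS[1] if p_toxic >= threshold else LABELS[0]
        results.append((label, p_toxic))
    return results


# Made-up logits for illustration; in practice pass the model's logits
# converted to nested lists.
example = decode([[2.0, -1.0], [-0.5, 1.5]])
for label, p_toxic in example:
    print(f"{label}\t{p_toxic:.3f}")
```

Raising `threshold` above 0.5 would trade recall for precision on the toxic class. Whether that helps for a given language is something to check on held-out data, not something the reported F1 scores determine.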
Citation
The model was prepared for the TextDetox 2025 Shared Task evaluation.
Citation to be added.