library_name: transformers
language:
  - en
  - fr
  - it
  - es
  - ru
  - uk
  - tt
  - ar
  - hi
  - ja
  - zh
  - he
  - am
  - de
license: openrail++
datasets:
  - textdetox/multilingual_toxicity_dataset
metrics:
  - f1
base_model:
  - cardiffnlp/twitter-xlm-roberta-large-2022
pipeline_tag: text-classification

Multilingual Toxicity Classifier for 15 Languages (2025)

This is an instance of cardiffnlp/twitter-xlm-roberta-large-2022 fine-tuned for binary toxicity classification on our updated (2025) dataset, textdetox/multilingual_toxicity_dataset.

The model now covers 15 languages from various language families:

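| Language  | Code | F1 Score |
|-----------|------|----------|
| English   | en   | 0.9071   |
| Russian   | ru   | 0.9022   |
| Ukrainian | uk   | 0.9075   |
| German    | de   | 0.6528   |
| Spanish   | es   | 0.7430   |
| Arabic    | ar   | 0.6207   |
| Amharic   | am   | 0.6676   |
| Hindi     | hi   | 0.7171   |
| Chinese   | zh   | 0.6483   |
| Italian   | it   | 0.7597   |
| French    | fr   | 0.9114   |
| Hinglish  | hin  | 0.7051   |
| Hebrew    | he   | 0.8911   |
| Japanese  | ja   | 0.8725   |
| Tatar     | tt   | 0.6542   |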

How to use

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'textdetox/twitter-xlmr-toxicity-classifier'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize the input text into a batch of size 1
batch = tokenizer("You are amazing!", return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    output = model(**batch)

# idx 0 for neutral, idx 1 for toxic
prediction = output.logits.argmax(dim=-1).item()
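
The snippet above stops at the raw model output. As a minimal sketch of one way to turn `output.logits` into labels and toxicity probabilities, the helper below applies a softmax over the two classes. It assumes the index order stated in the comment above (0 = neutral, 1 = toxic). The function name `logits_to_predictions` and the `threshold` parameter are hypothetical and not part of the model or the transformers library, and the example logits are made up for illustration.

```python
import math

# Assumed label order, taken from the comment in the usage snippet:
# index 0 = neutral, index 1 = toxic.
LABELS = ("neutral", "toxic")


def logits_to_predictions(logits, threshold=0.5):
    """Convert rows of [neutral_logit, toxic_logit] into (label, toxic_probability).

    `logits` is a list of two-element lists, e.g. `output.logits.tolist()`.
    `threshold` is a hypothetical knob; the model card does not specify one.
    """
    predictions = []
    for neutral_logit, toxic_logit in logits:
        # Numerically stable softmax over the two classes
        peak = max(neutral_logit, toxic_logit)
        exp_neutral = math.exp(neutral_logit - peak)
        exp_toxic = math.exp(toxic_logit - peak)
        toxic_prob = exp_toxic / (exp_neutral + exp_toxic)
        label = LABELS[1] if toxic_prob >= threshold else LABELS[0]
        predictions.append((label, toxic_prob))
    return predictions


# Made-up logits for illustration; real values come from output.logits.tolist()
example_logits = [[2.0, -1.0], [-0.5, 1.5]]
for label, prob in logits_to_predictions(example_logits):
    print(f"{label}: P(toxic) = {prob:.3f}")
```

With a real model call, passing `output.logits.tolist()` in place of `example_logits` should work for a batch of any size. Given the lower F1 scores reported for some languages, a threshold other than 0.5 may be worth validating on held-out data before relying on it.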

Citation

This model was prepared for the TextDetox 2025 Shared Task evaluation.

A citation will be added soon.