---
library_name: transformers
language:
- en
- fr
- it
- es
- ru
- uk
- tt
- ar
- hi
- ja
- zh
- he
- am
- de
license: openrail++
datasets:
- textdetox/multilingual_toxicity_dataset
metrics:
- f1
base_model:
- cis-lmu/glot500-base
pipeline_tag: text-classification
tags:
- toxic
---

## Multilingual Toxicity Classifier for 15 Languages (2025)

This is an instance of [Glot500](https://huggingface.co/cis-lmu/glot500-base) fine-tuned for binary toxicity classification on our updated (2025) dataset, [textdetox/multilingual_toxicity_dataset](https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset).

The model now covers 15 languages from various language families:

| Language  | Code | F1 Score |
|-----------|------|----------|
| English   | en   | 0.9071   |
| Russian   | ru   | 0.9022   |
| Ukrainian | uk   | 0.9075   |
| German    | de   | 0.6528   |
| Spanish   | es   | 0.7430   |
| Arabic    | ar   | 0.6207   |
| Amharic   | am   | 0.6676   |
| Hindi     | hi   | 0.7171   |
| Chinese   | zh   | 0.6483   |
| Italian   | it   | 0.5975   |
| French    | fr   | 0.9125   |
| Hinglish  | hin  | 0.7051   |
| Hebrew    | he   | 0.8911   |
| Japanese  | ja   | 0.9058   |
| Tatar     | tt   | 0.5834   |

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('textdetox/glot500-toxicity-classifier')
model = AutoModelForSequenceClassification.from_pretrained('textdetox/glot500-toxicity-classifier')

# Encode a single sentence as a batch of token IDs
batch = tokenizer.encode("You are amazing!", return_tensors="pt")

output = model(batch)
# output.logits: idx 0 for neutral, idx 1 for toxic
```

## Citation

The model was prepared for the [TextDetox 2025 Shared Task](https://pan.webis.de/clef25/pan25-web/text-detoxification.html) evaluation.

The citation will be added soon.
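
## Interpreting the output

The usage snippet stops at the raw model output, so the following is a minimal sketch of one way to turn a pair of logits into a label and a toxicity probability. It assumes only the index convention stated in the snippet (0 = neutral, 1 = toxic). The helper names `softmax` and `interpret`, the `threshold` parameter, and the example logits are hypothetical and not part of the model or the `transformers` API.

```python
import math

# Index convention taken from the usage snippet: 0 = neutral, 1 = toxic.
LABELS = ("neutral", "toxic")


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits, threshold=0.5):
    """Return (label, toxic_probability) for one pair of logits.

    `threshold` is an illustrative cut-off, not a value from the model card.
    """
    probs = softmax(logits)
    toxic_prob = probs[1]
    label = LABELS[1] if toxic_prob >= threshold else LABELS[0]
    return label, toxic_prob


# Made-up logits for illustration; with the usage snippet, real values
# could be obtained via output.logits[0].tolist().
label, toxic_prob = interpret([2.0, -1.0])
print(label, round(toxic_prob, 4))
```

Working on plain Python lists keeps the sketch independent of the model download. Given the spread of per-language F1 scores in the table, a different threshold may suit some languages better, but that would need to be validated on held-out data.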