Description

NB: this model is an improved version of EIStakovskii/french_toxicity_classifier_plus. To see the training source code and the data, please follow the GitHub link.

This model was trained for toxicity labeling.

The model was fine-tuned from the CamemBERT language model.

To use the model:

from transformers import pipeline

classifier = pipeline("text-classification", model="EIStakovskii/french_toxicity_classifier_plus_v2")

print(classifier("Foutez le camp d'ici!"))
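The pipeline returns one dictionary per input with a `label` and a `score`. Below is a minimal sketch of turning that output into boolean flags for a batch of sentences. The helper name `flag_toxic`, the `threshold` parameter, and the label name `"LABEL_1"` for the toxic class are assumptions, not part of this model card; check the model's `config.id2label` before relying on the label name.

```python
from typing import Callable, Dict, List

# Assumption: the toxic class is exposed as "LABEL_1".
# Verify against the model's config.id2label before use.
TOXIC_LABEL = "LABEL_1"


def flag_toxic(
    classifier: Callable[[List[str]], List[Dict]],
    texts: List[str],
    toxic_label: str = TOXIC_LABEL,
    threshold: float = 0.5,
) -> List[bool]:
    """Return True for each text whose top label is toxic with enough confidence.

    `classifier` is any callable that behaves like a transformers
    text-classification pipeline: given a list of strings, it returns a list
    of {"label": ..., "score": ...} dictionaries, one per input.
    """
    results = classifier(texts)
    return [
        res["label"] == toxic_label and res["score"] >= threshold
        for res in results
    ]
```

With the `classifier` object created above, a call would look like `flag_toxic(classifier, ["Foutez le camp d'ici!", "Bonne journée !"])`. Raising `threshold` trades recall for precision.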

Metrics (at validation):

epoch step eval_accuracy eval_f1 eval_loss
1.16 1600 0.9015412511332729 0.8968269048071442 0.3014959990978241

Comparison against Perspective

This model was compared against Google's Perspective API, which also detects toxicity. Both models were tested on two datasets, one of 200 sentences and one of 400 sentences. The first (arguably harder) was drawn from the JigSaw and DeTox datasets. The second (easier) combines several sources: JigSaw and DeTox, Paradetox translations, and sentences extracted from Reverso Context by keyword.

french_toxicity_classifier_plus_v2

size accuracy f1
200 0.783 0.803
400 0.890 0.879

Perspective

size accuracy f1
200 0.826 0.795
**400 0.632 0.418

**I suspect that Perspective scores so low on the FR dataset (400) because it does not trigger on the words "merde" and "putain", or on some rarer French words such as "cul".
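For readers who want to reproduce this kind of comparison on their own labeled sentences, here is a minimal sketch of computing accuracy and F1 from gold labels and predictions. The card does not say which F1 variant was used, so this assumes binary F1 with the toxic class as the positive class; the function name `accuracy_and_f1` is hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    positive: int = 1,
) -> Tuple[float, float]:
    """Compute accuracy and binary F1 (assumption: toxic == `positive`)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return accuracy, f1
```

Running the same function over the predictions of both systems on the same sentences gives directly comparable numbers. A system that rarely triggers on toxic sentences shows up as low recall and therefore low F1, even when its accuracy looks moderate.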
